Formatting space-separated data to CSV

Date: 2014-02-07 10:16:33

Tags: regex csv awk formatting pretty-print

For quite a while now I have been trying to format space-separated data into a CSV structure.

Initial situation

The initial data looks like this:

Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE    Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment   
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic    Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment   
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center     Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment

It contains a lot of whitespace and unnecessary information. The fields are laid out roughly like this:

Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary book appointment field.

I want to convert it to the following format:

Doctor's name,Specialization,Hospital name,Address,Fees,Schedule

So the data above should end up looking like this:

 Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
 Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM   
 Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM

So far I have managed to remove the Book Appointment field.

Problem

However, I am having trouble separating out the hospital name, because the spacing around it varies a lot. Is this feasible at all?

Edit

The output of cat -A file is as follows:

 Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
 Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
 Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment

2 answers:

Answer 0: (score: 3)

There is no direct way to separate the specialization from the hospital name, but with a few assumptions you can do it with perl:

perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file

which gives:

Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM

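Since this is a Perl-flavoured regex, you can paste it into regex101 and use its regex debugger to see how it works. The regex itself is fairly simple, but the number of parts can make it look intimidating.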
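To make the individual parts easier to see, here is a rough port of the same pattern to Python's verbose regex mode. This is only an illustrative sketch: the answer itself uses perl, the names ROW and to_csv are made up for this example, and it assumes the hospital name is followed by space, tab, space as in the cat -A output.

```python
import re

# Same pattern as the perl one-liner, spread out with one comment per part.
ROW = re.compile(r"""
    ^(\S+\s+\S+\s+\S+)              # 1: name, the first three words ("Dr. Xxx Yyy")
    .+experience\s                  # skip the degree and the years of experience
    ([^\t]+?)                       # 2: specialization, as short as possible
    \s+
    (                               # 3: hospital name, which is either
        \b[A-Z0-9]{2}[^\t]+?        #    text starting with two uppercase letters/digits
      |
        (?:(?!\b[A-Z0-9]{2})[^\t])* #    or text in which no word starts that way
    )
    \s+\t\s+                        # the tab that follows the hospital name
    ([^,]+,)                        # 4: address up to and including the first comma
    .+?                             # skip the city
    (INR.+?PM)                      # 5: fees and schedule (must end in "PM")
    \s+.*                           # drop "Book Appointment"
""", re.VERBOSE)


def to_csv(line):
    """Rewrite one input line the way the perl substitution does."""
    return ROW.sub(r"\1,\2,\3,\4\5", line)
```

Note that group 4 already carries its trailing comma, which is why the replacement has no comma between \4 and \5, and that fees and schedule come out as a single field, exactly as in the output shown above.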

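Caveat: the above separates the specialization from the hospital name based on two rules:

  1. It looks for the first occurrence of whitespace followed by two uppercase letters or digits and, if found, starts matching the hospital name there; or
  2. If no word starts with two consecutive uppercase letters or digits, it simply takes the first word as the specialization and the remaining words as the hospital name.

I know this may not solve the whole problem, since there will always be some lines that don't follow the rules above, but it should get you started on cleaning them up. Where the split is wrong (i.e., when the specialization consists of more than one word and the hospital name does not start with two consecutive uppercase letters or digits), you will get one word of the specialization placed correctly and the rest in the hospital name.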
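The two rules can also be written out directly, without the surrounding regex, which makes the failure mode easy to reproduce. The following is a minimal Python sketch, not part of the perl command; split_specialization and HOSPITAL_START are hypothetical names, and the input is assumed to be just the text between "experience" and the tab.

```python
import re

# Rule 1: whitespace followed by a word that starts with two uppercase
# letters or digits; that word is taken to begin the hospital name.
HOSPITAL_START = re.compile(r"\s+\b(?=[A-Z0-9]{2})")


def split_specialization(text):
    """Return (specialization, hospital) using the two heuristic rules."""
    m = HOSPITAL_START.search(text)
    if m:
        return text[:m.start()], text[m.end():]
    # Rule 2: no such word, so only the first word counts as the specialization.
    first, _, rest = text.partition(" ")
    return first, rest
```

With the sample data, "Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE" splits at SHAKTHI and "Homeopath Sankirana Homeopathic Clinic" falls through to rule 2. A made-up row such as "General Physician Apollo Clinic" shows the mis-split described above: it comes back as ("General", "Physician Apollo Clinic").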

Answer 1: (score: 2)

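Unfortunately, given your input, there is no way to separate the specialization from the hospital name. The other fields can be captured, though not very elegantly, with gawk (probably >= 4.0, but I think 3.x should work):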

$ awk -F" \t " -v OFS="," -v S=" " '
{
    sub(/\s+$/, "");
    split($2, Data, /[ ,]{2,}/);
    Address  = Data[1];
    split($2, Data, / +/);
    nData    = length(Data);
    Schedule = Data[nData - 2];
    Fees     = Data[nData - 4] S Data[nData - 3];
    split($1, Data, / +/);
    Name     = Data[1] S Data[2] S Data[3]; # assume all names are Dr. Xxx Xxx only
    match($1, /[0-9]+ years experience /);
    SpecializationHospital = substr($1, RSTART + RLENGTH);
    print Name, SpecializationHospital, Address, Fees, Schedule;
} ' data.txt
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
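
For comparison, roughly the same field extraction can be sketched in Python, using the csv module so that any field that happens to contain a comma gets quoted. This is an illustrative port under the same assumptions as the awk script (fields separated by space, tab, space; names of the form "Dr. Xxx Yyy"; every line ending in "Book Appointment"); parse_row and rows_to_csv are hypothetical names.

```python
import csv
import io
import re


def parse_row(line):
    """Return [name, specialization + hospital, address, fees, schedule]."""
    # Split at the space-tab-space that follows the hospital name.
    left, right = line.rstrip().split(" \t ", 1)
    address = re.split(r"[ ,]{2,}", right)[0]   # text before the first ", "
    words = right.split()
    schedule = words[-3]                        # last two words: "Book Appointment"
    fees = " ".join(words[-5:-3])               # "INR" plus the amount
    name = " ".join(left.split()[:3])           # assumes "Dr. Xxx Yyy"
    # Assumes every line contains "<n> years experience "; otherwise m is None.
    m = re.search(r"[0-9]+ years experience ", left)
    spec_hospital = left[m.end():]
    return [name, spec_hospital, address, fees, schedule]


def rows_to_csv(lines):
    """Format all lines as CSV text, quoting fields where needed."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        writer.writerow(parse_row(line))
    return out.getvalue()
```

As in the awk version, the specialization and hospital name stay together in one field.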