Processing a CSV file with Awk

Time: 2015-10-16 02:28:25

Tags: linux bash csv awk

I have the following CSV file:

data.csv

Chart #,Ticker,Industry,Last Price,Multiple
2,AFL,Accident & Health Insurance,60.9,0.82
3,UNM,Accident & Health Insurance,32.97,1.52
4,CNO,Accident & Health Insurance,19.33,2.59
2,OMC,Advertising Agencies,71.71,0.7
3,IPG,Advertising Agencies,21.24,2.35
4,ADS,Advertising Agencies,278.18,0.18
2,UPS,Air Delivery & Freight Services,103.8,0.48
3,FDX,Air Delivery & Freight Services,152.11,0.33
4,EXPD,Air Delivery & Freight Services,50.725,0.99
5,CHRW,Air Delivery & Freight Services,72.3,0.69
6,FWRD,Air Delivery & Freight Services,42.86,1.17

I would like to use Awk, or whichever Linux command-line tool is best suited, to make the data in the file look like this:

output.txt

Accident & Health Insurance
2*0.82,3*1.52,4*2.59

Advertising Agencies
2*0.7,3*2.35,4*0.18

Air Delivery & Freight Services
2*0.48,3*0.33,4*0.99,5*0.69,6*1.17

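Basically, for each industry I take every "Chart #" and pair it with its "Multiple" in the form Chart#*Multiple. The "Industry" name goes on one line, all of its chart entries go on the next line separated by commas, and a blank line follows as the third line. The same pattern repeats for the whole list.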

Can someone point me in the right direction on how to do this? Is Awk the best tool for this task, or do I have to write a bash script to handle it?

2 answers:

Answer 0: (score: 4)

awk -F, 'NR>1{a[$3]=a[$3]?a[$3]","$1"*"$NF:$1"*"$NF}END{for(i in a)print i"\n"a[i]}' filename
Air Delivery & Freight Services
2*0.48,3*0.33,4*0.99,5*0.69,6*1.17
Advertising Agencies
2*0.7,3*2.35,4*0.18
Accident & Health Insurance
2*0.82,3*1.52,4*2.59
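
The array-based approach can be spelled out as a commented script. What follows is a minimal sketch, not the answerer's exact code: it skips the header row, and it adds two helper arrays, `seen` and `order` (hypothetical names, not in the original), so that industries print in the order they first appear, with a blank line after each group. It assumes no field contains an embedded comma, and it writes a shortened copy of the question's data to `data.csv` in the current directory.

```shell
# Shortened copy of the sample input from the question.
cat > data.csv <<'EOF'
Chart #,Ticker,Industry,Last Price,Multiple
2,AFL,Accident & Health Insurance,60.9,0.82
3,UNM,Accident & Health Insurance,32.97,1.52
2,OMC,Advertising Agencies,71.71,0.7
3,IPG,Advertising Agencies,21.24,2.35
EOF

awk -F, '
    NR == 1 { next }                      # skip the header row
    !seen[$3]++ { order[++n] = $3 }       # record each industry the first time it appears
    {
        # append "Chart#*Multiple" to the list kept for this industry
        a[$3] = (a[$3] == "" ? "" : a[$3] ",") $1 "*" $NF
    }
    END {
        # walk the industries in first-seen order rather than hash order
        for (i = 1; i <= n; i++)
            print order[i] "\n" a[order[i]] "\n"
    }
' data.csv > output.txt

cat output.txt
```

Like the one-liner, this still holds every group in memory until the END block runs; the extra `order` array only changes the order in which the groups are printed.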

Answer 1: (score: 4)

$ awk -F, -v OFS='\n' -v ORS='\n\n' '
    NR==1 { next }
    (NR>2) && ($3!=prevKey) { print prevKey, prevRec; prevRec="" }
    { prevKey=$3; prevRec=(prevRec==""?"":prevRec",") $1"*"$NF }
    END { print prevKey, prevRec }
' file
Accident & Health Insurance
2*0.82,3*1.52,4*2.59

Advertising Agencies
2*0.7,3*2.35,4*0.18

Air Delivery & Freight Services
2*0.48,3*0.33,4*0.99,5*0.69,6*1.17

The functional differences between the above and @A-Ray's answer are:

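  1. Mine assumes the file is sorted by $3, as in your sample input, whereas A-Ray's does not.
  2. Mine stores in memory only the output string associated with one $3 value at a time, whereas A-Ray's holds the output strings for all $3 values at once.
  3. Mine prints the output in the order in which the $3 values appear in the input file, whereas A-Ray's prints them in "random" order (the order in which their indices are stored in the hash table).
  4. Mine prints a blank line between output records, as in your expected output, whereas A-Ray's does not.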
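
If the input is not already grouped by $3, one way to keep the low-memory streaming approach is to sort on the third field first. The following is a sketch under stated assumptions: the file names `unsorted.csv` and `sorted_output.txt` are made up for illustration, `sort -s` (stable sort, available in GNU and BSD sort) is assumed to be present, and no field contains an embedded comma. Because `tail -n +2` drops the header before sorting, the awk program here has no `NR==1 { next }` rule and tests `NR > 1` instead of `NR > 2`.

```shell
# Hypothetical input whose industries are interleaved rather than grouped.
cat > unsorted.csv <<'EOF'
Chart #,Ticker,Industry,Last Price,Multiple
2,UPS,Air Delivery & Freight Services,103.8,0.48
2,AFL,Accident & Health Insurance,60.9,0.82
3,FDX,Air Delivery & Freight Services,152.11,0.33
3,UNM,Accident & Health Insurance,32.97,1.52
EOF

tail -n +2 unsorted.csv |                 # drop the header row
LC_ALL=C sort -s -t, -k3,3 |              # group by Industry; -s keeps row order within a group
awk -F, -v OFS='\n' -v ORS='\n\n' '
    # a new Industry value means the previous group is complete: print and reset
    NR > 1 && $3 != prevKey { print prevKey, prevRec; prevRec = "" }
    { prevKey = $3; prevRec = (prevRec == "" ? "" : prevRec ",") $1 "*" $NF }
    END { if (NR) print prevKey, prevRec }   # flush the last group, if there was any input
' > sorted_output.txt

cat sorted_output.txt
```

The trade-off is that the groups now come out in sorted order rather than in order of first appearance, and `sort` itself has to read the whole input before awk sees the first row.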