Working with complex CSV from the Linux command line

Time: 2015-05-18 01:28:40

Tags: linux csv awk

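I have a complex CSV file (here is an external link, since even a small part of it doesn't look good on SO) in which a particular column may itself consist of several space-separated sub-columns.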

reset,angle,sine,multiStepPredictions.actual,multiStepPredictions.1,anomalyScore,multiStepBestPredictions.actual,multiStepBestPredictions.1,anomalyLabel,multiStepBestPredictions:multiStep:errorMetric='altMAPE':steps=[1]:window=1000:field=sine,multiStepBestPredictions:multiStep:errorMetric='aae':steps=[1]:window=1000:field=sine
int,string,string,string,string,string,string,string,string,float,float
R,,,,,,,,,,
0,0.0,0.0,0.0,None,1.0,0.0,None,[],0,0
0,0.0314159265359,0.0314107590781,0.0314107590781,{0.0: 1.0},1.0,0.0314107590781,0.0,[],100.0,0.0314107590781
0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202    0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647
0,0.0942477796077,0.0941083133185,0.0941083133185,{0.03141075907812829: 1.0},1.0,0.0941083133185,0.0314107590781,[],66.63923621,0.0418293579232
0,0.125663706144,0.125333233564,0.125333233564,{0.06279051952931337: 0.98942669172932329     0.03141075907812829: 0.010573308270676691},1.0,0.125333233564,0.0627905195293,[],59.9506102238,0.0470076969512
0,0.157079632679,0.15643446504,0.15643446504,{0.03141075907812829: 0.0040463956041429626     0.09410831331851431: 0.94917381047888194    0.06279051952931337: 0.046779793916975114},1.0,0.15643446504,0.0941083133185,[],53.2586756624,0.0500713879053
0,0.188495559215,0.187381314586,0.187381314586,{0.12533323356430426: 0.85789473684210527     0.09410831331851431: 0.14210526315789476},1.0,0.187381314586,0.125333233564,[],47.5170631454,0.0520675034246

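To view it I am using this trick: column -s,$'\t' -t < *.csv | less -#2 -N -S, an upgraded version of one borrowed from Command line CSV viewer. With this trick it is explicitly clear which is the 1st, 2nd, 3rd ... column, and which data consist of several space-separated values inside a particular column.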

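My question is whether there is any trick for manipulating such a complex CSV. I know I can use awk to filter the 5th column and then filter that result on its 2nd column to get the desired part of the composed data, but I need to watch whether some other composed column comes before the 5th (so that I would actually need the 6th column instead of the 5th, and so on). Some columns may also contain a mix of composed and non-composed data. So awk is probably not the right tool.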

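The CSV viewer link mentions a tool called csvlook, which adds pipes to the output as separators. That could be easier to filter, since pipes would delimit the columns and whitespace would delimit the composed data within one column. But I can't run csvlook with multiple delimiters (comma and tab) the way I can with column, so it didn't render the data properly. What is the most comfortable way of handling this?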

2 answers:

Answer 0: (score: 2)

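As long as your input contains no columns with escaped, embedded , characters, you should be able to parse it with awk, using , as the field separator; e.g.: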

awk -F, '{ n = split($5, subField, "[[:blank:]]+"); for (i=1;i<=n;++i) print subField[i] }' file.csv

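The above uses the split() function to break the 5th field into sub-fields on whitespace. Because the fields are split on commas only, whitespace inside earlier columns does not shift the field numbering.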
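As a minimal sketch of taking this one step further, the following pairs each key in the 5th field with its value instead of printing raw whitespace-separated tokens. The sample file, its shortened values, and the function name subfield_pairs are made up for illustration; it assumes the composed field always looks like {key: value    key: value} and, as above, that no field contains an embedded comma.

```shell
# Hypothetical sample in the shape of the question's data (values shortened).
sample=$(mktemp)
cat > "$sample" <<'EOF'
reset,angle,sine,actual,prediction
0,0.0,0.0,0.0,None
0,0.03,0.03,0.03,{0.0: 1.0}
0,0.06,0.06,0.06,{0.0: 0.004    0.03: 0.996}
EOF

# Print "line-number key value" for every pair found in comma-separated field 5.
# Rows whose 5th field does not start with "{" (headers, None) are skipped.
subfield_pairs() {
    awk -F, '$5 ~ /^[{]/ {
        f = $5
        gsub(/[{}]/, "", f)          # drop the surrounding braces
        gsub(/:[ \t]+/, ":", f)      # glue "key: value" into "key:value"
        n = split(f, pair, /[ \t]+/) # one array element per pair
        for (i = 1; i <= n; ++i) {
            split(pair[i], kv, ":")  # kv[1] is the key, kv[2] the value
            print NR, kv[1], kv[2]
        }
    }' "$1"
}

subfield_pairs "$sample"
```

Filtering on the shape of the field rather than on a fixed number of header lines is a design choice here: it copes with columns that mix composed and plain values, at the cost of silently ignoring the plain ones.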

Answer 1: (score: 0)

Take a look at the cut command. You can specify a list of fields or a range of fields.
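A short sketch of what that might look like, using a made-up sample file with shortened values. Like the awk approach, this assumes no field contains an embedded comma, because cut splits on every delimiter and knows nothing about CSV quoting.

```shell
# Hypothetical sample in the shape of the question's data (values shortened).
sample=$(mktemp)
cat > "$sample" <<'EOF'
reset,angle,sine,actual,prediction
0,0.03,0.03,0.03,{0.0: 1.0}
0,0.06,0.06,0.06,{0.0: 0.004    0.03: 0.996}
EOF

# A single field: the fifth comma-separated column, spaces and all.
cut -d, -f5 "$sample"

# A range and a list combined: fields 2 through 3, plus field 5.
cut -d, -f2-3,5 "$sample"
```

cut only selects whole columns; pulling apart the space-separated values inside the composed column would still need a further step such as awk.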