Question

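I have very large .csv files containing raw data. Many fields have leading and trailing spaces, and many multi-word field values that should have only a single space between words have extra spaces, e.g.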

'12   Anywhere  Street'

should be:

'12 Anywhere Street'

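The leading, trailing, and extra spaces range from one to six extra spaces. I could load the files into my database and run scripts to trim them. The leading- and trailing-trim scripts run well and execute quickly; however, the script that removes extra spaces between words is longer and more time-consuming. It would be best to remove the extra spaces between words in the raw .csv files from the command line before loading them into my database.

I basically need to run a replace function that replaces any instance of "  ", "   ", "    ", ... with " ", up to six spaces or so. I'd really appreciate some help in accomplishing this.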

Answer 1

In Part 1 of this response, I'll first assume that your CSV file has a field separator (say ",") that does NOT occur within any field. In Part 2, I'll deal with the more general case.

Part 1.

awk -F, '
  function trim(s) {
    sub(/^  */,"",s); sub(/  *$/,"",s); gsub(/   */," ",s); return s;
  }
  BEGIN {OFS=FS}
  {for (i=1;i<=NF;i++) { $i=trim($i) }; print }'
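
To sanity-check the Part 1 command before running it on a large file, it can be wrapped in a shell function and fed a sample line. This is only a minimal sketch: trim_fields is a hypothetical name, the sample record is made up, and it still assumes that no field contains a comma. The regexes use + (one or more) in place of the doubled-space form; they match the same strings.

```shell
# trim_fields: hypothetical wrapper; reads CSV on stdin, writes cleaned CSV to stdout.
# Assumption (same as Part 1): the comma never occurs inside a field.
trim_fields() {
  awk -F, '
    function trim(s) {
      sub(/^ +/, "", s)      # strip leading spaces
      sub(/ +$/, "", s)      # strip trailing spaces
      gsub(/  +/, " ", s)    # collapse runs of two or more spaces
      return s
    }
    BEGIN { OFS = FS }
    { for (i = 1; i <= NF; i++) $i = trim($i); print }'
}

# Made-up sample record:
printf '%s\n' '  12   Anywhere  Street ,  Springfield  , IL ' | trim_fields
# prints: 12 Anywhere Street,Springfield,IL
```

On a real file this would be run as trim_fields < input.csv > output.csv, where both file names are placeholders.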

Part 2.

To handle the general case, it's best to use a CSV-aware tool (such as Excel or one of the csv2tsv command-line tools) to convert the CSV to a simple format wherein the value-separator does not literally occur within the values. The TSV format (with tab-separated values) is particularly appropriate since it allows a representation of tabs to be included within fields.

Then run the above awk command using awk -F"\t" instead of awk -F,.

To recover the original format, use a tool such as Excel, tsv2csv, or jq. Here is the jq incantation assuming you want a "standard" CSV file:

jq -Rr 'split("\t") | @csv'

In a pinch, the following will probably be sufficient:

awk -F"\t" '
BEGIN{OFS=","; QQ="\"";}
  function q(s)   { if (index(s,OFS)) { return QQ s QQ }; return s}
  function qq(s)  { gsub( QQ, QQ QQ, s); return QQ s QQ }
  function wrap(s) { if (index(s,QQ)) { return qq(s) } return q(s)}
  { s=wrap($1); for (i=2;i<=NF;i++) {s=s OFS wrap($i)}; print s}'
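
The middle and last steps of Part 2 can be chained and tried on a sample TSV line. The following is a sketch under stated assumptions: trim_tsv and tsv_to_csv are hypothetical names, the input is already TSV (the CSV-to-TSV conversion is left to whichever CSV-aware tool you choose), and the quoting covers only fields that contain a comma or a double quote.

```shell
# trim_tsv: the Part 1 trimming logic, with tab as the field separator.
trim_tsv() {
  awk -F'\t' '
    function trim(s) {
      sub(/^ +/, "", s); sub(/ +$/, "", s); gsub(/  +/, " ", s)
      return s
    }
    BEGIN { OFS = FS }
    { for (i = 1; i <= NF; i++) $i = trim($i); print }'
}

# tsv_to_csv: the "in a pinch" converter; quotes a field only when needed.
tsv_to_csv() {
  awk -F'\t' '
    BEGIN { OFS = ","; QQ = "\"" }
    function q(s)    { if (index(s, OFS)) return QQ s QQ; return s }
    function qq(s)   { gsub(QQ, QQ QQ, s); return QQ s QQ }   # double embedded quotes
    function wrap(s) { if (index(s, QQ)) return qq(s); return q(s) }
    { s = wrap($1); for (i = 2; i <= NF; i++) s = s OFS wrap($i); print s }'
}

# Made-up sample: three tab-separated fields, one with a comma, one with quotes.
printf '  12   Anywhere  Street \t Springfield,  IL \tsay "hi"\n' | trim_tsv | tsv_to_csv
# prints: 12 Anywhere Street,"Springfield, IL","say ""hi"""
```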

Answer 2

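On macOS or Linux you can do:

tr -s ' ' < data.csv > formatted.csv

This squeezes every run of spaces down to a single space. It does not trim the values, so a single leading or trailing space can remain in a field. (Using tr -s '[:space:]' instead would also squeeze tabs and newlines, which removes blank lines; if you use it, quote the character class so the shell cannot expand it as a glob.) Maybe this will get you going.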
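
Because squeezing alone leaves at most one space at the edges of each value, a follow-up sed pass can remove those. This is a sketch rather than part of the answer above: squeeze_and_trim is a hypothetical name, and the sed step assumes that commas never occur inside a field, since it removes spaces around every comma.

```shell
# squeeze_and_trim: hypothetical helper; reads CSV on stdin, writes to stdout.
# Assumption: no field contains a comma (no quoted commas).
squeeze_and_trim() {
  tr -s ' ' |                 # collapse runs of spaces to one space
    sed -e 's/^ *//' \
        -e 's/ *$//' \
        -e 's/ *, */,/g'      # drop spaces at line ends and around commas
}

# Made-up sample record:
printf '%s\n' ' 12   Anywhere  Street ,  IL ' | squeeze_and_trim
# prints: 12 Anywhere Street,IL
```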

Trimming a CSV file from the command line
