(g)awk: move to the next file on a partially blank line

Time: 2017-11-30 15:51:21

Tags: awk gawk

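Problem

I simply need to combine a whole bunch of files and remove the header (line 1) from every file except the first.

Data

Here are the last three lines of three of these files (plus line 1, the header):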

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""

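Problem (continued)

As you can see, the last line has a number in column 5 (it is a column total). Of course, I don't want that last line. But it is (obviously) on a different line number in each file.

(G)awk is clearly the solution, but I don't know (g)awk.

What I've tried

I've tried a lot of combinations, but the one I'm most surprised does not work is: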

gawk '
  { if (!$1 ) nextfile }
  NR == 1 {$0 = "Filename" "StartDate" OFS $0; print} 
  FNR > 1 {$0 =  FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv

Expected output (as requested)

"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"

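Of course, I have tried searching Google and SO. Most of the answers I see require more knowledge than I have just to understand them. (I'm not a data wrangler, but I have a data-wrangling task.)

Thanks for your help!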

3 answers:

Answer 0 (score: 2)

This should do it:

awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}

It prints the first header, then skips the other headers and the last line of each file.
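To make the one-liner easier to follow, here is a small sketch of the same rules run on two made-up files. The file names and contents are hypothetical, chosen only to show the delayed-print idea: each line is held in p and printed one record later, so the last line of every file is never printed. Note that this relies on the total really being the last line; a trailing blank line would be dropped in its place.

```shell
# Hypothetical sample files in a temporary directory (names are illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'HEADER' 'a1' 'a2' 'TOTAL_A' > "$tmp/file1"
printf '%s\n' 'HEADER' 'b1' 'TOTAL_B'      > "$tmp/file2"

# NR==1            print the very first line (header of the first file)
# FNR==1 {next}    skip line 1 of every file, leaving p untouched
# FNR>2 {print p}  from line 3 of a file on, print the previously saved line
# {p=$0}           remember the current line for the next record
result=$(awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -r "$tmp"
```

Because the saved line is only printed when FNR is greater than 2, the total held in p when a new file starts is silently overwritten by that file's first data line.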

Answer 1 (score: 1)

The following should do the trick:

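awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv

Here awk will:

  1. Split each record on commas: -F","
  2. If this is the first record awk encounters, set the variable header to the entire line and then print it: NR==1{header=$0; print $0}
  3. If the current line is not the header and the first field is not "" (the two quote characters that mark a "total" line), print the line: $0!=header && $1!="\"\""{print $0}

As I mentioned in the comments below, if the first field of your records always starts with an 8-digit date, you can simplify (this is less generic than the code above):

awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv

Basically, if this is the first record being processed, print it (it is the header); or (||) if the first field is an 8-digit number enclosed in double quotes, print it.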
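One detail is worth checking when filtering on the first field: with -F"," awk splits only on the comma and keeps the surrounding quotes, so on a total row the first field is the two characters "" rather than an empty string. A minimal sketch, using a total row copied from the question (the variable names are only illustrative):

```shell
# A "total" row as it appears in the sample data.
total_row='"","","","","9.76",""'

# With -F"," the first field is the two characters "" (length 2, not 0).
first_len=$(printf '%s\n' "$total_row" | awk -F"," '{ print length($1) }')

# Comparing against "\"\"" (two literal quotes) therefore identifies the row.
verdict=$(printf '%s\n' "$total_row" |
  awk -F"," '{ if ($1 == "\"\"") print "total row"; else print "data row" }')

echo "$first_len $verdict"
```

A test against an empty awk string would not match this row, which is why the comparison has to be made against the quoted form.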

Answer 2 (score: 1)

Another awk approach:

awk -F, '
        NR == 1 {
                header = $0
                print
                next
        }
        FNR > 1 && $1 != "\"\""
' *.csv
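For anyone who wants to try this without touching real data, here is a hedged sketch that runs a trimmed version of the same two rules on two throwaway files. The temporary directory and the names a.csv and b.csv are assumptions made for the demo; the rows are taken from the question. The second rule has no action block, so awk falls back to its default action and prints the matching line.

```shell
# Build two small sample files in a temporary directory (hypothetical names).
tmp=$(mktemp -d)
hdr='"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"'
printf '%s\n' "$hdr" '"20170101","20170131","1","5.49","EUR","5.49"' \
  '"","","","","9.76",""' > "$tmp/a.csv"
printf '%s\n' "$hdr" '"20170201","20170228","1","4.88","EUR","4.88"' \
  '"","","","","11.59",""' > "$tmp/b.csv"

combined=$(awk -F, '
    NR == 1 { print; next }      # header of the first file only
    FNR > 1 && $1 != "\"\""      # data rows: not a header, not a total
' "$tmp"/*.csv)

printf '%s\n' "$combined"
rm -r "$tmp"
```

The output should be the header once, followed by the two data rows, with both total rows filtered out.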