Regular expression as field separator in awk

Time: 2015-06-15 13:06:41

Tags: regex bash awk

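I have a large data set with 586696 rows and 40 columns. However, I am only interested in a few of those columns: one holds names and another holds numbers.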

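I am having trouble with the field separator in this file. All columns are separated by spaces. Suppose my file is named test.txt and has 5 people in it; it would look like this: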

Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

So if I run

awk '{print $1 " " $2}' test.txt

the result is

Name Salary
FirstName01 LastName01
FirstName02 MiddleName02
FirstName03 MiddleName03
FirstName04 LastName04
FirstName05 MiddleName05

but what I want is

Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

For the purposes of this question, assume there are columns before the Name column and after the Salary column.

How can I solve this? I think I have to use some regular expression as the field separator to make awk work here, but I cannot find a way to do it.

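Edit: I don't think I was clear in my original post. I know awk is giving me exactly what I asked for. My problem is that my full data set looks like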

Column01 Column02 Column03 Name Salary Column06 ...
Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 ...
Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 ...
Text0301 Text0302 Text0303 FirstName03 MiddleName03 LastName03 Salary03 ...
Text0401 Text0402 Text0403 FirstName04 LastName04 Salary04 ...
Text0501 Text0502 Text0503 FirstName05 MiddleName05 LastName05 Salary05 ...

Given the table above, I want awk code that produces the following result:

Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

Sorry for my misleading question.

3 Answers:

Answer 0: (score: 0)

Following @jas's comment: you can check the number of fields in each record with awk's NF variable. So something like this should do the trick for your test.txt:

awk '{name=$4; for (i = 5; i <= NF - 2; i++) name=name " " $i; salary=$i; print name " " salary}' test.txt

This builds the name starting at field 4 and appends every field up to the third-to-last one. When the loop ends, i equals NF - 1, so $i is the second-to-last field, which is the salary.

Of course, you must adjust the values in 'name=$4', 'i = 5' and 'NF - 2' to match your data.

As others pointed out, it would be better to change the program that generates the data set so that it uses an unambiguous field delimiter.
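If the program that produces the file cannot be changed, one possible workaround is to convert the file once into a tab-delimited copy, so that the whole name becomes a single field. The following is only a sketch under assumptions: the sample data, the shell variables raw, tsv and picked, and the awk variables before and after are all made up for illustration, and it assumes Name is the only column that contains spaces.

```shell
# Hypothetical sample: 3 columns before Name, then Salary and one more column after it.
raw=$(mktemp)
cat > "$raw" <<'EOF'
Column01 Column02 Column03 Name Salary Column06
Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 Text0106
Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 Text0206
EOF

# before = number of columns ahead of Name
# after  = number of columns following Name (Salary included)
tsv=$(awk -v before=3 -v after=2 'BEGIN { OFS = "\t" } {
    line = $1
    for (i = 2; i <= NF; i++) {
        # Fields after the first word of the name, up to its last word,
        # are joined with a space; every other boundary becomes a tab.
        in_name = (i > before + 1 && i <= NF - after)
        sep = (in_name ? " " : OFS)
        line = line sep $i
    }
    print line
}' "$raw")

# With a real delimiter in place, selecting columns is straightforward again.
picked=$(printf '%s\n' "$tsv" | awk -F'\t' '{ print $4, $5 }')
printf '%s\n' "$picked"
rm -f "$raw"
```

On this sample the last command prints "Name Salary", then "FirstName01 LastName01 Salary01" and "FirstName02 MiddleName02 LastName02 Salary02".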

Answer 1: (score: 0)

Your problem is the bad original format! If Name is the only column that expands to multiple fields, you can compare the number of fields in each row with the number in the header row and adjust the column selection accordingly.

awk 'NR==1{c=NF} {t=$4; for(i=5;i<6+(NF-c);i++) t=t " " $i; print t}' badformat.txt
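To make the field counting in the one-liner above easier to follow, here is the same idea spelled out with longer variable names on a small made-up sample. This is a sketch, not a drop-in solution: the sample rows and the names sample2, header_based, header_nf and extra are hypothetical, and it assumes that the header has exactly one word per column and that each row has the same columns as the header.

```shell
# Hypothetical sample: the header has 7 one-word columns; names span 2 or 3 words.
sample2=$(mktemp)
cat > "$sample2" <<'EOF'
Column01 Column02 Column03 Name Salary Column06 Column07
Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 Text0106 Text0107
Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 Text0206 Text0207
EOF

header_based=$(awk '
NR == 1 { header_nf = NF }      # field count when every column is a single word
{
    extra = NF - header_nf      # additional words the name occupies in this row
    out = $4                    # the name starts at field 4
    for (i = 5; i <= 5 + extra; i++)
        out = out " " $i        # rest of the name, ending with the salary field
    print out
}' "$sample2")

printf '%s\n' "$header_based"
rm -f "$sample2"
```

For the header row extra is 0, so only fields 4 and 5 are printed; a two-word name gives extra = 1 and a three-word name gives extra = 2, which moves the end of the loop onto the salary field in each case.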

Answer 2: (score: 0)

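If none of your other "columns" contain spaces, and every line always has the same number of "columns", then the way to approach it is to start at field X and print the fields up to (NF-Y). That way it doesn't matter how many fields the name "column" spans on each line, because the end point is determined by how many columns should remain after the name.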

If your input isn't like that, edit your question to show us what it really looks like!

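This seems to work on the sample input you provided, but it could be completely wrong for your real input, since your sample does not contain the values that are present in your real input and is internally inconsistent in field positions between the first record and the others: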

$ awk '{e=NF-1; for (i=4;i<=e;i++) printf "%s%s", $i, (i<e?OFS:ORS)}' file
Name Salary
FirstName01 LastName01 Salary01
FirstName02 MiddleName02 LastName02 Salary02
FirstName03 MiddleName03 LastName03 Salary03
FirstName04 LastName04 Salary04
FirstName05 MiddleName05 LastName05 Salary05

The above was run on this input file, whose first line was modified to at least be consistent with the subsequent lines:

$ cat file
Column01 Column02 Column03 Name Salary ...
Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 ...
Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 ...
Text0301 Text0302 Text0303 FirstName03 MiddleName03 LastName03 Salary03 ...
Text0401 Text0402 Text0403 FirstName04 LastName04 Salary04 ...
Text0501 Text0502 Text0503 FirstName05 MiddleName05 LastName05 Salary05 ...
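
The "start at field X, stop at field (NF-Y)" idea from this answer can also be written with both bounds passed in as awk variables, so that the script itself does not need editing when the column layout differs. This is a minimal sketch under the same assumptions as above (only the name contains spaces, and every line has the same columns); the variable names first and after, and the shortened sample file, are made up for illustration.

```shell
# Shortened copy of the input above; the literal "..." counts as one trailing field.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Column01 Column02 Column03 Name Salary ...
Text0101 Text0102 Text0103 FirstName01 LastName01 Salary01 ...
Text0201 Text0202 Text0203 FirstName02 MiddleName02 LastName02 Salary02 ...
EOF

# first = field where the name starts (X)
# after = number of fields that follow the salary (Y)
name_salary=$(awk -v first=4 -v after=1 '{
    sal = NF - after                 # index of the salary field
    name = $first
    for (i = first + 1; i < sal; i++)
        name = name " " $i           # append the remaining words of the name
    print name " " $sal
}' "$sample")

printf '%s\n' "$name_salary"
rm -f "$sample"
```

On this sample it prints "Name Salary", "FirstName01 LastName01 Salary01" and "FirstName02 MiddleName02 LastName02 Salary02". For a real file with more columns after the salary, after would have to be set to the actual number of trailing columns.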