Remove newline characters from lines that do not start with a timestamp in a pipe-delimited file

Time: 2014-05-14 15:59:00

Tags: regex sed notepad++ data-cleansing

Here is a sample of the data:

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210

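I need a script that removes the newline characters from lines that do not start with a timestamp. In the sample above, lines 2-6 would be appended to the last field of the first line as a kind of text blob. I know how to detect the good lines,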

grep '^[0-9][0-9][0-9][0-9].*' testfile

and the bad lines,

grep '^[^0-9][^0-9][^0-9][^0-9].*' testfile

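The question now is how to apply this (with sed?) so that the lines following a 'good' line are folded back into the last field of that line. Any help here would be greatly appreciated.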
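One detail worth noting about the two patterns above: the "bad line" pattern requires four leading non-digit characters, so it would skip a continuation line that begins with a digit (for example a line starting with a clock time) or one shorter than four characters. A minimal sketch, assuming a grep that supports -E and -v, inverts the timestamp match instead. The function names good_lines and bad_lines are illustrative and not part of the question.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Lines that start a record: YYYY-MM-DD at the very beginning of the line.
good_lines() { grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}' "$@"; }

# Every other line is treated as a continuation; -v inverts the match,
# so lines that begin with a digit but not with a full date are included.
bad_lines() { grep -vE '^[0-9]{4}-[0-9]{2}-[0-9]{2}' "$@"; }
```

Both helpers read the files named as arguments, or standard input when none are given, for example: bad_lines testfile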

Here is an example of the desired output:

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406 |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603 |PHONE HOME|SDRKRKS|REAS|something|TN 90210

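Edit:

There is some disagreement about which tool is the most appropriate. At the moment I am leaning toward Notepad++. The following comes close to what I want, but it does not quite work; perhaps someone can help me adapt it to my use case: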

(?! [0-9]{4}\-[0-9]{2}-[0-9]{2}).*

(?! [0-9]{4}\-[0-9]{2}-[0-9]{2})  - searches for a line not like a timestamp
.*                                  - followed by anything else

The problem is that .* captures the timestamp I am trying to negate. Any ideas?

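Edit 2: Thanks to everyone for the helpful suggestions; this has definitely pointed me in the right direction! The following regex finds the offending \n character in Notepad++, but nothing happens when I try to perform the replacement: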

Find: (.*)(\n)(?![0-9]{4}\-[0-9]{2}\-[0-9]{2})
Replace: \1

Does anyone here have an idea how to force Notepad++ to remove the offending \n?

Edit 3: Here is additional sample data that appears to be incompatible with the suggested solutions:

2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847

5 Answers:

Answer 0: (score: 2)

Using all of the posted sample input concatenated into one file:

$ cat file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847

$ awk 'NR>1{pre = (/^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ ? ORS : OFS)} {printf "%s%s",pre,$0} END{print ""}' file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR 6:13 AM 6/22/2013 VERIFIED CURLING TN :- 834974978398 XX and YY updated THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr tn 4887839847

If that is not your expected output, update your question to show what it is.
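The one-liner above can be unpacked into a commented function. This is only a minimal sketch, assuming a POSIX awk; join_records is a hypothetical name, and the date pattern is spelled out digit by digit because some older awk builds do not support {n} interval expressions.

```shell
#!/bin/sh
# Hypothetical helper: join continuation lines onto the preceding
# timestamped record, separated by a single space.
join_records() {
  awk '
    # A line starting with YYYY-MM-DD begins a new record.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
      if (NR > 1) printf "\n"    # terminate the previous record
      printf "%s", $0
      next
    }
    # Any other line is a continuation: append it after one space.
    { printf "%s%s", (NR > 1 ? " " : ""), $0 }
    # Terminate the final record, unless the input was empty.
    END { if (NR > 0) printf "\n" }
  ' "$@"
}
```

It can be used as join_records file > file.joined, or at the end of a pipeline, since awk reads standard input when no file is named.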

Answer 1: (score: 2)

The simplest solution:

echo $(cat file) | sed -re 's/(2013-06)/@@@\1/g' | sed -re 's/@@@/\n/g'

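This works because the unquoted $(cat file) is split into words, so echo prints everything on one line; we then insert @@@ before each timestamp and replace each @@@ with a newline character.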

tiago@dell:~$ echo $(cat file) | sed -re 's/(2013-06)/@@@\1/g' | sed -re 's/@@@/\n/g'

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0 
2013-06-22 00:00:49.307121|0950704421406 |PHONE HOME|SDRKRKS|REAS|something|MRS 
2013-06-22 00:00:50.379487|0441813679603 |PHONE HOME|SDRKRKS|REAS|something|TN 90210 
2013-06-22 00:00:02.540298|0238704723874 |SMELL TEST|HAKEKJ |REAS|No cooking|tcna / ncc 
2013-06-22 00:00:04.302887|3289749873342 |SMELL TEST|ICNIDF |REAS|No cooking|JINUJ/CVGIND/NASR 6:13 AM 6/22/2013 VERIFIED CURLING TN :- 834974978398 XX and YY updated THIS IS A SENTENCE 
2013-06-22 00:00:06.937545|30874987392838 |SMELL TEST|KCIDKD |REAS|No cooking|SrutiD/cvgind/nasr tn 4887839847
tiago@dell:~$ cat file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847
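
The unquoted $(cat file) in this answer does more than join lines: word splitting also collapses runs of spaces, which is why the padding after the second field is gone in the output above. The sketch below illustrates the effect on a two-line sample. The sample text and the variable names are made up for the illustration, and a POSIX sh or bash is assumed (zsh does not split unquoted variables by default).

```shell
#!/bin/sh
# Illustrative two-line sample with padded fields (not taken from the real file).
sample='2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
NUM: 90834098'

# Unquoted expansion: the shell splits on spaces and newlines, and echo
# joins the resulting words with single spaces.
unquoted=$(echo $sample)

# Quoted expansion: padding and newlines are passed through unchanged.
quoted=$(echo "$sample")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

The desired output in the question also has the padding collapsed, so this may be acceptable here. Unquoted expansion is also subject to file name globbing, so a field containing * or ? could be expanded against the current directory.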

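Answer 2: (score: 1)

I am not sure what you want to do, since you did not provide an example of the output. But if you want to join the lines, you can try this awk:

awk '{printf "%s%s", (NR==1 ? "" : (/^2013-/ ? RS : " ")), $0} END {print ""}' file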

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210

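Answer 3: (score: 1)

Here is one way using GNU sed:

sed -nr ':a;N;/\n[0-9]{4}-[0-9]{2}-[0-9]{2}/{P;$!D;s/.*\n//};s/\n/ /g;$!ba;p' file

Explanation:

  • :a creates a label.
  • N appends the next input line to the pattern space.
  • /\n[0-9]{4}-[0-9]{2}-[0-9]{2}/{P;$!D;s/.*\n//} tests whether the appended line starts with a date. If it does, P prints the pattern space up to the first newline. If this is not the last line, D then deletes everything up to and including that newline and restarts the cycle with what remains. On the last line, s/.*\n// removes the already-printed record instead, leaving the final line for the closing p.
  • s/\n/ /g handles every other case: the newline is replaced with a space, joining the continuation line to its record.
  • $!ba branches back to the label and repeats, unless this is the last line.
  • p prints whatever remains in the pattern space at the end of the input.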
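
The steps above can also be laid out one command per line with comments. This is only a sketch, assuming GNU sed (as the answer does) invoked with -E; join_with_sed is an illustrative name. The substitution inside the braces carries no p flag, because the closing p already prints the final line and a second print would duplicate it.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed script, one command per line.
join_with_sed() {
  sed -nE '
    :a
    # Append the next input line to the pattern space.
    N
    # The appended line starts with a date, so the previous record is complete.
    /\n[0-9]{4}-[0-9]{2}-[0-9]{2}/{
      # Print the completed record (everything up to the first newline).
      P
      # Not the last line: delete the printed record and restart the cycle.
      $!D
      # Last line: drop the printed record and keep the final line.
      s/.*\n//
    }
    # Otherwise the appended line is a continuation: join it with a space.
    s/\n/ /g
    # Loop until the last line has been read.
    $!ba
    # End of input: print what is left in the pattern space.
    p
  ' "$@"
}
```

Usage follows the original command: join_with_sed file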

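Answer 4: (score: 1)

This might work for you (GNU sed):

sed ':a;$!N;/\n[^|]*$/s/\n/ /;ta;P;D' file

Append the next line. If the appended line does not contain |, replace the newline with a space and repeat. Otherwise print the first line, delete it, and continue with the line that was just appended.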
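
The rule used in this answer (a line without | is a continuation) can also be written in awk for systems without GNU sed. This is a rough sketch under that same assumption, which breaks if continuation text ever contains a pipe; join_on_pipe and the started flag are names chosen for the illustration.

```shell
#!/bin/sh
# Hypothetical helper: treat any line containing "|" as the start of a record
# and append every other line to the record before it.
join_on_pipe() {
  awk '
    # index() returns a non-zero position when the line contains a pipe.
    index($0, "|") {
      if (started) printf "\n"   # terminate the previous record
      printf "%s", $0
      started = 1
      next
    }
    # No pipe: continuation text, appended after a single space.
    { printf "%s%s", (started ? " " : ""), $0; started = 1 }
    # Terminate the final record, unless the input was empty.
    END { if (started) printf "\n" }
  ' "$@"
}
```

It is called the same way as the sed command: join_on_pipe file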