How to replace the comma before a specific string with \n in a CSV file

Time: 2019-01-06 12:10:19

Tags: bash awk sed

I have a CSV file, and I want to replace the comma after GCA_* with \n.

Input:

ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1,ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio,ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio 

Desired output:

ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio 

My attempt:

sed 's/ASM*/\n&/' ordered_lines_per_genome.csv > assembly_report_table.csv

5 answers:

Answer 0 (score: 2)

Using GNU sed:

sed 's/\(GCA_[^,]*\),/\1\n/g' input.csv
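  • \(GCA_[^,]*\),: matches GCA_* followed by a comma. \(...\) defines a group that can be reused in the replacement string.
  • Replacement \1\n: inserts the group from the match ("GCA_*") and appends a newline.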
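To see what this substitution does and where it stops, here is a minimal sketch on a made-up, shortened line (the field values are placeholders, not the question's data). It assumes GNU sed, since \n in the replacement is a GNU extension. As in the question's sample, only the first record contains a GCA_ field, so only one split happens:

```shell
# Hypothetical shortened input: three records, only the first has a GCA_ field.
sample='ASM1,a,GCA_1.1,ASM2,b,ASM3,c'

# Break the line after every GCA_ field (GNU sed: \n in the replacement).
split_after_gca=$(printf '%s\n' "$sample" | sed 's/\(GCA_[^,]*\),/\1\n/g')

printf '%s\n' "$split_after_gca"
# ASM1,a,GCA_1.1
# ASM2,b,ASM3,c
```

Records without a GCA_ field stay joined, so for input like the sample in the question an ASM-based pattern may be the better fit.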

To change the file in place:

sed -i 's/\(GCA_[^,]*\),/\1\n/g' input.csv

Or, to fix your own command line as suggested in the comments:

sed 's/ASM[^,]*/\n&/g' input.csv

Or better, to avoid the trailing comma:

sed 's/,\(ASM[^,]*\)/\n\1/g' input.csv
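The difference between the last two commands is easiest to see side by side. The following is a small sketch on a hypothetical two-record line (placeholder values, GNU sed assumed); the variable names are illustrative only:

```shell
# Hypothetical two-record sample, shortened from the shape of the question's data.
sample='ASM1,a,GCA_1.1,ASM2,b'

# Variant 1: newline before every ASM field. The comma stays on the previous
# line, and the first record gets an empty line in front of it.
with_comma=$(printf '%s\n' "$sample" | sed 's/ASM[^,]*/\n&/g')

# Variant 2: the comma is part of the match, so it is dropped, and the first
# ASM field (which has no comma before it) is left alone.
without_comma=$(printf '%s\n' "$sample" | sed 's/,\(ASM[^,]*\)/\n\1/g')

printf '[%s]\n' "$with_comma"
printf '[%s]\n' "$without_comma"
```

The brackets in the printed output only make the leading empty line of the first variant visible.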

Answer 1 (score: 2)

You may be looking for this simple GNU sed command:

$ sed 's/,/\n/16;P;D' file
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E.coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio
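  • s/,/\n/16: replaces the 16th comma with a newline (\n)
  • P: prints the pattern space up to the first newline
  • D: deletes the printed text and restarts the cycle with the remaining text

It is based on the excellent answer by @potong.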
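The P;D loop can be traced on a smaller case. This sketch uses a made-up line with three fields per record instead of sixteen, so the count in the s command is 3; GNU sed is assumed for \n in the replacement:

```shell
# Hypothetical sample: three records of three fields each, all on one line.
sample='a1,b1,c1,a2,b2,c2,a3,b3,c3'

# Replace the 3rd comma with a newline, print up to that newline (P), then
# delete up to it and rerun the script on what is left (D) without reading
# new input. The last record has no 3rd comma, so it is printed as-is.
records=$(printf '%s\n' "$sample" | sed 's/,/\n/3;P;D')

printf '%s\n' "$records"
# a1,b1,c1
# a2,b2,c2
# a3,b3,c3
```

This approach depends on every record having the same number of fields and on no field containing a comma.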

Answer 2 (score: 2)

You should remove the * and add g to make the substitution global:

sed 's/ASM/\n&/g' ordered_lines_per_genome.csv > assembly_report_table.csv

If you don't want to keep the comma, you can use

sed 's/,ASM/\nASM/g' ordered_lines_per_genome.csv > assembly_report_table.csv
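Why the original attempt fails can be checked on short, made-up inputs (placeholder values, GNU sed assumed). In a basic regular expression, ASM* means "AS followed by zero or more M", and without g only the first match on the line is replaced:

```shell
# The attempt from the question: a single newline is inserted before the first
# match, which here is the very start of the line.
first_try=$(printf 'ASM1,a,ASM2,b\n' | sed 's/ASM*/\n&/')

# The star makes the M optional, so a bare "AS" inside another field matches too.
star_demo=$(printf 'GAS,ASM1\n' | sed 's/ASM*/\n&/g')

# The corrected command: literal ",ASM", replaced globally.
fixed=$(printf 'ASM1,a,ASM2,b\n' | sed 's/,ASM/\nASM/g')

printf '[%s]\n' "$first_try" "$star_demo" "$fixed"
```

The brackets are only there to make leading empty lines visible in the output.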

Just for fun, with awk:

awk 'BEGIN {RS="ASM"} NF {print "ASM" $0}' ordered_lines_per_genome.csv

If you don't want a comma at the end of each line, you can use

awk 'BEGIN {RS="[,]*ASM"} NF {print "ASM" $0}' ordered_lines_per_genome.csv

Answer 3 (score: 0)

An awk solution:

$ awk -F, '{i=0;while((++i)<=NF)printf "%s%s", $i, ((!(i%16) || i==NF)? ORS : ",")}' mb.csv
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio 

Similar to mickp's answer: 16 fields per line.
If you are sure the input file contains only one line, you can drop the leading i=0;
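The same idea can be written with the field count as a parameter. This is a sketch only: the function name split_every and the variable n are made up for illustration, and the sample line uses three fields per record rather than sixteen:

```shell
# Hypothetical helper: reflow comma-separated input into rows of n fields.
# The name split_every and its argument are illustrative, not from the answer.
split_every() {
    awk -F, -v n="$1" '{
        # End the row after every n-th field and after the last field;
        # otherwise put the comma back between fields.
        for (i = 1; i <= NF; i++)
            printf "%s%s", $i, ((i % n == 0 || i == NF) ? ORS : ",")
    }'
}

printf 'a1,b1,c1,a2,b2,c2,a3,b3\n' | split_every 3
# a1,b1,c1
# a2,b2,c2
# a3,b3
```

For the file in the question the call would be split_every 16 < mb.csv. Because the for loop resets i on every input line, no separate i=0 is needed.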

If "ASM" is reasonably unique, you can use your own approach, with ASM as the start of each line (gensub requires GNU awk):

awk '{print gensub(",ASM","\nASM","g")}' mb.csv

That is, in your case:

awk '{print gensub(",ASM","\nASM","g")}' ordered_lines_per_genome.csv > assembly_report_table.csv

Answer 4 (score: 0)

Using Perl, and assuming the IDs start with ASM.

$ cat maryem.txt
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1,ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio,ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio
$ perl -pe ' s/([^^]ASM.+?,)/\n$1/g; s/^,//mg; ' maryem.txt
ASM190063v1,Escherichia coli(E.coli),strain=D3,562,SAMN03252421,PRJNA269191,Nanjing Agricultural University,2016-12-12,n/a,major,Complete Genome,full,Newbler v. 2.7,30-80x,Illumina Miseq; Roche 454 GS Junior,GCA_001900635.1
ASM301855v1,Escherichia coli (E. coli),strain=2013C-4225,562,SAMN08579596,PRJNA218110,CDC,2018-3-26,n/a,major,Complete Genome,full,HGAP v. 3,yes,76.725x,PacBio
ASM330895v1,Escherichia coli (E. coli),strain=2017C-4109,562,SAMN09534373,PRJNA218110,CDC,2018-7-10,n/a,major,Complete Genome,full,HGAP v. 3,yes,286.7X,PacBio
$