Using grep

Time: 2018-06-01 13:19:57

Tags: grep

I have a large file that looks like this small example:

chr1    HAVANA  transcript  69091   70008   .   +   .   gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1    HAVANA  exon    69091   70008   .   +   .   gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1;  exon_id "ENSE00002319515.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1    HAVANA  CDS 69091   70005   .   +   0   gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1;  exon_id "ENSE00002319515.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";

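Every line starts with "chr". I want to create a new file containing only the lines whose 3rd column is "CDS". How can I do a conditional grep? I used the following code: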

grep -i CDS infile.txt > outfile

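But this returns every line that contains CDS, regardless of which column it appears in. Do you know how to fix this?

From the small example, I want to get this: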

chr1    HAVANA  CDS 69091   70005   .   +   0   gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1;  exon_id "ENSE00002319515.1";  level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";

1 answer:

Answer 0: (score: 1)

The clean solution is to use awk and check the third column explicitly:

awk '$3 == "CDS"' infile.txt
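
GTF-style files like this one are normally tab-separated, so it may be safer to set the field separator explicitly and redirect the result into the new file. A minimal sketch, assuming the columns are tab-delimited; sample.gtf and cds_only.gtf are hypothetical file names, and the attribute column is shortened for readability:

```shell
# Build a tiny tab-separated sample (hypothetical file name, shortened attributes).
printf 'chr1\tHAVANA\ttranscript\t69091\t70008\t.\t+\t.\ttag "CCDS";\n'  > sample.gtf
printf 'chr1\tHAVANA\texon\t69091\t70008\t.\t+\t.\ttag "CCDS";\n'       >> sample.gtf
printf 'chr1\tHAVANA\tCDS\t69091\t70005\t.\t+\t0\ttag "CCDS";\n'        >> sample.gtf

# Keep only rows whose third tab-separated field is exactly "CDS".
awk -F'\t' '$3 == "CDS"' sample.gtf > cds_only.gtf
```

With -F'\t', a field that itself contains spaces is still treated as a single column, which the default whitespace splitting would not guarantee.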

In your limited sample, every other occurrence of CDS on the other lines appears to be part of a longer word, so

grep -w 'CDS' infile.txt

would also work, because it requires the match to be a whole word. However, that relies only on the limited sample you showed.
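
To illustrate why the whole-word approach is fragile, here is a small sketch with a made-up exon line whose attribute column happens to contain CDS as a standalone word. The file name fragile.gtf and the `note` attribute are hypothetical and do not come from the original data:

```shell
# Two tab-separated rows: an exon with a hypothetical attribute mentioning CDS,
# and a genuine CDS row.
printf 'chr1\tHAVANA\texon\t1\t2\t.\t+\t.\tnote "CDS start";\n'  > fragile.gtf
printf 'chr1\tHAVANA\tCDS\t1\t2\t.\t+\t0\ttag "CCDS";\n'        >> fragile.gtf

# Whole-word grep matches both rows, because "CDS" is a word in the exon's attributes.
w_count=$(grep -cw 'CDS' fragile.gtf)

# The column test matches only the row whose third field is CDS.
awk_count=$(awk -F'\t' '$3 == "CDS" {n++} END {print n+0}' fragile.gtf)

echo "grep -w: $w_count, awk: $awk_count"
```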

A grep solution that checks the third column could look like this (\s, \S and \> require GNU grep):

grep -E '^(\S+\s+){2}CDS\>' infile.txt

Or, in a POSIX-compliant form:

grep -E '^([^[:blank:]]+[[:blank:]]+){2}CDS([[:blank:]]|$)' infile.txt
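
One way to gain some confidence in the grep pattern is to compare its output with the awk result on the same input. This is only a sketch on a two-line sample; check.gtf, by_grep.txt and by_awk.txt are hypothetical file names, and the attribute column is shortened:

```shell
# Tiny tab-separated sample (hypothetical file name, shortened attributes).
printf 'chr1\tHAVANA\texon\t69091\t70008\t.\t+\t.\tccdsid "CCDS30547.1";\n'  > check.gtf
printf 'chr1\tHAVANA\tCDS\t69091\t70005\t.\t+\t0\tccdsid "CCDS30547.1";\n' >> check.gtf

# POSIX grep pattern from above versus the awk column test.
grep -E '^([^[:blank:]]+[[:blank:]]+){2}CDS([[:blank:]]|$)' check.gtf > by_grep.txt
awk '$3 == "CDS"' check.gtf > by_awk.txt

# cmp -s is silent and exits 0 only when the two files are identical.
if cmp -s by_grep.txt by_awk.txt; then echo "same"; else echo "different"; fi
```

Running the same comparison against the real file would show whether the two approaches ever disagree on your data.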