Comparing columns of two text files

Time: 2013-08-02 11:26:32

Tags: perl

I have two text files with data like this:

FILE1.TXT:

contig postion      majorallele minorallele highqualty reliable defin highqualty 
Contig1         479 *   C   0   0   0   0
Contig1         617 T   A   0   0   0   0
Contig15    243 T   C   0   0   0   0
Contig15    471 T   C   0   0   0   0

FILE2.TXT

contig 1 chromosome 0 000000476-044111330
contig 1 chromosome 0 000000477-044111331
contig 1 chromosome 0 000000478-044111332
contig 1 chromosome 0 000000479-044111333
contig 1 chromosome 0 000000480-044111334
contig 1 chromosome 0 000000481-044111335
contig 1 chromosome 0 000000482-044111336
contig 15 chromosome 3 000000242-018378247
contig 15 chromosome 3 000000243-018378248
contig 15 chromosome 3 000000244-018378249
contig 15 chromosome 3 000000245-018378250
contig 15 chromosome 3 000000468-018377016
contig 15 chromosome 3 000000469-018377017
contig 15 chromosome 3 000000470-018377018
contig 15 chromosome 3 000000471-018377019
contig 15 chromosome 3 000000472-018377020
contig 15 chromosome 3 000000473-018377021

What I want to do is compare the first two columns of file1.txt with the first and fifth columns of file2.txt, and produce this output:

contig 1 chromosome 0 000000479-044111333 * C   0   0   0   0
contig 15 chromosome 3 000000243-018378248 T    C   0   0   0   0
contig 15 chromosome 3 000000471-018377019 T    C   0   0   0   0

That is, the matching lines of the two files should be merged in the output.

1 answer:

Answer 0: (score: 0)

You can simply use awk instead of perl.

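awk 'FNR==NR && NR!=1 {   # first file, skipping its header line
  x=tolower($1);          # "Contig1" -> "contig1"
  y=$2;                   # position
  $1=$2="";               # blank the key columns, keep the rest of the line
  a[x""y]=$0;
  next
}
{                         # second file (the file1 header also lands here but matches nothing)
  b=$5;
  gsub(/^0*/,"",b);       # strip leading zeros: "000000479-..." -> "479-..."
  split(b,c,"-");         # c[1] is the position before the hyphen
  if($1$2c[1] in a) print $0,a[$1$2c[1]]
}' file1.txt file2.txt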
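
One caveat about the command above: it builds the lookup key by joining the lowercased contig name and the position with nothing in between, so `Contig1` at position 5479 and `Contig15` at position 479 would both become `contig15479`. Here is a minimal sketch of the same join with a separator in the key, assuming the file layouts shown in the question. The function name `join_contigs` is made up for illustration, and unlike the original it also trims the blanks left by the cleared columns.

```shell
# join_contigs FILE1 FILE2
# Hypothetical helper: same join as above, but the key is (name, position)
# joined with awk's SUBSEP, so distinct pairs cannot collide.
join_contigs() {
    awk '
    FNR == NR {                      # first file: build the lookup table
        if (FNR > 1) {               # skip the header line
            name = tolower($1)       # "Contig1" -> "contig1"
            pos = $2 + 0             # force a plain number
            $1 = $2 = ""             # keep only the allele/quality columns
            sub(/^ +/, "")           # drop the blanks left by the cleared fields
            rest[name, pos] = $0
        }
        next
    }
    {                                # second file: look each line up
        split($5, range, "-")        # "000000479-044111333" -> "000000479"
        pos = range[1] + 0           # numeric conversion strips leading zeros
        if (($1 $2, pos) in rest)
            print $0, rest[$1 $2, pos]
    }' "$1" "$2"
}
```

It would be called as `join_contigs file1.txt file2.txt`. The output columns are separated by single spaces.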

Tested below:

> cat temp1
contig postion      majorallele minorallele highqualty reliable defin highqualty 
Contig1         479 *   C   0   0   0   0
Contig1         617 T   A   0   0   0   0
Contig15    243 T   C   0   0   0   0
Contig15    471 T   C   0   0   0   0
>
> cat temp2
contig 1 chromosome 0 000000476-044111330
contig 1 chromosome 0 000000477-044111331
contig 1 chromosome 0 000000478-044111332
contig 1 chromosome 0 000000479-044111333
contig 1 chromosome 0 000000480-044111334
contig 1 chromosome 0 000000481-044111335
contig 1 chromosome 0 000000482-044111336
contig 15 chromosome 3 000000242-018378247
contig 15 chromosome 3 000000243-018378248
contig 15 chromosome 3 000000244-018378249
contig 15 chromosome 3 000000245-018378250
contig 15 chromosome 3 000000468-018377016
contig 15 chromosome 3 000000469-018377017
contig 15 chromosome 3 000000470-018377018
contig 15 chromosome 3 000000471-018377019
contig 15 chromosome 3 000000472-018377020
contig 15 chromosome 3 000000473-018377021
>
> nawk 'FNR==NR && NR!=1{x=tolower($1);y=$2;$1=$2="";a[x""y]=$0;next}{b=$5;gsub(/^0*/,"",b);split(b,c,"-");if($1$2c[1] in a)print $0,a[$1$2c[1]]}' temp1 temp2
contig 1 chromosome 0 000000479-044111333   * C 0 0 0 0
contig 15 chromosome 3 000000243-018378248   T C 0 0 0 0
contig 15 chromosome 3 000000471-018377019   T C 0 0 0 0
>