Match multiple columns in two files and join the two files

Time: 2016-08-08 01:37:22

Tags: shell join awk

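I want to match multiple columns from two different files. The new output should contain every column of file1, followed by the matching row from file2, or 'null' in each file2 column where there is no match. The files are not sorted. Both files are huge and tab-delimited.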

file1 (only a few columns are shown here; there are 50+ columns):

chromosome    position    reference    alternate    +50 other columns
1              69511         A           G          other columns
1              69897         G           C          other columns

file2 is a database file (8 columns):

#CHROM  POS     ID              REF     ALT     QUAL           FILTER  INFO 

1       69511   rs75062661      A       G       120729371.20    PASS    AC=66446;AC_AFR=3767;AC_AMR=5986;AC_Adj=63799;AC_EAS=7618;AC_FIN=3289;AC_Het=1539;AC_Hom=31130;AC_NFE=30553;AC_OTH=437;AC_SAS=12149;AF=0.894;AN=74318;AN_AFR=6394;AN_AMR=6286;AN_Adj=67892;AN_EAS=7622;AN_FIN=3320;AN_NFE=31460;AN_OTH=452;AN_SAS=12358;BaseQRankSum=0.831;ClippingRankSum=1.06;DB;DP=2687838;FS=23.500;GQ_MEAN=224.54;GQ_STDDEV=255.92;Het_AFR=873;Het_AMR=152;Het_EAS=4;Het_FIN=11;Het_NFE=377;Het_OTH=9;Het_SAS=113;Hom_AFR=1447;Hom_AMR=2917;Hom_EAS=3807;Hom_FIN=1639;Hom_NFE=15088;Hom_OTH=214;Hom_SAS=6018;InbreedingCoeff=0.6382;MQ=31.34;MQ0=0;MQRankSum=-4.020e-01;NCC=29303;QD=26.34;ReadPosRankSum=-1.106e+00;VQSLOD=131.28;culprit=FS;DP_HIST=2375|696|240|284|1521|1069|1274|1579|2061|2600|2780|2580|2302|1874|1363|1096|905|839|814|8907,855|552|218|280|521|865|1246|1574|2056|2596|2775|2575|2300|1873|1362|1094|904|839|814|8871;GQ_HIST=945|1293|469|252|189|82|120|127|109|147|156|181|1403|374|268|384|433|374|482|29371,66|523|454|240|147|77|117|126|106|143|156|175|237|309|261|377|432|374|482|29368;CSQ=G|ENSG00000186092|ENST00000335137|Transcript|missense_variant|421|421|141|T/A|Aca/Gca|rs75062661|1||1|OR4F5|HGNC|14825|protein_coding|YES|CCDS30547.1|ENSP00000334393|OR4F5_HUMAN||UPI0000041BC1|tolerated(0.63)|benign(0.003)|1/1||Transmembrane_helices:Tmhmm&Pfam_domain:PF00001&Pfam_domain:PF10320&PROSITE_profiles:PS50262&Superfamily_domains:SSF81321|ENST00000335137.3:c.421A>G|ENSP00000334393.3:p.Thr141Ala|A:0.3480|G:0.33|G:0.65|G:0.87|G:0.70|G:0.544101|G:0.887429|||||||||||

Desired output:

chromosome    position    reference   alternate  +50 other columns from file1      #CHROM  POS     ID              REF     ALT     QUAL           FILTER   INFO     
1              69511         A         G          other columns                     1      69511   rs75062661      A       G       120729371.20    PASS    AC=66446;AC_AFR=3767;AC_AMR=5986;AC_Adj=63799;AC_EAS=7618;AC_FIN=3289;AC_Het=1539;AC_Hom=31130;AC_NFE=30553;AC_OTH=437;AC_SAS=12149;AF=0.894;AN=74318;AN_AFR=6394;AN_AMR=6286;AN_Adj=67892;AN_EAS=7622;AN_FIN=3320;AN_NFE=31460;AN_OTH=452;AN_SAS=12358;BaseQRankSum=0.831;ClippingRankSum=1.06;DB;DP=2687838;FS=23.500;GQ_MEAN=224.54;GQ_STDDEV=255.92;Het_AFR=873;Het_AMR=152;Het_EAS=4;Het_FIN=11;Het_NFE=377;Het_OTH=9;Het_SAS=113;Hom_AFR=1447;Hom_AMR=2917;Hom_EAS=3807;Hom_FIN=1639;Hom_NFE=15088;Hom_OTH=214;Hom_SAS=6018;InbreedingCoeff=0.6382;MQ=31.34;MQ0=0;MQRankSum=-4.020e-01;NCC=29303;QD=26.34;ReadPosRankSum=-1.106e+00;VQSLOD=131.28;culprit=FS;DP_HIST=2375|696|240|284|1521|1069|1274|1579|2061|2600|2780|2580|2302|1874|1363|1096|905|839|814|8907,855|552|218|280|521|865|1246|1574|2056|2596|2775|2575|2300|1873|1362|1094|904|839|814|8871;GQ_HIST=945|1293|469|252|189|82|120|127|109|147|156|181|1403|374|268|384|433|374|482|29371,66|523|454|240|147|77|117|126|106|143|156|175|237|309|261|377|432|374|482|29368;CSQ=G|ENSG00000186092|ENST00000335137|Transcript|missense_variant|421|421|141|T/A|Aca/Gca|rs75062661|1||1|OR4F5|HGNC|14825|protein_coding|YES|CCDS30547.1|ENSP00000334393|OR4F5_HUMAN||UPI0000041BC1|tolerated(0.63)|benign(0.003)|1/1||Transmembrane_helices:Tmhmm&Pfam_domain:PF00001&Pfam_domain:PF10320&PROSITE_profiles:PS50262&Superfamily_domains:SSF81321|ENST00000335137.3:c.421A>G|ENSP00000334393.3:p.Thr141Ala|A:0.3480|G:0.33|G:0.65|G:0.87|G:0.70|G:0.544101|G:0.887429|||||||||||
1              69897         G         C          other columns                     null   null    null           null    null     null            null    null

The files are not sorted.

This command only gives me the matching rows from file2:

awk -F '\t' 'NR==FNR{c[$1$2$3$4]++;next};c[$1$2$4$5]>0' file1 file2

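I found the join command on this forum, but the examples require typing out every column in order to print all the columns of file1. I have 50+ columns, so typing them all is impractical and error-prone.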
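One caveat about the awk one-liner above: keys built by plain concatenation such as `$1$2$3$4` can collide, because different rows may concatenate to the same string. The sketch below illustrates this with two made-up single-row files (`a.tsv` and `b.tsv` are hypothetical names, and both use columns 1 to 4 as the key for simplicity) and compares it with comma subscripts, which awk joins with `SUBSEP`.

```shell
tmp=$(mktemp -d)

# Two different rows whose four fields concatenate to the same string "112AG".
printf '1\t12\tA\tG\n' > "$tmp/a.tsv"
printf '11\t2\tA\tG\n' > "$tmp/b.tsv"

# Plain concatenation: the keys collide, so the row in b.tsv is reported as a match.
concat_hits=$(awk -F '\t' 'NR==FNR{c[$1$2$3$4]++; next} c[$1$2$3$4]>0' \
    "$tmp/a.tsv" "$tmp/b.tsv" | wc -l | tr -d ' ')

# Comma subscripts join the fields with SUBSEP, so the keys stay distinct.
subsep_hits=$(awk -F '\t' 'NR==FNR{c[$1,$2,$3,$4]; next} ($1,$2,$3,$4) in c' \
    "$tmp/a.tsv" "$tmp/b.tsv" | wc -l | tr -d ' ')

echo "concatenated keys: $concat_hits match(es); SUBSEP keys: $subsep_hits match(es)"
```

With real chromosome and position values such collisions may be rare, but the comma form costs nothing and avoids them.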

2 answers:

Answer 0 (score: 1)

$ cat tst.awk
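BEGIN { FS=OFS="\t" }
# First file on the command line (file2, the database): load it into memory.
NR==FNR {
    if (FNR==1) {
        hdr = $0                 # keep the file2 header line as-is
        gsub(/[^\t]+/,"null")    # turn every header field into "null"
        nulls = $0               # filler row: one "null" per file2 column
    }
    else {
        map[$1,$2,$4,$5] = $0    # key: CHROM, POS, REF, ALT
    }
    next
}
# Second file (file1): append the file2 header, the matching row, or the nulls.
{
    if ( FNR==1 ) {
        tail = hdr
    }
    else if ( ($1,$2,$3,$4) in map ) {
        tail = map[$1,$2,$3,$4]
    }
    else {
        tail = nulls
    }
    print $0, tail
}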


$ awk -f tst.awk file2 file1
chromosome      position        reference       alternate       +50 other columns       #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       69511   A       G       other columns 1       69511   rs75062661      A       G       120729371.20    PASS    AC=66446;AC_AFR=3767;AC_AMR=5986;AC_Adj=63799;AC_EAS=7618;AC_FIN=3289;AC_Het=1539;AC_Hom=31130;AC_NFE=30553;AC_OTH=437;AC_SAS=12149;AF=0.894;AN=74318;AN_AFR=6394;AN_AMR=6286;AN_Adj=67892;AN_EAS=7622;AN_FIN=3320;AN_NFE=31460;AN_OTH=452;AN_SAS=12358;BaseQRankSum=0.831;ClippingRankSum=1.06;DB;DP=2687838;FS=23.500;GQ_MEAN=224.54;GQ_STDDEV=255.92;Het_AFR=873;Het_AMR=152;Het_EAS=4;Het_FIN=11;Het_NFE=377;Het_OTH=9;Het_SAS=113;Hom_AFR=1447;Hom_AMR=2917;Hom_EAS=3807;Hom_FIN=1639;Hom_NFE=15088;Hom_OTH=214;Hom_SAS=6018;InbreedingCoeff=0.6382;MQ=31.34;MQ0=0;MQRankSum=-4.020e-01;NCC=29303;QD=26.34;ReadPosRankSum=-1.106e+00;VQSLOD=131.28;culprit=FS;DP_HIST=2375|696|240|284|1521|1069|1274|1579|2061|2600|2780|2580|2302|1874|1363|1096|905|839|814|8907,855|552|218|280|521|865|1246|1574|2056|2596|2775|2575|2300|1873|1362|1094|904|839|814|8871;GQ_HIST=945|1293|469|252|189|82|120|127|109|147|156|181|1403|374|268|384|433|374|482|29371,66|523|454|240|147|77|117|126|106|143|156|175|237|309|261|377|432|374|482|29368;CSQ=G|ENSG00000186092|ENST00000335137|Transcript|missense_variant|421|421|141|T/A|Aca/Gca|rs75062661|1||1|OR4F5|HGNC|14825|protein_coding|YES|CCDS30547.1|ENSP00000334393|OR4F5_HUMAN||UPI0000041BC1|tolerated(0.63)|benign(0.003)|1/1||Transmembrane_helices:Tmhmm&Pfam_domain:PF00001&Pfam_domain:PF10320&PROSITE_profiles:PS50262&Superfamily_domains:SSF81321|ENST00000335137.3:c.421A>G|ENSP00000334393.3:p.Thr141Ala|A:0.3480|G:0.33|G:0.65|G:0.87|G:0.70|G:0.544101|G:0.887429|||||||||||
1       69897   G       C       other columns null    null    null    null    null    null    null    null

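Answer 1 (score: 1)

With join there is no need to type all 50+ columns by hand; let some code do the typing. Suppose join's -o option needs to print 60 columns from file #2.

Here is one way to build the repetitive string:

seq -s, 2.01 .01 2.60 | sed 's/\.0/./g'

Output (slightly abbreviated):

2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8,2.9,2.10,2.11,2.12, ... 2.58,2.59,2.60

To use the code, enclose it in $() (or assign it to a variable), then pass the result to join's -o option.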
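As a rough sketch of how a generated field list could be combined with join for the four-column key in the question, the example below makes several assumptions that are not in the original answer. `field_list` is a hypothetical helper. The files are tiny header-less stand-ins. `:` is assumed never to occur in the key columns. Because join compares a single field on sorted input, a composite key is prepended first. The same locale (`LC_ALL=C`) is used for both sort and join.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical helper: print "F.A,...,F.B" for join's -o option.
# $1 = file number (1 or 2), $2 = first field, $3 = last field.
field_list() {
    seq "$2" "$3" | sed "s/^/$1./" | paste -sd, -
}

# Tiny header-less stand-ins for file1 and file2 (5 columns each).
printf '1\t69897\tG\tC\textra2\n1\t69511\tA\tG\textra1\n' > "$tmp/f1.tsv"
printf '1\t69511\trs75062661\tA\tG\n' > "$tmp/f2.tsv"

# Prepend a composite key (chromosome:position:ref:alt) and sort on it.
awk 'BEGIN{FS=OFS="\t"} {print $1":"$2":"$3":"$4, $0}' "$tmp/f1.tsv" |
    LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/f1.keyed"
awk 'BEGIN{FS=OFS="\t"} {print $1":"$2":"$4":"$5, $0}' "$tmp/f2.tsv" |
    LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/f2.keyed"

# Field 1 is now the key; fields 2-6 are the original columns of each file.
# -a 1 keeps unmatched file1 rows; -e null fills their missing file2 fields.
LC_ALL=C join -t "$tab" -a 1 -e null \
    -o "$(field_list 1 2 6),$(field_list 2 2 6)" \
    "$tmp/f1.keyed" "$tmp/f2.keyed" > "$tmp/joined.tsv"

cat "$tmp/joined.tsv"
```

Note that the result comes out in key order rather than in the original order of file1, and header lines would need to be handled separately.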