Question

I have 400 tab-delimited text files, each with 6 million lines. The files are formatted as follows:

### input.txt
col1    col2    col3    col4    col5
ID1     str1    234     cond1   0
ID1     str2    567     cond1   0
ID1     str3    789     cond1   1
ID1     str4    123     cond1   1

### file1.txt
col1    col2    col3    col4    col5
ID2     str1    235     cond1   0
ID2     str2    567     cond2   3
ID2     str3    789     cond1   3
ID2     str4    123     cond2   0

### file2.txt
col1    col2    col3    col4    col5
ID3     str1    235     cond1   0
ID3     str2    567     cond2   4
ID3     str3    789     cond1   1

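I am trying to append the values of $1 from the remaining files (file1..filen) to a new column $6 in input.txt, under the following conditions.

Conditions:
1. Columns $2 and $3 together form the key.
2. If the key is found in file1...filen and $5 >= 2 in that row, append the value of $1 from that row to $6 in the input file.

Code: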

awk -F "\t" -v OFS="\t" '!c {
    c=$0"\tcol6";
    next
}
NR==FNR {
    a[$2$3]=$0 "\t";
    next
}
{
    if ($5>=2) {
        a[$2$3]=a[$2$3] $1 ","
    }
}
END {
     print c;
     for (i in a) {
        print a[i]
    }
}' input.txt file1..filen.txt

The output of the code above is as expected:

Output.txt
col1    col2    col3    col4    col5    col6
ID1    str2    567    cond1    0    ID2,ID3,
ID1    str4    123    cond1    1    
ID1    str1    234    cond1    0    
ID1    str3    789    cond1    1    ID2,

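The problem, however, is that the code is very slow, because every key in input.txt has to be checked against 400 files of 6 million lines each. A run takes anywhere from several hours to days. Can anyone suggest a better way to reduce the processing time, either in awk or with another scripting tool?

Any help would save a lot of time.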
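A minimal sketch of one possible tightening, offered as an assumption rather than a measured fix. It stores only keys that exist in input.txt, so the array stays at the size of input.txt instead of growing with every qualifying row of the 400 files. It also builds the key as `$2 FS $3`, so two different column pairs cannot concatenate to the same string, and it skips the header of each additional file. The function name `annotate` is hypothetical. Rows are printed in input order, which differs from the `for (i in a)` order of the code above.

```shell
# Hypothetical helper: annotate INPUT FILE...
# Appends to each INPUT row the $1 values of rows in FILE... that share
# the same ($2, $3) key and have $5 >= 2.
annotate() {
    awk -F '\t' -v OFS='\t' '
    NR == 1   { print $0, "col6"; next }         # header of the input file
    NR == FNR {                                   # remaining rows of the input file
        key = $2 FS $3                            # separator keeps the key unambiguous
        if (!(key in row)) order[++n] = key       # remember input order once per key
        row[key] = $0
        next
    }
    FNR == 1  { next }                            # skip the header of every other file
    ($5 + 0) >= 2 && (($2 FS $3) in row) {        # numeric test, then a lookup that adds no keys
        ids[$2 FS $3] = ids[$2 FS $3] $1 ","
    }
    END {
        for (i = 1; i <= n; i++)
            print row[order[i]], ids[order[i]]
    }' "$@"
}
```

Usage would be along the lines of `annotate input.txt file*.txt > Output.txt`. This is still a single sequential pass over all 400 files, so it mainly limits memory growth and stray output rows; it does not change the amount of data that has to be read.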

Answer 1

input.txt
Sam    string    POS    Zyg    QUAL
WSS    1    125    hom    4973.77
WSS    1    810    hom    3548.77
WSS    1    389    hom    62.74
WSS    1    689    hom    4.12


file1.txt
Sam    string    POS    Zyg    QUAL
AC0    1    478    hom    8.64
AC0    1    583    het    37.77
AC0    1    588    het    37.77
AC0    1    619    hom    92.03

file2.txt
Sam    string    POS    zyg    QUAL
AC1    1    619    hom    89.03
AC1    1    746    hom    17.86
AC1    1    810    het    2680.77
AC1    1    849    het    200.77

awk -F "\t" -v OFS="\t" '!c {
        c=$0"\tcol6";
        next
    }
    NR==FNR {
        a[$2$3]=$0 "\t";
        next
    }
    {
        if ( ($5>=2) && (FNR > 1) ) {
          if ( $2$3 in a ) {
             a[$2$3]=a[$2$3] $1 ",";
          } else {
             print $0 > "Errors.txt";
          }
        }
    }
    END {
         print c;
         for (i in a) {
            print a[i]
        }
    }' input.txt file*

For the input files above, it prints the following output:

AC0,AC1,
WSS    1    389    hom    62.74 
AC1,
WSS    1    810   hom    3548.77    AC1,
WSS    1    689   hom    4.12   
WSS    1    1250      hom    4973.77

It still prints the values of $1 from file1 and file2.

How can this awk code be improved to reduce processing time?
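One direction for reducing wall-clock time, sketched here under stated assumptions, is to scan the files in parallel batches and merge the matches afterwards. Each worker loads the keys of the input file once, scans its batch, and writes only the matching `col2`, `col3`, ID triples; a final cheap pass attaches them to the input rows. The function name `parallel_annotate` and its `JOBS` and `BATCH` parameters are hypothetical. The sketch assumes file names without whitespace and an `xargs` that supports `-P` (GNU and BSD do; POSIX does not require it). The order of IDs within col6 can vary between runs when more than one batch is used.

```shell
# Hypothetical helper: parallel_annotate INPUT JOBS BATCH FILE...
parallel_annotate() {
    input=$1 jobs=$2 batch=$3
    shift 3
    tmp=$(mktemp -d) || return 1

    # Stage 1 program: remember the keys of the input file, then print
    # "col2 <TAB> col3 <TAB> ID" for each matching row with $5 >= 2.
    SCAN='NR == FNR { if (FNR > 1) seen[$2 FS $3]; next }
          FNR > 1 && ($5 + 0) >= 2 && (($2 FS $3) in seen) { print $2, $3, $1 }'
    export SCAN

    # Each worker gets the input file plus one batch of files and appends
    # to its own pairs file, so concurrent workers never share an output.
    printf '%s\n' "$@" |
        xargs -n "$batch" -P "$jobs" sh -c \
            'awk -F "\t" -v OFS="\t" "$SCAN" "$@" >> "$0/pairs.$$"' "$tmp" "$input"

    cat "$tmp"/pairs.* > "$tmp/all"

    # Stage 2: one pass that attaches the collected IDs to the input rows.
    awk -F '\t' -v OFS='\t' '
        pass == 1 { ids[$1 FS $2] = ids[$1 FS $2] $3 ","; next }
        FNR == 1  { print $0, "col6"; next }
                  { print $0, ids[$2 FS $3] }
    ' pass=1 "$tmp/all" pass=2 "$input"

    rm -rf "$tmp"
}
```

A call might look like `parallel_annotate input.txt 8 50 file*.txt > Output.txt`. Larger batches mean the input file is reloaded fewer times, while more jobs use more cores and more memory, since every running worker holds its own copy of the keys. Whether this is faster in practice depends on disk throughput, which the sketch does not address.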

1 answer: