Create a new file from two files using a common (unsorted) column

Time: 2018-04-23 11:00:19

Tags: linux unix join merge

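This is probably a very basic question, but I'm stumped.

I am trying to create a new file from two large tab-delimited files using a common column. The first lines (head) of the two files are:

File 1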

k141_1  319     4       0
k141_2  400     9       0
k141_3  995     43      0
k141_4  670     21      0
k141_5  372     8       0
k141_6  359     9       0
k141_7  483     18      0
k141_8  1826    76      0
k141_9  566     15      0
k141_10 462     14      0

File 2

U       k141_1  0
U       k141_11 0
U       k141_24 0
U       k141_30 0
C       k141_32 2       18      77133,212695,487010,    5444279,5444689,68971626,       TIEYSSLHACRSTLEDPT,     cellular organisms; Bacteria;
C       k141_38 1566886 16      1566886,        50380646,       ELVMDREAWCAAIHGV,       cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium sp. WCM 7299;
U       k141_46 0
C       k141_57 186802  23      1496,1776046,1776047,   64601048,64601468,64601628,64603689,64604310,64605360,71436886,71436980,71437249,71437272,71437295,     CLLYTSDAADDLLCVDLGGRRII,        cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales;
U       k141_64 0
C       k141_73 131567  14      287,305,1496,2209,1483596,      47871795,47873311,47873322,47880313,47880625,53485494,53485498,62558724,71434583,71434608,      LSRGLGDVYKRQIL,SCLVGSEMCIRDRY,YLSLIHISEPTRQE,   cellular organisms;

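I want the new file to contain all 4 columns of file 1 plus column 8 of file 2 (the taxonomy information, separated by semicolons).

I tried sorting both files on the common column, but the two outputs do not come out in the same order, even though the columns contain exactly the same values.

For example,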

[user@compute02 Contigs]$ sort -k 1 file1 | head
k141_1000       312     253     0
k141_1001       553     13      0
k141_1002       518     19      0
k141_1003       812     30      0
k141_1004       327     13      0
k141_1005       454     18      0
k141_100        595     20      0
k141_1006       1585    78      0
k141_1007       537     23      0
[user@compute02 Contigs]$ sort -k 2 file2 | head
U       k141_1  0
C       k141_1000       305     26      305,    62554095,62558735,      PVSYTHLRAHETRGNLVCRLLLEKKK,     cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Burkholderiaceae; Ralstonia; Ralstonia solanacearum;
C       k141_1001       946362  11      946362, 5059526,        SGRNGLPLKVR,    cellular organisms; Eukaryota; Opisthokonta; Choanoflagellida; Craspedida; Salpingoecidae; Salpingoeca; Salpingoeca rosetta;
C       k141_1002       131567  15      287,305,2209,1483596,   47870166,47873029,47873592,53485045,55518854,62558495,  RTCLLYTSPSPRDKR,NLSLIHISEPTRQEA,EPVSYTHLRAHETRG,        cellular organisms;
C       k141_100        2       14      287,1496,1776047,       53544868,64603691,71437007,     SRSSAASDVYKRQV, cellular organisms; Bacteria;
U       k141_1003       0
C       k141_1004       2       14      518,1776046,1776047,    28571314,64603094,64605737,     LFFFNDTATTEIYT, cellular organisms; Bacteria;
U       k141_1005       0
C       k141_1006       948     13      948,    73024016,       QAPLSMGFSRQEY,  cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Anaplasmataceae; Anaplasma; phagocytophilum group; Anaplasma phagocytophilum;
C       k141_1007       287     14      287,    50594737,       RRQRQMCIRDRVGS, cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales; Pseudomonadaceae; Pseudomonas; Pseudomonas aeruginosa group; Pseudomonas aeruginosa;

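Any help is much appreciated :)

1 answer:

Answer 0: (score: 0)

This solution should work. It loops over the IDs in column 1 of file1.txt and looks each one up in both files.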

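# Loop over the IDs in column 1 of file1.txt (both files are tab-delimited).
cut -f1 file1.txt | while IFS= read -r id
do
    # Row of file1.txt whose first column is exactly this ID.
    F1=$(awk -F'\t' -v id="$id" '$1 == id' file1.txt)
    # Column 8 (taxonomy) of the matching row in file2.txt; empty for unclassified rows.
    F2=$(awk -F'\t' -v id="$id" '$2 == id {print $8}' file2.txt)
    printf '%s\t%s\n' "$F1" "$F2"
done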
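
As an alternative to rescanning both files for every ID, a single awk pass can do the lookup without sorting either file. The following is a minimal sketch, assuming both files are tab-delimited as described in the question and that the file2 lookup table fits in memory; the function name merge_taxonomy and the output file name merged.txt are hypothetical.

```shell
#!/bin/sh
# Sketch: append column 8 of the file2-style table to each row of the
# file1-style table, matching file1 column 1 against file2 column 2.
# Assumes tab-delimited input; neither file needs to be sorted.
merge_taxonomy() {
    # $1 = lookup file (file2 layout), $2 = main file (file1 layout)
    awk -F'\t' -v OFS='\t' '
        NR == FNR { tax[$2] = $8; next }   # first file: store taxonomy by contig ID
        { print $0, tax[$1] }              # second file: append taxonomy (empty if none)
    ' "$1" "$2"
}
```

It could be called as merge_taxonomy file2.txt file1.txt > merged.txt. Rows of file1.txt that are unclassified (U) or missing in file2.txt get an empty fifth column, and the output keeps the original row order of file1.txt.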