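Compare columns in two files and print the matching values in specific columns

Time: 2019-02-17 20:34:45

Tags: awk

In the following case, I want to find matching values between file1 (columns 8 and 9) and file2 (columns 2 and 3).

If the values in both files are exactly the same, print them as shown in the desired output.

File 1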

31429,36689,313212.5,2334362.5,31429,36679,31308,302412.50 2316512.50
31429,36701,313362.5,2334362.5,31429,36681,31311,2334363,31429
31429,36713,313512.5,2334362.5,31429,36719,31358,303312.50 2316512.50
31429,36749,313962.5,2334362.5,31429,36751,31398,2334362,31429
31429,36809,314712.5,2334362.5,31429,36803,31463,2334361,31429
31429,36821,314862.5,2334362.5,31429,36817,31481,2334363,31429

File 2

3000135825 302412.50 2316512.50
3000135837 302562.50 2316512.50
3000135849 302712.50 2316512.50
3000135861 302862.50 2316512.50
3000135873 303012.50 2316512.50
3000135885 303162.50 2316512.50
3000135897 303312.50 2316512.50
3000135909 303462.50 2316512.50
3000135921 303612.50 2316512.50
3000135933 303762.50 2316512.50
3000135945 303912.50 2316512.50

Desired output

3000135825 302412.50 2316512.50 3667931308 302412.50 2316512.50
3000135897 303312.50 2316512.50 3671931358 303312.50 2316512.50

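I tried the commands below and got the result, but they take far too long because file2 has 3 million lines. To use this code, I first created a temporary file named tmp1 containing columns 5, 6, 8 and 9 from file1.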

awk -F, '{print($5$6,$8,$9)}' file1 > tmp1 

awk 'FNR==NR{a[$2$3]=$0;next}{print $0,a[$2$3]?a[$2$3]:"NA"}' file2 tmp1

4 answers:

Answer 0 (score: 3)

If file1 is much shorter than file2, you can cache the contents of file1 instead.

Something like this (untested):

$ awk -F'[, ]' 'NR==FNR      {a[$8,$9]=$6$7; next}   # is $6$7 the key you want to print?
                # -F"[, ]" because the sample file1 rows put a space, not a comma, between columns 8 and 9
                ($2,$3) in a {print $1,$2,$3,a[$2,$3]}' file1 FS=' ' file2

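Since the values are supposed to match, there is no need to print them a second time. I am not sure what the fourth value printed in the output is, but if it comes from file1, replace `$6$7` with it.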
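One detail worth illustrating is the composite key. The question's command builds its key as `$2$3` (plain concatenation), whereas `a[$8,$9]` joins the two fields with awk's SUBSEP. The sketch below uses made-up values, and the array names `plain` and `sep` are purely illustrative, to show how concatenated keys can collide. With decimal values like the ones in the sample data a collision is unlikely, so treat this as a robustness note rather than a diagnosis.

```shell
# Two made-up rows whose fields differ but concatenate to the same string "123".
collisions=$(printf '%s\n' '12 3' '1 23' | awk '
{
  plain[$1 $2]++   # concatenation: both rows become the key "123"
  sep[$1, $2]++    # comma subscript: fields are joined with SUBSEP, so keys stay distinct
}
END {
  np = 0; for (k in plain) np++
  ns = 0; for (k in sep) ns++
  print np, ns     # number of distinct keys under each scheme
}')
printf '%s\n' "$collisions"
```

The first number counts distinct concatenated keys and the second counts distinct SUBSEP keys; a smaller first number means two different rows were merged into one entry.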

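Answer 1 (score: 1)

For speed, I would consider:

1 - Use shell string commands wherever possible

2 - Put only the necessary columns in the files

3 - Sort

4 - Store files and output in variables; print and file commands inside a large loop take too long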
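Points 2 and 3 could be sketched roughly as follows. This is only one possible reading of the advice: the answer does not name any tool, so the use of `sort` and `join` here is an assumption, as are the file names `demo_file1`, `demo_file2`, `keys1`, `keys2` and the `:` used to glue the two key columns together. The sample rows are taken from the question. Whether this beats the in-memory awk lookup on 3 million lines depends on the data and the machine, so it is worth timing both.

```shell
# Hypothetical demo files in a scratch directory (names are illustrative only).
workdir=$(mktemp -d)
printf '%s\n' \
  '31429,36689,313212.5,2334362.5,31429,36679,31308,302412.50 2316512.50' \
  '31429,36701,313362.5,2334362.5,31429,36681,31311,2334363,31429' \
  '31429,36713,313512.5,2334362.5,31429,36719,31358,303312.50 2316512.50' \
  > "$workdir/demo_file1"
printf '%s\n' \
  '3000135825 302412.50 2316512.50' \
  '3000135837 302562.50 2316512.50' \
  '3000135897 303312.50 2316512.50' \
  > "$workdir/demo_file2"

# Point 2: keep only the needed columns, with "col8:col9" (or "col2:col3") as a single key.
awk -F'[, ]' '{print $8 ":" $9, $6 $7}' "$workdir/demo_file1" |
  LC_ALL=C sort -k1,1 > "$workdir/keys1"
awk '{print $2 ":" $3, $1}' "$workdir/demo_file2" |
  LC_ALL=C sort -k1,1 > "$workdir/keys2"

# Point 3: with both inputs sorted on the key, join can merge them in one pass.
# join prints: key, id from demo_file2, $6$7 from demo_file1.
result=$(LC_ALL=C join "$workdir/keys2" "$workdir/keys1" |
  awk '{split($1, k, ":"); print $2, k[1], k[2], $3, k[1], k[2]}')
printf '%s\n' "$result"
rm -rf "$workdir"
```

Note that the output comes back in key order rather than in the original order of file2, which may or may not matter for the use case.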

Answer 2 (score: 1)

Could you please try the following.

awk 'FNR==NR{a[$8 OFS $9]=$6 $7 OFS $8 OFS $9;next} (($2 OFS $3) in a){print $0,a[$2 OFS $3]}' FS="[, ]"  Input_file1 FS=" " Input_file2

Here is the same solution in multi-line form.

awk '
FNR==NR{
  a[$8 OFS $9]=$6 $7 OFS $8 OFS $9
  next
}
(($2 OFS $3) in a){
  print $0,a[$2 OFS $3]
}
' FS="[, ]"  Input_file1 FS=" "  Input_file2

Explanation: an annotated version of the code above.

awk '
FNR==NR{                              ##FNR==NR is TRUE only while the first file, Input_file1, is being read.
  a[$8 OFS $9]=$6 $7 OFS $8 OFS $9    ##Create an array named a whose index is $8 OFS $9 and whose value is $6 $7 OFS $8 OFS $9.
  next                                ##next is a built-in awk keyword; it skips the remaining statements for the current line.
}
(($2 OFS $3) in a){                   ##Reached only while the second file, Input_file2, is being read. Check whether $2 OFS $3 is present as an index in array a.
  print $0,a[$2 OFS $3]               ##Print the current line followed by the value of array a at index $2 OFS $3.
}                                     ##Close the block for the condition above.
' FS="[, ]" Input_file1 FS=" " Input_file2        ##Set FS to comma OR space for Input_file1, then set FS to space (default whitespace splitting) for Input_file2.

Answer 3 (score: 1)

Since you are concerned about performance, try this Perl solution.

$ perl -lne 'BEGIN{@x=map{chomp;@k=split(/[ ,]/,$_);$kv{"$k[-2] $k[-1]"}="$k[-4]$k[-3]"} qx(cat file1.txt)} /(\S+) (\S+)$/ and $kv{$&} and print $_," ",$kv{$&}, " ",$& ' file2.txt
3000135825 302412.50 2316512.50 3667931308 302412.50 2316512.50
3000135897 303312.50 2316512.50 3671931358 303312.50 2316512.50

$