Question

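I am trying to understand the following code, which is run from Bash and uses awk to pull out the lines that overlap across multiple files.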

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) 
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }' file[a-d]

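Once I understand what each sub-block of the code is doing, I would like to extend it to look for overlaps in specific fields rather than in whole lines. For example, I tried changing the line: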

n = split(rec[R], t, "/")

to

n = split(rec[R$1], t, "/")

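to find the lines whose first field is the same across all files, but this did not work. Eventually I would like to extend this to check whether a line has the same fields 1, 2, and 4, and then print that line.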

Specifically, for the files mentioned in the example at the link: if file 1 is:

chr1    31237964    NP_055491.1    PUM1    M340L
chr1    33251518    NP_037543.1    AK2    H191D

and file 2 is:

chr1    116944164    NP_001533.2    IGSF3    R671W
chr1    33251518    NP_001616.1    AK2    H191D
chr1    57027345    NP_001004303.2    C1orf168    P270S

I would like the output to be:

file1/file2 --> chr1    33251518    AK2    H191D

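I found this code at the following link: http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738. Specifically, I would like to understand what R, rec, n, dup, and D represent in terms of the files themselves. This is not clear to me from the comments provided, and the printf statements I added inside the loops failed.

Thank you very much for any insight into this!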

Answer 1

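The script works by building an auxiliary array whose indices are the lines of the input files ($0, used as rec[$0]) and whose values are filename1/filename3/..., the names of the files in which the given line $0 is present. You can modify it to use only $1, $2 and $4 as follows: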

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) {
        split(R,R1R2R4,SUBSEP)
        dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
          sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
      }
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the partial content of the current record
  # (special concatenation of $1, $2 and $4)
  # concatenating the filenames separated by / as values
  rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
  }' file[a-d]

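This solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special awk syntax concatenates the indices with the SUBSEP character, which is non-printable by default ("\034" to be precise), so it is very unlikely to appear in any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=.... Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script and finish with the END block.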
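As a minimal sketch of the SUBSEP behaviour described above (the array name key, the helper variables explicit and part, and the sample line are chosen here for illustration only), the following checks that a comma subscript and an explicit SUBSEP concatenation produce the same index, and that split on SUBSEP recovers the fields:

```shell
# Demo input: one whitespace-separated record, taken from the question's file 1.
line='chr1 33251518 NP_037543.1 AK2 H191D'

out=$(printf '%s\n' "$line" | awk '{
  key[$1,$2,$4] = 1                      # comma subscript
  explicit = $1 SUBSEP $2 SUBSEP $4      # the same index, spelled out
  for (k in key) {
    n = split(k, part, SUBSEP)           # recover the printable fields
    printf "%s %d %s|%s|%s\n", (k == explicit ? "same" : "different"), n, part[1], part[2], part[3]
  }
}')
printf '%s\n' "$out"    # prints: same 3 chr1|33251518|AK2
```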

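The first part of the code also has to change: for (R in rec) now loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. That is fine for indexing, but you need to split R at the SUBSEP characters to get the printable fields $1, $2, $4 back. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect, we are doing sprintf ...%s,...,$1,$2,$4 with the pre-saved fields $1, $2, $4. For your input example this prints

records found in 2 files:

    foo11.inp1/foo11.inp2 -->    chr1    33251518    AK2

Note that the output lacks H191D, and rightly so: it is not in field 1, 2 or 4 (it is in field 5), so there is no guarantee that it is the same in the files that are printed! You probably do not want to print it, or in any case you have to specify how to treat the columns that are not checked between files (and so may differ).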
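One possible way to handle the unchecked column, assuming that field 5 should also be required to match (that requirement is an assumption here, not something the question states), is to add $5 to the index. The sketch below recreates the question's two files in a temporary directory; the simplified output format and the array name f are illustrative choices:

```shell
dir=$(mktemp -d)
printf 'chr1\t31237964\tNP_055491.1\tPUM1\tM340L\nchr1\t33251518\tNP_037543.1\tAK2\tH191D\n' > "$dir/file1"
printf 'chr1\t116944164\tNP_001533.2\tIGSF3\tR671W\nchr1\t33251518\tNP_001616.1\tAK2\tH191D\nchr1\t57027345\tNP_001004303.2\tC1orf168\tP270S\n' > "$dir/file2"

out=$(cd "$dir" && awk '
# main block first: index by fields 1, 2, 4 and (assumption) 5
{ rec[$1,$2,$4,$5] = rec[$1,$2,$4,$5] ? rec[$1,$2,$4,$5] "/" FILENAME : FILENAME }
END {
  for (R in rec)
    if (split(rec[R], t, "/") > 1) {     # index seen in more than one file
      split(R, f, SUBSEP)                # recover the four fields
      printf "%s --> %s\t%s\t%s\t%s\n", rec[R], f[1], f[2], f[3], f[4]
    }
}' file1 file2)
printf '%s\n' "$out"
rm -r "$dir"
```

With these two files only the AK2 record shares all four fields, so a single file1/file2 line is reported. If field 5 is allowed to differ between files, this approach does not apply and a rule for choosing which value to print is still needed.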

Some explanation of the original code:

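rec is an array whose indices are full lines of input and whose values are the slash-separated list of files in which those lines appear. For instance, if file1 contains a line "foo bar", then initially rec["foo bar"]=="file1". If file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there is no check for multiplicity, so if file1 contains this line twice, you will eventually get rec["foo bar"]=="file1/file1/file2" and obtain 3 as the number of files containing this line.
R goes over the indices of the array rec after it has been fully built. This means that R will eventually take the value of each unique line of every input file, allowing us to loop over rec[R], which contains the names of the files in which that specific line R was present.
n is the return value of split, which splits the value of rec[R] (that is, the list of filenames corresponding to line R) at each slash. The array t ends up filled with the list of files, but we do not make use of it; we only use the length of the array t, i.e. the number of files in which line R is present (this is saved in the variable n). If n==1 we do nothing; we only act when there are multiplicities.
The code that uses n creates classes according to the multiplicity of a given line. n==2 applies to lines that are present in exactly 2 files, n==3 to those that appear three times, and so on. What this loop does is build an array dup which, for every multiplicity class (i.e. for every n), holds the output strings "filename1/filename2/... --> R", separated by RS (the record separator), for each value of R that appears n times in total in the files. So in the end dup[n] for a given n will contain a number of strings of the form "filename1/filename2/... --> R", concatenated with the RS character (a newline by default).
The loop over D in dup then goes through the multiplicity classes (i.e. the valid values of n larger than 1) and prints, for each D, the output lines gathered in dup[D]. Since we only defined dup[n] for n>1, D starts from 2 if there are multiplicities (or, if there are none, dup is empty and the loop over D does nothing).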
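The multiplicity caveat mentioned above can be avoided by recording each line at most once per file. This is a sketch of one way to do that, not part of the original script: the guard array seen and the two small test files are introduced here for illustration.

```shell
dir=$(mktemp -d)
printf 'foo bar\nfoo bar\nbaz\n' > "$dir/file1"   # "foo bar" appears twice in file1
printf 'foo bar\n' > "$dir/file2"

out=$(cd "$dir" && awk '
!seen[FILENAME, $0]++ {     # true only the first time this line occurs in this file
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
}
END {
  for (R in rec) {
    n = split(rec[R], t, "/")            # n = number of distinct files holding R
    if (n > 1) printf "%d %s --> %s\n", n, rec[R], R
  }
}' file1 file2)
printf '%s\n' "$out"    # prints: 2 file1/file2 --> foo bar
rm -r "$dir"
```

Without the seen guard, the same input would give rec["foo bar"]=="file1/file1/file2" and a count of 3. Note that, as in the original script, filenames that themselves contain a slash would confuse the split.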

Answer 2

First, you need to understand the 3 blocks in an AWK script:

BEGIN{
# code that is executed once, before data processing starts
}

{
# block without a name (default/main block)
# executed per line of input
# $0 contains all line data/columns
# $1 first column
# $2 second column, and so on..
}

END{
# code that is executed once, after all data processing has finished
}
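As a small illustration of when each of the three blocks runs (the three-line input and the variable sum are made up for this sketch):

```shell
out=$(printf 'a 1\nb 2\nc 3\n' | awk '
BEGIN { sum = 0 }            # runs once, before any input is read
{ sum += $2 }                # runs for every input line
END { print NR, sum }        # runs once, after the last line
')
printf '%s\n' "$out"    # prints: 3 6
```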

So you probably need to edit this part of the script:

  {  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }
