Question

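I want to parse InterProScan results for use with the topGO R package.

I would like to end up with a file in a format that differs only slightly from the one I have now.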

# input file (gene_ID  GO_ID1, GO_ID2, GO_ID3, ....)
Q97R95  GO:0004349, GO:0005737, GO:0006561
Q97R95  GO:0004349, GO:0006561
Q97R95  GO:0005737, GO:0006561
Q97R95  GO:0006561


# desired output (removed duplicates and rows collapsed)
Q97R95  GO:0004349,GO:0005737,GO:0006561

You can test your tool against the full data file here:

https://drive.google.com/file/d/0B8-ZAuZe8jldMHRsbGgtZmVlZVU/view?usp=sharing

Answer 1

You can use GNU awk's two-dimensional arrays (arrays of arrays, which need gawk 4.0 or later):

awk -F'[, ]+' '{for(i=2;i<=NF;i++)r[$1][$i]}
         END{for(x in r){
                printf "%s ",x;b=0;
                for(y in r[x]){printf "%s%s",(b?",":""),y;b=1}
                print ""}
         }' file

It gives:

Q97R95 GO:0005737,GO:0006561,GO:0004349

Duplicate fields are removed, but the original order is not preserved.
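
If the order matters, something along these lines might work instead. It is only a sketch: it uses an index array to remember the first-seen order of keys and a `seen` lookup to skip duplicate values, and it sticks to POSIX awk features so gawk is not strictly required. The function name `collapse_go` is a made-up helper for illustration, and the two-space separator in the output is an assumption based on the desired output shown in the question.

```shell
# Hypothetical helper: collapse rows by column 1, de-duplicating values
# while keeping keys and values in first-seen order.
collapse_go() {
  awk -F'[, \t]+' '
    NF == 0 { next }                                   # skip blank lines
    !($1 in vals) { keys[++n] = $1; vals[$1] = "" }    # record key order once
    {
      for (i = 2; i <= NF; i++) {
        if ($i == "" || ($1, $i) in seen) continue     # skip empties and duplicates
        seen[$1, $i] = 1                               # mark this (key, value) pair
        vals[$1] = (vals[$1] == "" ? $i : vals[$1] "," $i)
      }
    }
    END { for (k = 1; k <= n; k++) print keys[k] "  " vals[keys[k]] }
  ' "$@"
}
```

Call it as `collapse_go file` (or pipe data into it). Unlike the `for (x in r)` loops above, the output order here does not depend on awk's internal array ordering, at the cost of holding all keys and values in memory until the end.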

Answer 2

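Here's a hopefully tidy Perl solution. It preserves the order of keys and values as far as possible, and it doesn't keep the whole file in memory, only as much as it needs to get the job done. Note that it only merges rows whose keys sit on consecutive lines, so the input should already be grouped (for example, sorted) by the first column.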

#!perl
use strict;
use warnings;

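# Start with an empty key so the first comparison does not warn
# about an uninitialized value.
my $prev_key = '';
my (@seen_values, %seen_values);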

while (<>) {
  # Parse the input
  chomp;
  my ($key, $values) = split /\s+/, $_, 2;
  my @values = split /,\s*/, $values;

  # If we have a new key...
  if ($key ne $prev_key) {
    # output the old data, as long as there is some,
    if (@seen_values) {
      # join with a bare comma to match the desired output
      print "$prev_key\t", join(",", @seen_values), "\n";
    }
    # clear it out,
    @seen_values = ();
    %seen_values = ();
    # and remember the new key for next time.
    $prev_key = $key;
  }

  # Merge this line's values with previous ones, de-duplicating
  # but preserving order.
  for my $value (@values) {
    push @seen_values, $value unless $seen_values{$value}++;
  }
}

# Output what's left after the last line
if (@seen_values) {
  print "$prev_key\t", join(",", @seen_values), "\n";
}

Collapse rows based on column 1
