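Count the occurrences of each unique ID

Time: 2012-06-28 15:51:19

Tags: perl awk

I am new to the command line. I have a long text file (samp.txt) that contains space-delimited columns. Help with awk / sed / perl is appreciated.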

Id           Pos Re   Va  Cn   SF:R1   SR  He  Ho NC       
c|371443199  22  G     A    R   Pass:8   0   1  0  0       
c|371443199  25  C     A    M   Pass:13  0   0  1  0
c|371443199  22  G     A    R   Pass:8   0   1  0  0        
c|367079424  17  C     G    S   Pass:19  0   0  1  0      
c|371443198  17  G     A    R   Pass:18  0   1  0  0       
c|367079424  17  G     A    R   Pass:18  0   0  1  0 

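I want to count each unique ID (the number of times each unique ID occurs), count column 6 (column 6 = Pass), and count how many He (column 8) and Ho (column 9) there are. I would like a result like this: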

Id            CountId  Countpass   CountHe CountHO
cm|371443199   3        3          2        1
cm|367079424   2        2          0        2

2 answers:

Answer 0: (score: 2)

awk 'NR > 1 {ids[$1]++; pass[$1] = "?"; he[$1] += $8; ho[$1] += $9} END {OFS = "\t"; print "Id", "CountId", "Countpass", "CountHe", "CountHO"; for (id in ids) {print id, ids[id], pass[id], he[id], ho[id]}}' inputfile

Split across multiple lines:

awk 'NR > 1 {               # skip the header line
    ids[$1]++;              # occurrences of each id
    pass[$1] = "?";         # I'm not sure what you want here
    he[$1] += $8;           # sum of the He column
    ho[$1] += $9            # sum of the Ho column
}
END {
    OFS = "\t";
    print "Id", "CountId", "Countpass", "CountHe", "CountHO";
    for (id in ids) {       # note: for-in order is unspecified in awk
        print id, ids[id], pass[id], he[id], ho[id]
    }
}' inputfile
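
The `pass` placeholder can be filled in under one reading of the question: since the expected output shows Countpass equal to CountId, Countpass may simply be the number of rows whose sixth column starts with `Pass`. The following is a minimal sketch under that assumption, not a confirmed answer to what the asker wanted. The helper name `count_ids` and the `order` array (used to print ids in first-seen order) are illustrative choices, and the ids are printed as they appear in the input rather than with the `cm|` prefix shown in the expected output.

```shell
# count_ids is a hypothetical helper; it reads the table from a file or stdin.
count_ids() {
    awk '
    NR > 1 && NF {
        if (!($1 in count)) order[++n] = $1   # remember first-seen order
        count[$1]++
        if ($6 ~ /^Pass/) pass[$1]++          # assumption: Countpass = rows marked Pass
        he[$1] += $8
        ho[$1] += $9
    }
    END {
        OFS = "\t"
        print "Id", "CountId", "Countpass", "CountHe", "CountHO"
        for (i = 1; i <= n; i++) {
            id = order[i]
            print id, count[id], pass[id] + 0, he[id] + 0, ho[id] + 0
        }
    }' "$@"
}

# The six sample rows from the question, unchanged.
sample='Id           Pos Re   Va  Cn   SF:R1   SR  He  Ho NC
c|371443199  22  G     A    R   Pass:8   0   1  0  0
c|371443199  25  C     A    M   Pass:13  0   0  1  0
c|371443199  22  G     A    R   Pass:8   0   1  0  0
c|367079424  17  C     G    S   Pass:19  0   0  1  0
c|371443198  17  G     A    R   Pass:18  0   1  0  0
c|367079424  17  G     A    R   Pass:18  0   0  1  0'

printf '%s\n' "$sample" | count_ids
```

With a real file, the same function can be called as `count_ids samp.txt`.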

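Answer 1: (score: 1)

There appears to be a typo in your input: you put ...98 where you meant ...99. Assuming that is the case, the rest of your information and your expected output make sense.

Use an array to store the ids so that their original order is preserved.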

use strict;
use warnings;
use feature 'say';    # to enable say()

my $hdr = <DATA>;  # remove header
my %hash;
my @keys;
while (<DATA>) {
    my ($id,$pos,$re,$va,$cn,$sf,$sr,$he,$ho,$nc) = split;
    $id =~ s/^c\K/m/;
    $hash{$id}{he} += $he;
    $hash{$id}{ho} += $ho;
    $hash{$id}{pass}{$sf}++;
    $hash{$id}{count}++;
    push @keys, $id if $hash{$id}{count} == 1;
}
say join "\t", qw(Id CountId Countpass CountHe CountHO);
for my $id (@keys) {
    say join "\t", $id,
        $hash{$id}{count},                   # occurrences of id
        scalar keys %{ $hash{$id}{pass} },   # the number of unique passes
        @{$hash{$id}}{qw(he ho)};
}


__DATA__
Id           Pos Re   Va  Cn   SF:R1   SR  He  Ho NC       
c|371443199  22  G     A    R   Pass:8   0   1  0  0       
c|371443199  25  C     A    M   Pass:13  0   0  1  0
c|371443199  22  G     A    R   Pass:8   0   1  0  0        
c|367079424  17  C     G    S   Pass:19  0   0  1  0      
c|371443198  17  G     A    R   Pass:18  0   1  0  0       
c|367079424  17  G     A    R   Pass:18  0   0  1  0 

Output:

Id      CountId Countpass       CountHe CountHO
cm|371443199    3       2       2       1
cm|367079424    2       2       0       2
cm|371443198    1       1       1       0

Note: I made the output tab-delimited to make post-processing easier. If you want it to look pretty, you can use printf to get fixed-width fields.
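
As a rough illustration of the fixed-width idea, the tab-delimited output can be piped through awk's `printf`. This is only a sketch: the helper name `pretty_print` and the column widths (16 and 10) are arbitrary choices, not part of the original answer.

```shell
# pretty_print is a hypothetical helper; it pads the first four
# tab-separated fields to fixed widths and prints the fifth as-is.
pretty_print() {
    awk -F '\t' '{ printf "%-16s%-10s%-10s%-10s%s\n", $1, $2, $3, $4, $5 }' "$@"
}

# Two lines taken from the tab-delimited output above.
printf 'Id\tCountId\tCountpass\tCountHe\tCountHO\ncm|371443199\t3\t2\t2\t1\n' | pretty_print
```

The `-` in `%-16s` left-aligns each field; dropping it would right-align the values instead.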