Removing duplicates in Unix when the duplicated values are in different columns and in a different order

Asked: 2016-08-20 23:25:13

Tags: perl unix

AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161

The problem is that line 4

AB006589__ESR2,NM_003476__CSRP3,0.45767

is a duplicate of line 8, with the first two fields swapped

NM_003476__CSRP3,AB006589__ESR2,0.45767

There are many cases like this in my large CSV file.

So my question is how to identify all such duplicates and remove one line of each pair.

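use strict;
use warnings;

my %hash;

# Three-argument open with a lexical filehandle; die if the file cannot be read.
open(my $tf, '<', 'tf_tf_mic.csv') or die "Cannot open tf_tf_mic.csv: $!";

while ( <$tf> ) {
    chomp;
    my @words = split /,/, $_;
    # Store the line only if the pair has not been seen in either order.
    unless ( exists $hash{"$words[0]\t$words[1]"} || exists $hash{"$words[1]\t$words[0]"} ) {
        $hash{"$words[0]\t$words[1]"} = $_;
    }
}
close $tf;

# Note: keys() returns the entries in no particular order.
foreach ( keys %hash ) {
    print "$hash{$_}\n";
}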

This actually works: for a 4-million-line file it finishes within 10 seconds.

2 answers:

Answer 0: (score: 1)

You can reorder the fields of each line before putting it into the hash:

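  1. Split each line into fields on commas and drop the trailing score: my @fields = split /,/; pop @fields;
  2. Sort the remaining fields: @fields = sort @fields;
  3. Join the sorted fields into a new string: my $str = join "\t", @fields;
  4. Store the line only if the new string is not already in the hash: $hash{$str} = $_ unless exists $hash{$str};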
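The four steps above can be put together as a minimal sketch. The helper name dedupe_pairs and the @kept array are not from the answer; they are assumptions added here so that the first occurrence of each pair is kept and the input order is preserved.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): returns the input lines with
# swapped-pair duplicates removed, keeping the first occurrence of each pair.
sub dedupe_pairs {
    my @lines = @_;
    my ( %hash, @kept );
    for my $line (@lines) {
        my @fields = split /,/, $line;    # 1. split on commas
        pop @fields;                      #    drop the trailing score
        @fields = sort @fields;           # 2. sort the two names
        my $str = join "\t", @fields;     # 3. order-independent key
        next if exists $hash{$str};       # 4. skip pairs already seen
        $hash{$str} = $line;
        push @kept, $line;                # remember input order
    }
    return @kept;
}

my @sample = (
    'AB006589__ESR2,NM_003476__CSRP3,0.45767',
    'NM_003476__CSRP3,AB006589__ESR2,0.45767',
    'AB006589__ESR2,BC024181__ESR2,0.47796',
);
print "$_\n" for dedupe_pairs(@sample);
```

With the three sample lines, the second line is dropped because its sorted key matches the first. Because the key is sorted, one hash lookup per line is enough, instead of the two lookups in the question's code.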

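Answer 1: (score: 1)

There's no need for this complication. If you sort the fields in each record so that any given pair of values always appears in the same order, then you can simply print a record if its contents haven't been seen before.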

use strict;
use warnings 'all';

my %seen;

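while ( <DATA> ) {
    # Build the key from the two name fields only. Sorting all three fields
    # would place the numeric score first, since digits sort before letters.
    my @names = ( split /,/ )[0,1];
    print unless $seen{ join ',', sort @names }++;
}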


__DATA__
AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161

Output

AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161