Question

I have been looking for a good way to do this, but unfortunately I could not find one.

I am working with files in this format:

=Cluster=
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22491.xml; spectrum = 1074 true
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 2950 true

=Cluster=
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 1876 true
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 3479 true
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 3785 true

=Cluster=
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22493.xml; spectrum = 473 true
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22493.xml; spectrum = 473 true

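As you can see, every SPEC line is different except in the last cluster, where the spectrum number is repeated. What I would like to do is take each block of lines between the =Cluster= markers and check whether it contains lines with a repeated spectrum value. If several lines repeat the same value, all but one of them should be removed.

The output file should look like this:

=Cluster=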
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22491.xml; spectrum = 1074 true
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 2950 true

=Cluster=
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 1876 true
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 3479 true
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22498.xml; spectrum = 3785 true

=Cluster=
  SPEC PRD000681; PRIDE_Exp_Complete_Ac_22493.xml; spectrum = 473 true

I used the following script to split the file on the pattern, but I do not know how to check for repeated spectra.

#!/usr/bin/perl

undef $/;
$_ = <>;
$n = 0;

for $match (split(/(?==Cluster=)/)) {
      open(O, '>temp' . ++$n);
      print O $match;
      close(O);
}

PS: I use Perl because it is easier for me, but I also understand Python.

Answer 1

Something like this will remove duplicate lines (across the whole file).

#!/usr/bin/perl

use warnings;
use strict;

my %seen; 

while ( <> ) {
  next if ( m/SPEC/ and $seen{$_}++ );
  print;
}

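If you want to key specifically on the spectrum value, use something like:

next if ( m/spectrum\s*=\s*(\d+)/ and $seen{$1}++ );

You can do something very similar while splitting on clusters; just add: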

  if ( $line =~ m/=Cluster=/ ) { 
     open ( $output, ">", "temp".$count++ ); 
     select $output;
  }

This sets the default destination for print to $output (you also need to declare it outside the loop).

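You should also:

use strict; use warnings;
Avoid slurping <> into $_; it is unnecessary. If you do have to slurp, $block = do { local $/; <> }; is usually better, followed by $block =~ m/regex/.
Use lexical filehandles: open ( my $output, '>', 'filename' ) or die $!;
Check the return value of open (or die $! is usually enough).

That gives something like: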

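#!/usr/bin/perl

use warnings;
use strict;

my %seen;        # spectrum values already printed in the current cluster
my $count = 0;   # number used in the next output file name
my $output;      # current output filehandle, declared outside the loop

while (  <> ) {
  # Skip a line whose spectrum value already appeared in this cluster.
  next if ( m/spectrum\s*=\s*(\d+)/ and $seen{$1}++ );
  if ( m/=Cluster=/ ) { 
     # New cluster: forget the previous spectra and switch to a new file.
     %seen = ();
     open ( $output, ">", "temp".$count++ ) or die $!; 
     select $output;
  }
  print;
}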
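
Since the question says Python is also acceptable, the same per-cluster check can be sketched in Python. This is only an illustration under a few assumptions: the function name dedupe_spectra is made up here, the header line is taken to start with =Cluster=, and the regular expression tolerates optional spaces around the equals sign, as in the sample shown in the question. It filters lines and does not write the per-cluster temp files.

```python
import re

# Assumption: the spectrum number follows "spectrum", "=", with optional spaces.
SPECTRUM_RE = re.compile(r"spectrum\s*=\s*(\d+)")


def dedupe_spectra(lines):
    """Yield lines, skipping any line whose spectrum number
    was already seen inside the current =Cluster= block."""
    seen = set()
    for line in lines:
        if line.startswith("=Cluster="):
            seen = set()  # new cluster: forget earlier spectrum numbers
        match = SPECTRUM_RE.search(line)
        if match:
            if match.group(1) in seen:
                continue  # repeated spectrum value in this cluster
            seen.add(match.group(1))
        yield line
```

Because the set is reset at every header, the same spectrum number may still appear in two different clusters; only repeats within one cluster are dropped.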

Answer 2

You could also use a Python script built on groupby from the itertools module.

I assume your input file is named f_input.txt and your output file is named new_file.txt.

The output file new_file.txt will look like your desired output.
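
A minimal sketch of how groupby could be applied here, offered as an illustration and not as the answerer's original script. The names dedupe_clusters and main are hypothetical, the file names are the ones assumed in this answer, and the regular expression assumes the spectrum number follows an equals sign with optional spaces around it.

```python
import re
from itertools import groupby

# Assumption: the spectrum number follows "spectrum", "=", with optional spaces.
SPECTRUM_RE = re.compile(r"spectrum\s*=\s*(\d+)")


def is_header(line):
    """True for the =Cluster= marker lines."""
    return line.startswith("=Cluster=")


def dedupe_clusters(lines):
    """Return the lines with repeated spectrum values removed per cluster."""
    result = []
    # groupby yields alternating runs of header lines and body lines.
    for header_run, group in groupby(lines, key=is_header):
        if header_run:
            result.extend(group)
            continue
        seen = set()  # spectrum numbers seen in this cluster body
        for line in group:
            match = SPECTRUM_RE.search(line)
            if match:
                if match.group(1) in seen:
                    continue  # repeated spectrum value
                seen.add(match.group(1))
            result.append(line)
    return result


def main(in_path="f_input.txt", out_path="new_file.txt"):
    """Read in_path, drop repeated spectra per cluster, write out_path."""
    with open(in_path) as src:
        cleaned = dedupe_clusters(src)
    with open(out_path, "w") as dst:
        dst.writelines(cleaned)
```

Calling main() with the default arguments reads f_input.txt and writes new_file.txt. Blank lines between clusters carry no spectrum value, so they pass through unchanged.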

Answer 3

If the duplicate lines are consecutive, you can use this Perl one-liner:

perl -ani.back -e 'next if defined($p) && $_ eq $p;$p=$_;print' file.txt

The original file is kept as a backup with the extension .back

Answer 4

The task looks simple enough that Perl or Python is not needed: use the uniq command to remove adjacent duplicate lines:

$ uniq < input.txt > output.txt
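
To make the "adjacent" restriction concrete, here is a small Python sketch of the same idea: it compares each whole line with the one before it. The function name drop_adjacent_duplicates is made up for illustration.

```python
def drop_adjacent_duplicates(lines):
    """Yield lines, skipping a line that is identical to the one before it."""
    previous = None
    for line in lines:
        if line == previous:
            continue  # same as the preceding line: drop it
        previous = line
        yield line
```

Because only neighbouring lines are compared, a duplicate that is separated from its twin by another line survives, and two lines that share a spectrum value but differ elsewhere are both kept.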
