Merging two files with similar columns

Asked: 2012-05-03 18:52:50

Tags: perl

I have two tab-delimited files that I need to align with each other. For example:

File 1:      File 2:
AAA 123      BBB 345
BBB 345      CCC 333
CCC 333      DDD 444

(These are large files, potentially thousands of lines!)

What I would like is for the output to look like this:

AAA 123
BBB 345  BBB 345
CCC 333  CCC 333
         DDD 444

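Ideally I would like to do this in Perl, but I am not sure how. Any help would be greatly appreciated.

4 Answers:

Answer 0: (score: 1)

If it is just a matter of building a data structure, this can be quite easy.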

#!/usr/bin/env perl

# usage: script.pl file1 file2 ...

use strict;
use warnings;

my %data;
while (<>) {
  chomp;
  my ($key, $value) = split;
  push @{$data{$key}}, $value;
}

use Data::Dumper;
print Dumper \%data;

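Then you can output it in whatever format you like. If you really need to keep the files' lines exactly as they are, it is a little trickier.

Answer 1: (score: 0)

Assuming the files are sorted (in string order, since the keys are compared with lt and gt),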
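For the side-by-side layout shown in the question, one possible approach is to keep a separate hash per file and then walk the sorted union of keys. The following is only a minimal sketch, assuming keys are unique within each file and both files fit in memory; the names `read_pairs` and `align_rows` are made up for illustration and are not part of the answer above.

```perl
#!/usr/bin/env perl
# usage: script.pl file1 file2   (hypothetical script name)
use strict;
use warnings;

# Read "key value" lines from a file into a hash (assumes unique keys).
sub read_pairs {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my %pairs;
    while (my $line = <$fh>) {
        my ($key, $value) = split ' ', $line;
        next unless defined $key;        # skip blank lines
        $pairs{$key} = $value // '';     # tolerate a missing value
    }
    close $fh;
    return \%pairs;
}

# Build one tab-separated row per key; a key missing from the left file
# yields two empty leading fields, a key missing from the right file
# simply ends the row early.
sub align_rows {
    my ($left, $right) = @_;
    my %seen;
    my @keys = sort { $a cmp $b }
               grep { !$seen{$_}++ } keys %$left, keys %$right;
    my @rows;
    for my $key (@keys) {
        my @l = exists $left->{$key}  ? ($key, $left->{$key})  : ('', '');
        my @r = exists $right->{$key} ? ($key, $right->{$key}) : ();
        push @rows, join("\t", @l, @r);
    }
    return @rows;
}

# Only run when two file names are given on the command line.
if (@ARGV == 2) {
    my ($left, $right) = map { read_pairs($_) } @ARGV;
    print "$_\n" for align_rows($left, $right);
}
```

Because the keys are sorted after loading, this variant does not require the input files to be sorted, at the cost of holding both files in memory.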

sub get {
   my ($fh) = @_;
   my $line = <$fh>;
   return () if !defined($line);
   return split(' ', $line);
}

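# Assumption: the two (sorted) file names are passed on the command line.
open(my $fh1, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!";
open(my $fh2, '<', $ARGV[1]) or die "Can't open $ARGV[1]: $!";

my ($key1, $val1) = get($fh1);
my ($key2, $val2) = get($fh2);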

while (defined($key1) && defined($key2)) {
   if ($key1 lt $key2) {
       print(join("\t", $key1, $val1), "\n");
       ($key1, $val1) = get($fh1);
   }
   elsif ($key1 gt $key2) {
       print(join("\t", '', '', $key2, $val2), "\n");
       ($key2, $val2) = get($fh2);
   }
   else {
       print(join("\t", $key1, $val1, $key2, $val2), "\n");
       ($key1, $val1) = get($fh1);
       ($key2, $val2) = get($fh2);
   }
}

while (defined($key1)) {
   print(join("\t", $key1, $val1), "\n");
   ($key1, $val1) = get($fh1);
}

while (defined($key2)) {
   print(join("\t", '', '', $key2, $val2), "\n");
   ($key2, $val2) = get($fh2);
}

Answer 2: (score: 0)

As ikegami mentioned, this assumes that the files' contents are arranged as shown in your example.

use strict;
use warnings;

open my $file1, '<', 'file1.txt' or die $!;
open my $file2, '<', 'file2.txt' or die $!;

my $file1_line = <$file1>;
print $file1_line;

while ( my $file2_line = <$file2> ) {
    if( defined( $file1_line = <$file1> ) ) {
        chomp $file1_line;
        print $file1_line;
    }

    my $tabs = $file1_line ? "\t" : "\t\t";
    print "$tabs$file2_line";
}

close $file1;
close $file2;

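Looking back at your example, you show some identical key/value pairs in both files. Given this, it looks like you want to show the pairs unique to file 1, the pairs unique to file 2, and the pairs common to both. If that is the case (and you are not trying to match the files' pairs by either key or value), you can use List::Compare: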

use strict;
use warnings;
use List::Compare;

open my $file1, '<', 'file1.txt' or die $!;
my @file1 = <$file1>;
close $file1;

open my $file2, '<', 'file2.txt' or die $!;
my @file2 = <$file2>;
close $file2;

my $lc = List::Compare->new(\@file1, \@file2);

my @file1Only = $lc->get_Lonly; # L(eft array)only
for(@file1Only) { print }

my @bothFiles = $lc->get_intersection;
for(@bothFiles) { chomp; print "$_\t$_\n" }

my @file2Only = $lc->get_Ronly; # R(ight array)only
for(@file2Only) { print "\t\t$_" }

Answer 3: (score: 0)

Similar to Joel Berger's answer, but this approach lets you keep track of which files contained a given key:

my %data;

while (my $line = <>){
    chomp $line;
    my ($k)          = $line =~ /^(\S+)/;
    $data{$k}{line}  = $line;
    $data{$k}{$ARGV} = 1;
}

use Data::Dumper;
print Dumper(\%data);

Output:

$VAR1 = {
  'CCC' => {
    'other.dat' => 1,
    'data.dat' => 1,
    'line' => 'CCC 333'
  },
  'BBB' => {
    'other.dat' => 1,
    'data.dat' => 1,
    'line' => 'BBB 345'
  },
  'DDD' => {
    'other.dat' => 1,
    'line' => 'DDD 444'
  },
  'AAA' => {
    'data.dat' => 1,
    'line' => 'AAA 123'
  }
};
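To go from a structure like this to the side-by-side layout asked for in the question, one option is to store each file's line under its file name instead of a bare flag, then print the keys in sorted order. This is a minimal sketch, assuming keys are unique within each file and the file names are given on the command line; the helper name `format_rows` is hypothetical.

```perl
use strict;
use warnings;

# Turn { key => { file => line } } into tab-separated rows, one column
# per file, leaving an empty column where a file lacks the key.
sub format_rows {
    my ($data, @files) = @_;
    my @rows;
    for my $key (sort keys %$data) {
        my @cols = map { $data->{$key}{$_} // '' } @files;
        pop @cols while @cols && $cols[-1] eq '';   # drop trailing blanks
        push @rows, join("\t", @cols);
    }
    return @rows;
}

my @files = @ARGV;   # remember the file names before <> consumes @ARGV
my %data;
if (@files) {
    while (my $line = <>) {
        chomp $line;
        my ($k) = $line =~ /^(\S+)/ or next;   # skip blank lines
        $data{$k}{$ARGV} = $line;              # keep the line per file
    }
    print "$_\n" for format_rows(\%data, @files);
}
```

Each file contributes a single column holding its whole line, so a key missing from the first file produces one leading tab rather than two.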