Merge two files based on columns and sort

Time: 2016-07-14 19:02:08

Tags: c linux perl awk merge

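I have two files, FILE1 and FILE2, which have a different number of columns and some columns in common. In both files the first column is a row identifier. I want to merge the two files (FILE1 and FILE2) without changing the order of the columns, and put the value '5' wherever a value is missing.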

For example, FILE1 (the first column is the row ID; A1 is the first row, A2 the second, ...):

A1 1 2 5 1 
A2 0 2 1 1 
A3 1 0 2 2

The column names for FILE1 are (these are specified in another file):

Affy1
Affy3
Affy4
Affy5

That is, the value in row A1, column Affy1 is 1, and the value in row A3, column Affy5 is 2.

     v~~~~~ Affy3
A1 1 2 5 1 
A2 0 2 1 1 
A3 1 0 2 2
   ^~~~ Affy1

The same applies to FILE2:

B1 1 2 0
B2 0 1 1
B3 5 1 1

and its column names:

Affy1
Affy2
Affy3

meaning

     v~~~~~ Affy2
B1 1 2 0
B2 0 1 1
B3 5 1 1
   ^~~~ Affy1

I want to merge the columns based on their names and put a '5' for missing values. The merged result would then look like this:

A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5 
B2 0 1 1 5 5 
B3 5 1 1 5 5

Columns:

Affy1
Affy2
Affy3
Affy4
Affy5

That is,

     v~~~~~~~ Affy2
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5 
B2 0 1 1 5 5 
B3 5 1 1 5 5
   ^~~~ Affy1

In reality, each file has more than 700K columns and more than 2K rows. Thanks in advance!

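2 answers:

Answer 0: (score: 0)

It is difficult to order the headers when some of them appear in only one file. The best way I know is to use the Graph module to build a directed graph, with an edge from each header to the one that follows it in its file, and then sort the elements topologically.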

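Once that is done, it is simply a matter of assigning the values from each file to the correct columns and filling the gaps with 5s.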
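To make the ordering step concrete, here is a minimal sketch of a topological sort (Kahn's algorithm) written with core Perl only. It is not the Graph module's implementation, and `merge_headers` is a hypothetical name chosen for illustration. It assumes the header lists never contradict each other about the relative order of two columns, and dies if they do.

```perl
use strict;
use warnings;

# Hypothetical helper: merge several ordered header lists into one order
# that respects every list (Kahn's topological sort, core Perl only).
sub merge_headers {
    my @lists = @_;

    my ( %succ, %indegree, @seen_order );
    for my $list (@lists) {
        for my $i ( 0 .. $#$list ) {
            my $name = $list->[$i];
            if ( !exists $indegree{$name} ) {
                $indegree{$name} = 0;
                push @seen_order, $name;
            }
            next if $i == 0;
            my $prev = $list->[ $i - 1 ];

            # Count each edge only once so the in-degrees stay correct
            next if $succ{$prev}{$name}++;
            $indegree{$name}++;
        }
    }

    # Start with the headers that nothing precedes, in first-seen order
    my @queue = grep { $indegree{$_} == 0 } @seen_order;
    my @sorted;
    while (@queue) {
        my $name = shift @queue;
        push @sorted, $name;
        for my $next ( sort keys %{ $succ{$name} || {} } ) {
            push @queue, $next if --$indegree{$next} == 0;
        }
    }

    # Leftover headers mean the lists contain a cycle
    die "Header lists disagree about column order\n"
        if @sorted != @seen_order;
    return @sorted;
}

my @merged = merge_headers(
    [qw/ ID Affy1 Affy3 Affy4 Affy5 /],
    [qw/ ID Affy1 Affy2 Affy3 /],
);
print "@merged\n";    # ID Affy1 Affy2 Affy3 Affy4 Affy5
```

For the sample headers the order is fully determined. When two columns are never related by any file, this sketch resolves the tie by first appearance and then by name, which is only one of several valid choices.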

I have incorporated the headers as the first line of each data file, so the program works with this data

file1.txt

ID Affy1 Affy3 Affy4 Affy5
A1 1 2 5 1 
A2 0 2 1 1 
A3 1 0 2 2

file2.txt

ID Affy1 Affy2 Affy3
B1 1 2 0
B2 0 1 1
B3 5 1 1

Here is the code

consolidate_columns.pl

use strict;
use warnings 'all';

use Graph::Directed;

my @files = qw/ file1.txt file2.txt /;

# Make an array of two file handles
#
my @fh = map {
    open my $fh, '<', $_ or die qq{Unable to open "$_" for input: $!};
    $fh;
} @files;

# Make an array of two lists of header names
#
my @file_heads = map { [ split ' ', <$_> ] } @fh;

# Use a directed graph to sort all of the header names so that they're
# still in the order that they were at the top of both files
#
my @ordered_headers = do {

    my $g = Graph::Directed->new;

    for my $f ( 0, 1 ) {
        my $file_heads = $file_heads[$f];
        $g->add_edge($file_heads->[$_], $file_heads->[$_+1]) for 0 .. $#$file_heads-1;
    }

    $g->topological_sort;
};

# Form a hash converting header names to column indexes for output
#
my %ordered_headers = map { $ordered_headers[$_] => $_ } 0 .. $#ordered_headers;

# Print the header and the reformed records from each file. Use the hash to
# convert the header names into column indexes
#
print "@ordered_headers\n";

for my $i ( 0 .. $#fh ) {

    my $fh         = $fh[$i];
    my @file_heads = @{ $file_heads[$i] };
    my @splice     = map { $ordered_headers{$_} } @file_heads;

    while ( <$fh> ) {
        next unless /\S/;

        my @columns;
        @columns[@splice] = split;
        $_ //= 5 for @columns[0 .. $#ordered_headers];

        print "@columns\n";
    }
}

Output

ID Affy1 Affy2 Affy3 Affy4 Affy5
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
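The slice-and-fill step can also be isolated into a small closure, sketched below. `make_remapper` is a hypothetical name, not part of the program above. Unlike that program, this variant fills the row first and then overwrites the known columns, so it assumes every record has exactly as many fields as its file has headers.

```perl
use strict;
use warnings;

# Hypothetical helper: build a function that places one record's fields
# into the merged column layout. The index lookup is done once per file,
# not once per row, which matters with hundreds of thousands of columns.
sub make_remapper {
    my ( $merged, $file_heads, $fill ) = @_;

    # Header name -> output column index
    my %index_of;
    @index_of{@$merged} = 0 .. $#$merged;

    my @target = @index_of{@$file_heads};    # output index of each input field
    my $width  = @$merged;

    return sub {
        # Assumes @_ holds one field per header of this file
        my @row = ($fill) x $width;          # every column starts as the filler
        @row[@target] = @_;                  # overwrite the columns this file has
        return @row;
    };
}

my @merged = qw/ ID Affy1 Affy2 Affy3 Affy4 Affy5 /;
my $remap  = make_remapper( \@merged, [qw/ ID Affy1 Affy3 Affy4 Affy5 /], 5 );

my @row = $remap->( split ' ', 'A1 1 2 5 1 ' );
print "@row\n";    # A1 1 5 2 5 1
```

Because each record is remapped and printed on its own, only one row needs to be held in memory at a time.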

Answer 1: (score: -1)

Just for the fun of it - HTH

#!/usr/bin/perl

use warnings;
use strict;
use constant {A => 1, B => 2, BOTH =>3};

# I don't read data from file
my @columns = qw(Affy1 Affy2 Affy3 Affy4 Affy5);
my @locations = (BOTH, B,    BOTH, A,    A);

my @contentA = (["A1", 1, 2, 5, 1],
                ["A2", 0, 2, 1, 1],
                ["A3", 1, 0, 2, 2]);
my @contentB = (["B1", 1, 2, 0],
                ["B2", 0, 1, 1],
                ["B3", 5, 1, 1]);

# I assume both files have the same number of lines

my @ares  = ();
my @bres = ();
for(my $i = 0; $i < @contentA; ++$i){
  # this uses a lot of memory with huge amounts of data
  # maybe you write this in two temp result files and cat them
  # together at the end
  # another alternative would be to iterate first over
  # file A and then over file B
  my @row_a = ();
  my @row_b = ();
  push @row_a, shift @{$contentA[$i]}; #id
  push @row_b, shift @{$contentB[$i]}; #id
  foreach my $loc (@locations){
    if(A == $loc){
      push @row_a, shift @{$contentA[$i]};
      push @row_b, 5;
    }
    if(B == $loc){
      push @row_a, 5;
      push @row_b, shift @{$contentB[$i]};
    }
    if(BOTH == $loc){
      push @row_a, shift @{$contentA[$i]};
      push @row_b, shift @{$contentB[$i]};
    }
  }
  push @ares, \@row_a;
  push @bres, \@row_b;
}

foreach my $ar(@ares){
  print join " ", @{$ar};
  print "\n";
}

foreach my $br(@bres){
  print join " ", @{$br};
  print "\n";
}

print join("\n", @columns);
print "\n";
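The `@locations` list above is written out by hand. As a rough sketch, it could instead be derived from the two header lists, assuming the merged column order is already known. `locate_columns` is a hypothetical helper name, and the constants mirror the ones used in the script.

```perl
use strict;
use warnings;
use constant { A => 1, B => 2, BOTH => 3 };

# Hypothetical helper: classify each merged column by the file(s) that contain it.
sub locate_columns {
    my ( $columns, $heads_a, $heads_b ) = @_;

    my %in_a = map { $_ => 1 } @$heads_a;
    my %in_b = map { $_ => 1 } @$heads_b;

    my @locations;
    for my $name (@$columns) {
        if    ( $in_a{$name} && $in_b{$name} ) { push @locations, BOTH }
        elsif ( $in_a{$name} )                 { push @locations, A }
        elsif ( $in_b{$name} )                 { push @locations, B }
        else  { die "Column $name is in neither file\n" }
    }
    return @locations;
}

my @columns   = qw(Affy1 Affy2 Affy3 Affy4 Affy5);
my @locations = locate_columns(
    \@columns,
    [qw(Affy1 Affy3 Affy4 Affy5)],    # column names of file A
    [qw(Affy1 Affy2 Affy3)],          # column names of file B
);
print "@locations\n";    # 3 2 3 1 1
```

The printed values correspond to BOTH, B, BOTH, A, A, the same list that the script hard-codes.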