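I have two files, FILE1 and FILE2, which have different numbers of columns and some columns in common. In both files the first column is a row identifier. I want to merge the two files (FILE1 and FILE2) without changing the order of the columns, and put the value '5' wherever a value is missing.
For example, FILE1 (the first column is the row ID; A1 is the first row, A2 the second, ...):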
A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2
The column names of FILE1 (these are specified in a separate file) are
Affy1
Affy3
Affy4
Affy5
That is, the value in row A1, column Affy1 is 1, and the value in row A3, column Affy5 is 2:
v~~~~~ Affy3
A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2
^~~~ Affy1
The same applies to FILE2:
B1 1 2 0
B2 0 1 1
B3 5 1 1
and its column names,
Affy1
Affy2
Affy3
meaning
v~~~~~ Affy2
B1 1 2 0
B2 0 1 1
B3 5 1 1
^~~~ Affy1
I want to merge the columns by column name and put a '5' in place of each missing value. The merged result would then look like this:
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
Columns:
Affy1
Affy2
Affy3
Affy4
Affy5
That is,
v~~~~~~~ Affy2
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
^~~~ Affy1
In reality, each file has more than 700K columns and more than 2K rows. Thanks in advance!
Answer 0 (score: 0)
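Sorting the headers is hard when some of them appear in only one file. The best way I know is to build a directed graph with the Graph module and sort its elements topologically.
Once that's done, it's just a matter of assigning the values from each file to the correct columns and filling the gaps with 5s.
I have merged the headers into each data file as its first line, so the program works with this data (file1.txt followed by file2.txt):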
ID Affy1 Affy3 Affy4 Affy5
A1 1 2 5 1
A2 0 2 1 1
A3 1 0 2 2
ID Affy1 Affy2 Affy3
B1 1 2 0
B2 0 1 1
B3 5 1 1
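The question keeps the column names in a separate file, so data files with a header line like the ones above have to be produced first. A minimal sketch of one way to do that, assuming the names file holds one column name per line; `header_line` and the file names in the comments are hypothetical and not part of the original answer:

```perl
use strict;
use warnings;

# Build a header line ("ID" followed by the column names) from a list of
# names, as they would be read one per line from the separate names file.
sub header_line {
    my @names = map { my $n = $_; $n =~ s/^\s+|\s+$//g; $n } @_;
    return join ' ', 'ID', grep { length } @names;
}

# Hypothetical usage: names_file1.txt holds one column name per line
# and file1_data.txt holds the rows shown in the question.
#
#   open my $names, '<', 'names_file1.txt' or die $!;
#   print header_line(<$names>), "\n";
#   open my $data, '<', 'file1_data.txt' or die $!;
#   print while <$data>;

print header_line("Affy1\n", "Affy3\n", "Affy4\n", "Affy5\n"), "\n";
```

Redirecting that output to a new file gives input in the layout shown above.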
Here is the code that does the merge
use strict;
use warnings 'all';
use Graph::Directed;
my @files = qw/ file1.txt file2.txt /;
# Make an array of two file handles
#
my @fh = map {
    open my $fh, '<', $_ or die qq{Unable to open "$_" for input: $!};
    $fh;
} @files;
# Make an array of two lists of header names
#
my @file_heads = map { [ split ' ', <$_> ] } @fh;
# Use a directed graph to sort all of the header names so that they're
# still in the order that they were at the top of both files
#
my @ordered_headers = do {
    my $g = Graph::Directed->new;
    for my $f ( 0, 1 ) {
        my $file_heads = $file_heads[$f];
        $g->add_edge($file_heads->[$_], $file_heads->[$_+1]) for 0 .. $#$file_heads-1;
    }
    $g->topological_sort;
};
# Form a hash converting header names to column indexes for output
#
my %ordered_headers = map { $ordered_headers[$_] => $_ } 0 .. $#ordered_headers;
# Print the header and the reformed records from each file. Use the hash to
# convert the header names into column indexes
#
print "@ordered_headers\n";
for my $i ( 0 .. $#fh ) {
    my $fh = $fh[$i];
    my @file_heads = @{ $file_heads[$i] };
    my @splice = map { $ordered_headers{$_} } @file_heads;
    while ( <$fh> ) {
        next unless /\S/;
        my @columns;
        @columns[@splice] = split;
        $_ //= 5 for @columns[0 .. $#ordered_headers];
        print "@columns\n";
    }
}
Output:
ID Affy1 Affy2 Affy3 Affy4 Affy5
A1 1 5 2 5 1
A2 0 5 2 1 1
A3 1 5 0 2 2
B1 1 2 0 5 5
B2 0 1 1 5 5
B3 5 1 1 5 5
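For readers without the Graph module, the header-ordering step can be sketched in plain Perl with Kahn's algorithm for topological sorting. This is an illustrative alternative under stated assumptions, not the code the answer uses: `merge_header_order` is a hypothetical helper, and when the two files do not fully determine the relative order of two headers it simply returns one of the valid orders.

```perl
use strict;
use warnings;

# Merge several ordered header lists into one list that keeps every
# list's relative order (Kahn's algorithm for topological sorting).
sub merge_header_order {
    my @lists = @_;
    my ( %succ, %indegree, @seen_order );

    for my $list (@lists) {
        for my $i ( 0 .. $#$list ) {
            my $name = $list->[$i];
            if ( !exists $indegree{$name} ) {
                $indegree{$name} = 0;
                push @seen_order, $name;
            }
            next if $i == 0;
            my $prev = $list->[ $i - 1 ];
            next if $succ{$prev}{$name}++;    # count each edge only once
            $indegree{$name}++;
        }
    }

    # Start with the headers that have no predecessor, in first-seen order
    my @ready = grep { $indegree{$_} == 0 } @seen_order;
    my @sorted;
    while (@ready) {
        my $name = shift @ready;
        push @sorted, $name;
        for my $next ( sort keys %{ $succ{$name} || {} } ) {
            push @ready, $next if --$indegree{$next} == 0;
        }
    }

    # Leftover headers mean the files disagree about the order (a cycle)
    die "Header orders conflict between files\n" if @sorted < @seen_order;
    return @sorted;
}

my @order = merge_header_order(
    [qw/ ID Affy1 Affy3 Affy4 Affy5 /],
    [qw/ ID Affy1 Affy2 Affy3 /],
);
print "@order\n";    # ID Affy1 Affy2 Affy3 Affy4 Affy5
```

The returned list could stand in for `@ordered_headers` in the program above; the rest of the merge would stay the same.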
Answer 1 (score: -1)
Just for the fun of it. HTH
#!/usr/bin/perl
use warnings;
use strict;
use constant {A => 1, B => 2, BOTH =>3};
# The data is hard-coded here instead of being read from the files
my @columns = qw(Affy1 Affy2 Affy3 Affy4 Affy5);
my @locations = (BOTH, B, BOTH, A, A);
my @contentA = (["A1", 1, 2, 5, 1],
                ["A2", 0, 2, 1, 1],
                ["A3", 1, 0, 2, 2]);
my @contentB = (["B1", 1, 2, 0],
                ["B2", 0, 1, 1],
                ["B3", 5, 1, 1]);
# I assume both files have the same number of lines
my @ares = ();
my @bres = ();
for(my $i = 0; $i < @contentA; ++$i){
    # This uses a lot of memory with huge amounts of data.
    # Maybe write the rows to two temporary result files and cat them
    # together at the end.
    # Another alternative would be to iterate first over
    # file A and then over file B
    my @row_a = ();
    my @row_b = ();
    push @row_a, shift @{$contentA[$i]}; #id
    push @row_b, shift @{$contentB[$i]}; #id
    foreach my $loc (@locations){
        if(A == $loc){
            push @row_a, shift @{$contentA[$i]};
            push @row_b, 5;
        }
        if(B == $loc){
            push @row_a, 5;
            push @row_b, shift @{$contentB[$i]};
        }
        if(BOTH == $loc){
            push @row_a, shift @{$contentA[$i]};
            push @row_b, shift @{$contentB[$i]};
        }
    }
    push @ares, \@row_a;
    push @bres, \@row_b;
}
foreach my $ar(@ares){
    print join " ", @{$ar};
    print "\n";
}
foreach my $br(@bres){
    print join " ", @{$br};
    print "\n";
}
print join("\n", @columns);
print "\n";
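The script above hard-codes `@locations` and walks both files in lockstep. As a rough sketch of the alternative its comments mention (handling file A first and then file B), the flags can be derived from the header lists and each row padded on its own, which also removes the assumption that both files have the same number of lines. `locations` and `pad_row` are hypothetical helpers, and the merged column order is assumed to be known already:

```perl
use strict;
use warnings;
use constant { A => 1, B => 2, BOTH => 3 };

# Work out, for each merged column, whether it occurs in file A, file B,
# or both. A | B == BOTH, so the flags can simply be OR-ed together.
sub locations {
    my ( $columns, $heads_a, $heads_b ) = @_;
    my %where;
    $where{$_} = ( $where{$_} || 0 ) | A for @$heads_a;
    $where{$_} = ( $where{$_} || 0 ) | B for @$heads_b;
    return map { $where{$_} || 0 } @$columns;
}

# Pad a single row (ID first) from one file, so that each file can be
# processed on its own, one line at a time.
sub pad_row {
    my ( $locations, $file, @row ) = @_;
    my @out = ( shift @row );    # row ID
    for my $loc (@$locations) {
        push @out, ( $loc & $file ) ? shift @row : 5;
    }
    return @out;
}

my @columns   = qw(Affy1 Affy2 Affy3 Affy4 Affy5);
my @locations = locations(
    \@columns,
    [qw(Affy1 Affy3 Affy4 Affy5)],
    [qw(Affy1 Affy2 Affy3)],
);
print join( ' ', pad_row( \@locations, A, qw(A1 1 2 5 1) ) ), "\n";    # A1 1 5 2 5 1
print join( ' ', pad_row( \@locations, B, qw(B1 1 2 0) ) ),   "\n";    # B1 1 2 0 5 5
```

Because `pad_row` only ever holds one row, it could be called inside a `while (<$fh>)` loop over each file in turn instead of collecting every row in memory.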