我想从一系列范围值中计算重叠数(#)和百分比(%),这些范围值分布在使用特定标识符(id)启动的四个不同文件中(如NP_111111.4)。最初的id列表取自file1.txt(起始文件),如果id与其他文件的id匹配,则计算重叠。假设我的文件是这样的:
FILE1.TXT
NP_111111.4: 1-9 12-20 30-41
YP_222222.2: 3-30 40-80
FILE2.TXT
NP_111111.4: 1-6, 13-22, 31-35, 36-52
NP_414690.4: 360-367, 749-755
YP_222222.2: 19-24, 22-40
file3.txt
NP_418214.2: 1-133, 135-187, 195-272
YP_222222.2: 1-10
file4.txt
NP_418119.2
YP_222222.2 GO:0016878, GO:0051108
NP_111111.4 GO:0005887
从这些输入文件中,我想创建一个.csv或excel输出,其中包含带标题的单独列:
id overlap_file1_file2(#) overlap_file1_file2(%) overlap_file1_file3(#) overlap_file1_file3(%) overlap_file1_file2_file3(#) overlap_file1_file2_file3(%) Go_Terms(File4)
我正在学习perl,并为这种类型的范围比较找到了一个perl模块“strictures”。我正在计算两个范围的重叠数和百分比:
#!/usr/bin/perl
use strictures;
use Number::Range;
my $seq1 = Number::Range->new(8..356); #Start and stop for file1.txt
my $seq2 = Number::Range->new(156..267); #Start and stop for file2.txt
my $overlap = 0;
my $sseq1 = $seq1->size;
my $percent = (($seq2->size * 100) / $seq1->size);
foreach my $int ($seq2->range) {
if ( $seq1->inrange($int) ) {
$overlap++;
}
else {
next;
}
}
print "Total size= $sseq1 Number overlapped= $overlap Percentage overlap= $percent \n";
但我找不到将(file1.txt)的id与其他文件匹配的方法来提取特定信息并将它们打印在输出csv文件中。
请帮忙。谢谢你的考虑。
答案 0 :(得分:1)
这是一个脆弱的解决方案,因为它只能检查3个文件是否有重叠。如果涉及更多文件,则需要重新构建代码。它使用Set :: IntSpan来计算重叠(以及重叠的百分比。
#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
use autodie;
my $file1 = 'file1';
my @files = qw/file2 file3/;
my %data;
my %ids;
open my $fh1, '<', $file1;
while (<$fh1>) {
chomp;
my ($id, $list) = split /:\s/;
$ids{$id}++;
$data{$file1}{$id} = Set::IntSpan->new(split ' ', $list);
}
close $fh1;
for my $file (@files) {
open my $fh, '<', $file;
while (<$fh>) {
chomp;
my ($id, $list) = split /:\s/;
next unless exists $ids{$id};
$data{$file}{$id} = Set::IntSpan->new(split /,\s/, $list);
}
close $fh;
}
my %go_terms;
open my $go, '<', 'file4';
while (<$go>) {
chomp;
my ($id, $terms) = split ' ', $_, 2;
$go_terms{$id} = $terms =~ tr/,//dr;
}
close $go;
my %output;
for my $file (@files) {
for my $id (keys %ids) {
my $count = ($data{$file1}{$id} * $data{$file}{$id})->size;
my $percent = sprintf "%.0f", 100 * $count / $data{$file1}{$id}->size;
$output{$id}{$file} = [$count, $percent];
}
}
for my $id (keys %ids) {
my $count = ($data{$file1}{$id} * $data{$files[0]}{$id} * $data{$files[1]}{$id})->size;
my $percent = sprintf "%.0f", 100 * $count / $data{$file1}{$id}->size;
$output{$id}{all_files} = [$count, $percent];
}
# output saved as f2.csv
print join(",", qw/ID f1f2_overlap f1f2_%overlap
f1f3_overlap f1f3_%overlap
f1f2f3_overlap f1f2f3_%overlap Go_terms/), "\n";
for my $id (keys %output) {
print "$id,";
for my $file (@files, 'all_files') {
my $aref = $output{$id}{$file};
print join(",", @$aref), ",";
}
print +($go_terms{$id} // ''), "\n";
}
Excel工作表看起来像这样。