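I have some working code that extracts range information from FASTA headers when a numeric position (taken from another file) falls within that range, and prints the regex captures together with the original position.
File 1 sample data: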
7065_8#10 9436 - t
7065_8#10 126477 - c
7065_8#10 413711 + T
File 2 sample data:
>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156 3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
>SAEMRSA15_00060 gyrA_DNA_gyrase_subunit_A 7005 9674 forward
ATGTCGGAAAAAGAAATTTGGGA
Code:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";
open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";
open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";
open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();
while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}
close FILE1;
while (<FILE2>) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;
        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4\n";
            }
        }
    }
}
close FILE2;
Example output:
9436 7005 9674 forward
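In addition to this captured information, I would like to print the next line of the FASTA file after a match, which contains the sequence data for that gene. I want that next line printed on the same line, after the original data. This sounds very simple on paper, but I can't work out how to do it! I have tried to take previous answers and incorporate them into my code, to no avail (shown below). If possible, please adapt my code rather than suggesting an entirely new approach in place of the original working code; I try to make sure I understand every script I use rather than simply pasting in an answer.
Desired output: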
9436 7005 9674 forward ATGTCGGAAAAAGAAATTTGGGA
Incorrect code:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";
open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";
open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";
open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();
while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}
my $nextline = 0;
close FILE1;
while ( my $line = <FILE2> ) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;
        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                if ($nextline) {
                    print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline\n";
                    $nextline = ( $line =~ /^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/ );
                }
            }
        }
    }
}
close FILE2;
Thanks in advance!
Answer 0 (score: 2)
The problem is that you never give $nextline a value:
for my $cc22 (@file1list) {
    if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
        if ($nextline) {
            ...
        }
    }
}
Nowhere before that test is $nextline given a true value, so the body of the if ($nextline) statement is never executed. To change that, you need to alter your code so that $nextline is assigned. Since you haven't posted any input data, it is a little hard to know exactly what should be done, but assuming that you get a match on one line and then want to print some details along with the sequence on the line after the match, you could edit your code to the following:
my $nextline;
while ( my $line = <FILE2> ) {
    if ($line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;
        # pull in the next line
        $nextline = <FILE2>;
        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                # print out the first part of the line without a line break
                # and the next line, which already has the line break on it.
                print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline";
            }
        }
    }
}
Using the following input for file 1:
7065_8#10 992 - t
7065_8#10 2264 - c
7065_8#10 413711 + T
Output:
992 517 1878 forward ATGTCGGAAAAAGAAATTTGGGA
2264 2156 3289 forward ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
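Answer 1 (score: 0)
You start with $nextline = 0. You then only change $nextline inside the block that begins with if ($nextline). That block is never entered, because $nextline is 0 (false). So $nextline stays 0 until the end of the script.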
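For anyone who wants to try the fix without the original files, here is a minimal self-contained sketch. It is not code from either answer: the helper name match_positions is hypothetical, the FASTA sample from the question is read through an in-memory filehandle, and lexical filehandles replace the barewords. It also assumes each sequence sits on the single line directly after its header, as in the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns one output line
# for every position that falls strictly inside a CDS range.
sub match_positions {
    my ( $positions, $fasta_fh ) = @_;
    my @results;
    while ( my $line = <$fasta_fh> ) {
        # Match against $line explicitly; a bare /.../ would test $_ instead.
        next unless $line =~ /^>\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/;
        my ( $start, $end, $strand ) = ( $1, $2, $3 );

        # Assumption: the sequence is the single line right after the header.
        my $sequence = <$fasta_fh>;
        $sequence = '' unless defined $sequence;    # header was the last line
        chomp $sequence;

        for my $pos (@$positions) {
            push @results, "$pos $start $end $strand $sequence"
                if $pos > $start && $pos < $end;
        }
    }
    return @results;
}

# Sample FASTA data from the question, held in memory for the demonstration.
my $fasta = <<'END_FASTA';
>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156 3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
>SAEMRSA15_00060 gyrA_DNA_gyrase_subunit_A 7005 9674 forward
ATGTCGGAAAAAGAAATTTGGGA
END_FASTA

open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";

# Positions taken from the question's file 1 sample.
print "$_\n" for match_positions( [ 9436, 126477, 413711 ], $fh );
close $fh;
```

With the question's sample data this prints the single line 9436 7005 9674 forward ATGTCGGAAAAAGAAATTTGGGA. Because the sequence line is chomped, the newline is added once at print time, which avoids a doubled line break if "\n" is also kept in the format string.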