Question

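I have some working code that extracts range information from FASTA headers when a numeric position (taken from another file) falls within that range, and prints the regex captures together with the original position.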

Sample data for file 1:

7065_8#10   9436    -   t
7065_8#10   126477  -   c
7065_8#10   413711  +   T

Sample data for file 2:

>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156  3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
>SAEMRSA15_00060 gyrA_DNA_gyrase_subunit_A 7005  9674 forward
ATGTCGGAAAAAGAAATTTGGGA

Code:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";

open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";

open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();

while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}

close FILE1;

while (<FILE2>) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4\n";
            }
        }
    }
}

close FILE2;

Example output:

9436 7005 9674 forward

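In addition to these captures, I would like to print the next line of the FASTA file after a match, which contains the sequence data for that gene. I want the next line printed on the same line, after the original data. It sounds very simple on paper, but I can't work out how to do it! I tried to use a previous answer and incorporate it into my code, to no avail (shown below). If possible, please adapt my code rather than suggesting a completely new approach to the original working code; I am trying to make sure I understand every script I use rather than simply pasting in an answer.

Desired output: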

9436 7005 9674 forward ATGTCGGAAAAAGAAATTTGGGA

Incorrect code:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";

open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";

open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();

while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}

my $nextline = 0;
close FILE1;

while ( my $line = <FILE2> ) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                if ($nextline) {
                    print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline\n";
                    $nextline = ( $line =~ /^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/ );
                }
            }
        }
    }
}

close FILE2;

Thanks in advance!

Answer 1

The problem is that you never give $nextline a value (other than its initial 0):

    for my $cc22 (@file1list) {
        if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
            if ($nextline) {
               ...
            }
        }
    }

$nextline is never set anywhere outside the if ($nextline) block, so that block is never executed. It is a little hard to know exactly what should be done, since you haven't posted any input data, but assuming that you get a match on one line and then want to print some details plus the sequence on the line after the match, you can edit your code to the following:

    my $nextline;
    while ( my $line = <FILE2> ) {
        # match against $line explicitly; $_ is not set by this loop
        if ( $line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/ ) {
            my $cds1 = $1;
            my $cds2 = $2;
            my $cds3 = $3;
            my $cds4 = $4;
            # pull in the next line
            $nextline = <FILE2>;
            for my $cc22 (@file1list) {
                if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                    # print out the first part of the line without a line break
                    # and the next line, which already has the line break on it.
                    print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline";
                }
            }
        }
    }

Using the following input for file 1:

7065_8#10   992   -   t
7065_8#10   2264  -   c
7065_8#10   413711  +   T

Output:

992 517 1878 forward ATGTCGGAAAAAGAAATTTGGGA
2264 2156 3289 forward ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
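For anyone who wants to try the "read the next line straight after the header matches" idea without the real files, here is a minimal self-contained sketch. It runs against the sample data from the question, using in-memory filehandles in place of the two files. The helper name find_matches and the variable names are made up for illustration, and the sketch assumes each sequence sits on a single line directly below its header; a FASTA file with wrapped sequences would need more work.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data copied from the question.
my $indels = <<'END';
7065_8#10   9436    -   t
7065_8#10   126477  -   c
7065_8#10   413711  +   T
END

my $fasta = <<'END';
>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156  3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
>SAEMRSA15_00060 gyrA_DNA_gyrase_subunit_A 7005  9674 forward
ATGTCGGAAAAAGAAATTTGGGA
END

# find_matches is a hypothetical helper, not part of the original script.
sub find_matches {
    my ( $indel_text, $fasta_text ) = @_;

    # Collect the positions from the second column of file 1.
    open my $in1, '<', \$indel_text or die "Cannot open indel data: $!";
    my @positions;
    while ( my $line = <$in1> ) {
        push @positions, $1 if $line =~ /^\S+\s+(\d+)/;
    }
    close $in1;

    open my $in2, '<', \$fasta_text or die "Cannot open FASTA data: $!";
    my @results;
    while ( my $line = <$in2> ) {
        next unless $line =~ /^>\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/;
        my ( $start, $end, $strand ) = ( $1, $2, $3 );

        # Read the line after the header (assumed to hold the whole sequence).
        my $sequence = <$in2>;
        last unless defined $sequence;
        chomp $sequence;

        for my $pos (@positions) {
            push @results, "$pos $start $end $strand $sequence"
                if $pos > $start && $pos < $end;
        }
    }
    close $in2;

    return @results;
}

print "$_\n" for find_matches( $indels, $fasta );
```

With the question's sample data this prints the single line 9436 7005 9674 forward ATGTCGGAAAAAGAAATTTGGGA, which matches the desired output. The chomp is optional here; it only makes the sub return clean strings so the caller decides where the line break goes.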

Answer 2

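You start with $nextline = 0. After that, you only change $nextline inside the block that begins with if ($nextline). That block is never entered, because $nextline is 0 (false). So $nextline stays 0 until the end of the script.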
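A stripped-down sketch of that situation, in case it helps to see it in isolation. The loop and the counter $entered are invented for illustration; only the shape of the guard mirrors the script in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $nextline = 0;    # 0 is false in Perl
my $entered  = 0;    # hypothetical counter, only here to show what happens

for my $pass ( 1 .. 3 ) {
    if ($nextline) {
        # The only assignment to $nextline lives inside the block it guards,
        # so this code is never reached.
        $entered++;
        $nextline = 1;
    }
}

print "block entered $entered times, nextline is still $nextline\n";
```

This prints "block entered 0 times, nextline is still 0". The way out is to assign $nextline somewhere that does not depend on $nextline already being true.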

Print the next line after a successful match in a FASTA header