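Print the next line after a successful match in a FASTA header

Asked: 2014-10-20 13:01:36

Tags: regex perl

I have some working code that extracts range information from a FASTA header if a numeric position (from another file) falls within that range, and prints the regex captures together with the original position.

File 1 sample data: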

7065_8#10   9436    -   t
7065_8#10   126477  -   c
7065_8#10   413711  +   T

File 2 sample data:

>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156  3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
>SAEMRSA15_00060 gyrA_DNA_gyrase_subunit_A 7005  9674 forward
ATGTCGGAAAAAGAAATTTGGGA

Code:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";

open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";

open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();

while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}

close FILE1;

while (<FILE2>) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4\n";
            }
        }
    }
}

close FILE2;

Example output:

9436 7005 9674 forward

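In addition to this captured information, I would like to print the next line of the FASTA file after the match, which contains the sequence data for that gene. I want that line printed on the same line as the original data, after it. This sounds very simple on paper, but I can't work out how to do it! I have tried to take previous answers and incorporate them into my code, to no avail (shown below). If possible, please adapt my code rather than suggesting an entirely new approach in place of the original working code; I try to make sure I understand every script I use rather than simply pasting in an answer.

Desired output:

9436 7005 9674 forward ATGTCGGAAAAAGAAATTTGGGA

Incorrect code: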

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";

open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";

open( OUTPUTFILE, ">$outputfile" );
my @file1list = ();

while (<FILE1>) {
    if (/^\S+\s+(\d+)/) {
        push @file1list, $1;
    }
}

my $nextline = 0;
close FILE1;

while ( my $line = <FILE2> ) {
    if (/^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/) {
        my $cds1 = $1;
        my $cds2 = $2;
        my $cds3 = $3;
        my $cds4 = $4;

        for my $cc22 (@file1list) {
            if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                if ($nextline) {
                    print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline\n";
                    $nextline = ( $line =~ /^>(\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+))/ );
                }
            }
        }
    }
}

close FILE2;

Thanks in advance!

2 Answers:

Answer 0 (score: 2)

The problem is that you never give $nextline a true value:

    for my $cc22 (@file1list) {
        if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
            if ($nextline) {
               ...
            }
        }
    }

$nextline is never set to a true value before the if ($nextline) test, so the print statement inside that block is never executed. To change this, you need to alter your code so that $nextline is assigned. Since you haven't posted any input data, it's a little hard to know exactly what should be done, but assuming that you get a match on one line and then want to print some details along with the sequence on the line after the match, you can edit your code to the following:

    my $nextline;
    while ( my $line = <FILE2> ) {
        if ($line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/) {
            my $cds1 = $1;
            my $cds2 = $2;
            my $cds3 = $3;
            my $cds4 = $4;
            # pull in the next line
            $nextline = <FILE2>;
            for my $cc22 (@file1list) {
                if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
                    # print out the first part of the line without a line break
                    # and the next line, which already has the line break on it.
                    print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4 $nextline";
                }
            }
        }
    }

Using the following input for file 1:

7065_8#10   992   -   t
7065_8#10   2264  -   c
7065_8#10   413711  +   T

Output:

992 517 1878 forward ATGTCGGAAAAAGAAATTTGGGA
2264 2156 3289 forward ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT

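Answer 1 (score: 0)

You start with $nextline = 0. You then only change $nextline inside the block that begins with if ($nextline). That block is never entered, because $nextline is 0 (false). So $nextline stays at 0 until the end of the script.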
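One way to make a "remember the match, print on the next line" idea work is to store the matched details when a header is seen and print them once the following line arrives. The sketch below is only an illustration under stated assumptions: it reads the question's sample data from in-memory strings instead of the real files, it assumes each sequence sits on the single line directly after its header, and the names @positions, @pending and @results are hypothetical, not taken from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: positions already extracted from file 1 (sample values from the question).
my @positions = ( 9436, 126477, 413711 );

# Assumption: file 2 content held in a string so the sketch is self-contained.
my $fasta = <<'END';
>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156  3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACATTACCT
>SAEMRSA15_00060 gyrA_DNA_gyrase_subunit_A 7005  9674 forward
ATGTCGGAAAAAGAAATTTGGGA
END

open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";

my @pending;    # output prefixes waiting for their sequence line
my @results;    # completed output lines

while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^>\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/ ) {
        my ( $start, $end, $strand ) = ( $1, $2, $3 );

        # A new header discards anything left over from the previous one.
        @pending = ();
        for my $pos (@positions) {
            if ( $pos > $start && $pos < $end ) {
                push @pending, "$pos $start $end $strand";
            }
        }
    }
    elsif (@pending) {
        # First non-header line after a matching header: append the sequence.
        push @results, "$_ $line" for @pending;
        @pending = ();
    }
}
close $fh;

print "$_\n" for @results;
```

With the sample data from the question, this should print the desired line 9436 7005 9674 forward ATGTCGGAAAAAGAAATTTGGGA. Compared with reading ahead using a second <FILE2> inside the loop, this keeps a single read per iteration, at the cost of carrying state between iterations.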