How do I extract a paragraph and selected lines with Perl?

Time: 2010-04-14 10:40:23

Tags: perl

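I have a text from which I need to:

  1. Extract the whole paragraph under the "AceView summary" section, up to (but not including) the line that starts with "Please quote".
  2. Extract the line that starts with "The closest human gene".
  3. Store them in an array with two elements.

The text looks like this (also on pastebin):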

      AceView: gene:1700049G17Rik, a comprehensive annotation of human, mouse and worm genes with mRNAs or ESTsAceView.
    
      <META NAME="title"
     CONTENT="
    AceView: gene:1700049G17Rik a comprehensive annotation of human, mouse and worm genes with mRNAs or EST">
    
    <META NAME="keywords"
     CONTENT="
    AceView, genes, Acembly, AceDB, Homo sapiens, Human,
     nematode, Worm, Caenorhabditis elegans , WormGenes, WormBase, mouse,
     mammal, Arabidopsis, gene, alternative splicing variant, structure,
     sequence, DNA, EST, mRNA, cDNA clone, transcript, transcription, genome,
     transcriptome, proteome, peptide, GenBank accession, dbest, RefSeq,
     LocusLink, non-coding, coding, exon, intron, boundary, exon-intron
     junction, donor, acceptor, 3'UTR, 5'UTR, uORF, poly A, poly-A site,
     molecular function, protein annotation, isoform, gene family, Pfam,
     motif ,Blast, Psort, GO, taxonomy, homolog, cellular compartment,
     disease, illness, phenotype, RNA interference, RNAi, knock out mutant
     expression, regulation, protein interaction, genetic, map, antisense,
     trans-splicing, operon, chromosome, domain, selenocysteine, Start, Met,
     Stop, U12, RNA editing, bibliography">
    <META NAME="Description" 
     CONTENT= "
    AceView offers a comprehensive annotation of human, mouse and nematode genes
     reconstructed by co-alignment and clustering of all publicly available
     mRNAs and ESTs on the genome sequence. Our goals are to offer a reliable
     up-to-date resource on the genes, their functions, alternative variants,
     expression, regulation and interactions, in the hope to stimulate
     further validating experiments at the bench
    ">
    
    
    <meta name="author"
     content="Danielle Thierry-Mieg and Jean Thierry-Mieg,
     NCBI/NLM/NIH, mieg@ncbi.nlm.nih.gov">
    
    
    
    
       <!--
        var myurl="av.cgi?db=mouse" ;
        var db="mouse" ;
        var doSwf="s" ;
        var classe="gene" ;
      //-->
    

But I am stuck with the following script logic. What is the right way to achieve this?

       #!/usr/bin/perl -w
    
       my  $INFILE_file_name = $file;      # input file name
    
        open ( INFILE, '<', $INFILE_file_name )
            or croak "$0 : failed to open input file $INFILE_file_name : $!\n";
    
    
        my @allsum;
    
        while ( <INFILE> ) {
            chomp;
    
            my $line = $_;
    
            my @temp1 = ();
            if ( $line =~ /^ AceView summary/ ) {
                print "$line\n";
                push @temp1, $line;
            }
            elsif( $line =~ /Please quote/) {
                push @allsum, [@temp1];
                 @temp1 = ();
            }
            elsif ($line =~ /The closest human gene/) {
    
                push @allsum, $line;
            }
    
        }
    
        close ( INFILE );           # close input file
        # Do something with @allsum
    

I need to process many files.

3 Answers:

Answer 0 (score: 5)

You can use the range operator in scalar context to extract the whole paragraph:

while (<INFILE>) {
    chomp;
    if (/AceView summary/ .. /Please quote/) {
        print "$_\n";
    }

    print "$_\n" if /^The closest human gene/;
}
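
The loop above prints the matching lines, including the "Please quote" line itself. A minimal sketch of how the same range operator could instead fill the two-element array the question asks for, with the terminating line left out, might look like this. The helper name `extract_fields` and the inline sample text are assumptions made for illustration; they do not come from the original post or from a real AceView page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (summary, closest-gene line) for one filehandle.
sub extract_fields {
    my ($fh) = @_;
    my ( @summary, $closest );

    while ( my $line = <$fh> ) {
        chomp $line;

        # The flip-flop is true from the start line through the end line.
        if ( $line =~ /AceView summary/ .. $line =~ /Please quote/ ) {
            # Skip the terminating line, which the question wants excluded.
            push @summary, $line unless $line =~ /Please quote/;
        }

        $closest = $line if $line =~ /^\s*The closest human gene/;
    }

    return ( join( "\n", @summary ), $closest );
}

# Made-up sample standing in for a real input file.
my $sample = join "\n",
    'header',
    'AceView summary',
    'first line',
    'second line',
    'Please quote: somebody',
    'The closest human genes, according to BlastP, are X and Y.',
    'footer';

# Read the sample through an in-memory filehandle.
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my @result = extract_fields($fh);
close $fh;

print "$result[0]\n---\n$result[1]\n";
```

To process many files, the helper could be called once per opened file. Note that each `..` operator keeps its own state between calls, so a file that has the start line but no "Please quote" line would leave the range open for the next file.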

Answer 1 (score: 4)

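If I understand correctly, you are getting this information from http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/av.cgi?db=mouse&c=gene&a=fiche&l=1700049G17Rik, which serves up one of the most horrific HTML mishmashes I have seen (possibly tied for first place with the vomit of the crappy Medicare plan finder).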

Still, it is no match for HTML::TokeParser::Simple:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('ace.html');
my ($summary, $closest_human);

while ( my $tag = $parser->get_tag('span') ) {
    # Guard against spans without a class attribute (avoids an undef warning).
    next unless ( $tag->get_attr('class') // '' ) eq 'hh3';
    next unless $parser->get_text('/span') eq 'AceView summary';
    $summary = $parser->get_text('span');
    $summary =~ s/^\s+//;
    $summary =~ s/\s*Please quote:.*\z//;
    last;
}

while ( my $tag = $parser->get_tag('b') ) {
    $closest_human = $parser->get_text('/b');
    next unless $closest_human eq 'The closest human genes';
    $closest_human .= $parser->get_text('br');
    last;
}

print "=== Summary ===\n\n$summary\n\n";
print "=== Closest Human Gene ==\n\n$closest_human\n"

Output (snipped):

=== Summary ===

Note that this locus is complex: it appears to produce several proteins with no
sequence overlap.
Expression: According to AceView, this gene is well expressed, 
... 
Please see the Jackson Laboratory Mouse Genome Database/Informatics site MGI_192
0680 for in depth functional annotation of this gene.

=== Closest Human Gene ==

The closest human genes, according to BlastP, are the AceView genes ZNF780AandZN
F780B (e=10^-15,), ZNF766 (e=2 10^-15,), ZNF607andZNF781andZFP30 (e=2 10^-14,).

Answer 2 (score: 1)

OTTOMH, I would do the extraction part with a simple state machine. Start with $state = 0, set it to 1 on /AceView summary/, and set it back to 0 on /Please quote/. Then push $_ onto the output array if $state == 1.
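
A minimal sketch of that state machine, assuming the lines have already been read into an array, could look like the following. The sample lines are made up for illustration and are not taken from a real AceView page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines standing in for a real input file.
my @lines = (
    'header',
    'AceView summary',
    'first line',
    'second line',
    'Please quote: somebody',
    'footer',
);

my $state = 0;    # 0 = outside the summary, 1 = inside it
my @output;

for (@lines) {
    $state = 1 if /AceView summary/;
    $state = 0 if /Please quote/;    # reset before the push, so this line is dropped
    push @output, $_ if $state == 1;
}

print "$_\n" for @output;
```

Because the state is reset before the push, the "Please quote" line is left out, which matches what the question asks for.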

But I like Eugene's answer better. This is Perl, so there are many ways to skin your proverbial cat...