Tagging words/phrases in a sentence based on regex matches in Perl

Asked: 2013-07-19 04:24:00

Tags: regex perl

I have the following sentence:

zzzzzzz  microRNA146a xxx (miR-146a, mir-33c) xxxx wwwwww Breast Cancer zzzz mir-33c kkk

What I want to do is tag the words/phrases in the sentence based on some predefined regex rules, so that in the end it looks like this:

zzzzzzz  [microRNA146a]<MIR-0> xxx ([miR-146a]<MIR-1>, [mir-33c]<MIR-2>) xxxx wwwwww [Breast Cancer] <CANCER-0> zzzz [mir-33c]<MIR-2> kkk.

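Note that in the output above, each word/phrase that matches a rule is indexed in order of occurrence, and a recurring term keeps the index of its first occurrence (mir-33c is MIR-2 both times).

I'm stuck with the following code. What is the right way to do this?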

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $text = 'zzzzzzz   microRNA146a xxx (miR-146a, mir-33c) xxxx wwwwww Breast Cancer zzzz';

# Rule 1 for miRNA definition 
my @mirlist = ($text =~ /( mir-\d+\w+| microRNA\d+)/xgi);

# Rule 2 for special words/phrases
my @spec = ($text =~ /(Breast Cancer)/gi);

# These arrays already preserve the order of occurrence
print Dumper \@mirlist ;
print Dumper \@spec ;

# Not sure how to proceed from here

Update: added a recurring miRNA and refined the desired output.

1 Answer:

Answer 0: (score: 2)

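Rather than dumping the matches and iterating over the two arrays with a for loop, you can tag the text in place with a substitution and one counter per rule: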

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
my $text = 'zzzzzzz   microRNA146a xxx (miR-146a, mir-33c) xxxx microRNA146a wwwwww Breast Cancer aaaa Breast Cancer zzzz mir-33c kkk';

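# Rule 1 for miRNA definition
# /e evaluates the replacement as Perl code, /g tags every match, /i ignores case.
# $i is incremented on every match, so a recurring term gets a new index each time.
my $i = 0;
$text =~ s/(mir-\d\w+|microrna\d+\w?)/"[$1]<MIR-" . $i++ . ">"/gie;

# Rule 2 for special words/phrases, with its own counter
my $j = 0;
$text =~ s/(breast cancer)/"[$1]<CANCER-" . $j++ . ">"/gie;

print "$text\n";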

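The substitutions above number every match in sequence, so a term that occurs twice gets two different indices. The desired output in the question instead shows mir-33c as MIR-2 both times. A minimal sketch of one way to get that behavior is to remember each term's index in a hash. The helper name `tag_matches` and the choice to compare terms case-insensitively (via `lc`) are assumptions for illustration, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): tag every match of $re in
# $text, reusing the index a term received at its first occurrence.
sub tag_matches {
    my ($text, $re, $label) = @_;
    my %index_of;    # lowercased term => index assigned when first seen
    my $next = 0;    # next unused index for this rule
    $text =~ s{($re)}{
        my $key = lc $1;    # assumption: "MIR-33c" and "mir-33c" are the same term
        $index_of{$key} = $next++ unless exists $index_of{$key};
        sprintf '[%s]<%s-%d>', $1, $label, $index_of{$key};
    }ge;
    return $text;
}

my $text = 'zzzzzzz  microRNA146a xxx (miR-146a, mir-33c) xxxx wwwwww Breast Cancer zzzz mir-33c kkk';

# One call per rule; each rule keeps its own numbering.
$text = tag_matches($text, qr/mir-\d\w+|microrna\d+\w?/i, 'MIR');
$text = tag_matches($text, qr/breast cancer/i,            'CANCER');

print "$text\n";
```

For the sentence from the question this prints:

zzzzzzz  [microRNA146a]<MIR-0> xxx ([miR-146a]<MIR-1>, [mir-33c]<MIR-2>) xxxx wwwwww [Breast Cancer]<CANCER-0> zzzz [mir-33c]<MIR-2> kkk

Because the rules are applied one after another, a later rule sees the tags inserted by earlier ones. That is harmless for these two patterns, but may need care if a rule could match text such as "MIR-0".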