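Find a substring in list members using perl

Time: 2014-08-28 10:30:19

Tags: perl

I'm trying to find the fastest way to filter a long array by looking for a substring in its members, using the following methods: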

  • $str =~ /\.xml/ - finds ".xml" anywhere in the string
  • $str =~ /\.xml$/ - finds ".xml" at the end of the string
  • substr($str,-4) eq ".xml" - are the last 4 characters ".xml"?
  • rindex($str, ".xml") - any occurrence of ".xml"
  • (length($str) - rindex($str,".xml")) == 4 - are the last 4 characters ".xml"?
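
Note that these five checks are not all equivalent, so a benchmark between them also compares slightly different filters. Below is a minimal sketch that shows which inputs each check accepts. The sample strings and the %check table are made up for illustration and are not part of the benchmark; the rindex check is written with the >= 0 comparison. In particular, rindex returns -1 when nothing is found, so the length/rindex check accepts any 3-character string (the generated benchmark data never contains such short strings).

```perl
use 5.014;      # for the s///r flag used when printing
use warnings;

# Hypothetical sample strings, chosen only to expose the differences.
my @samples = ("a1.xml", "a1.xml.bak", "a1.txt", "abc", "a1.xml\n");

# The five checks from the list above, wrapped as code refs.
my %check = (
    match     => sub { $_[0] =~ /\.xml/ },
    matchend  => sub { $_[0] =~ /\.xml$/ },
    substr    => sub { substr($_[0], -4) eq ".xml" },
    rindex    => sub { rindex($_[0], ".xml") >= 0 },
    lenrindex => sub { (length($_[0]) - rindex($_[0], ".xml")) == 4 },
);

# Collect, per check, the samples it accepts, and print them.
my %accepted;
for my $name (sort keys %check) {
    $accepted{$name} = [ grep { $check{$name}->($_) } @samples ];
    my @shown = map { s/\n/\\n/r } @{ $accepted{$name} };   # make newlines visible
    printf "%-10s %s\n", $name, join(", ", map { qq("$_") } @shown);
}
```

With these samples, only lenrindex accepts "abc", and only match and rindex accept "a1.xml.bak".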

I tried all of the above with while/if/push, for/if/push and grep, using the following code (updated with ideas from the comments):

use 5.016;
use warnings;
use Benchmark qw(:all);

my $nmax = 5_000_000;
my @list = map { sprintf "a%s.%s", int(rand(100000000)), (int(rand(2))%2?"txt":"xml") } 1..$nmax;

cmpthese(10, {
        'whl_match'    => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml/  )}; },
        'whl_matchend' => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml$/ )}; },
        'whl_matchendz'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( $x =~ /\.xml\z/ )}; },
        'whl_substr'   => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( substr($x,-4) eq ".xml" )}; },
        'whl_rindex'   => sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if( rindex($x,".xml") >= 0 )}; },
        'whl_lenrindex'=> sub { my @xml; while(my ($i, $x) = each @list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};},

        'for_match'    => sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml/  )}; },
        'for_matchend' => sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml$/ )}; },
        'for_matchendz'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if( $x =~ /\.xml\z/ )}; },
        'for_substr'   => sub { my @xml; for my $x (@list) { push(@xml, $x) if( substr($x,-4) eq ".xml" )}; },
        'for_rindex'   => sub { my @xml; for my $x (@list) { push(@xml, $x) if( rindex($x,".xml") >= 0 )}; },
        'for_lenrindex'=> sub { my @xml; for my $x (@list) { push(@xml, $x) if((length($x)-rindex($x,".xml"))==4)};},

        'grp_match'    => sub { my @xml = grep { /\.xml/ }  @list; },
        'grp_matchend' => sub { my @xml = grep { /\.xml$/ } @list; },
        'grp_matchendz'=> sub { my @xml = grep { /\.xml\z/ } @list; },
        'grp_substr'   => sub { my @xml = grep { substr($_,-4) eq ".xml" } @list; },
        'grp_rindex'   => sub { my @xml = grep { rindex($_,".xml") >= 0 } @list; },
        'grp_lenrindex'=> sub { my @xml = grep { (length($_) - rindex($_,".xml")) == 4 } @list; },
});

Results on my notebook:

              s/iter whl_matchend whl_matchendz grp_matchendz grp_matchend whl_lenrindex whl_match whl_substr grp_match whl_rindex for_matchend for_matchendz for_lenrindex for_match grp_lenrindex for_substr for_rindex grp_substr grp_rindex
whl_matchend    4.48           --           -0%          -10%         -12%          -17%      -21%       -24%      -25%       -32%         -47%          -47%          -67%      -70%          -70%       -73%       -73%       -77%       -78%
whl_matchendz   4.46           0%            --           -9%         -11%          -17%      -21%       -23%      -25%       -32%         -46%          -46%          -67%      -69%          -70%       -73%       -73%       -76%       -78%
grp_matchendz   4.05          11%           10%            --          -2%           -9%      -13%       -15%      -17%       -25%         -41%          -41%          -63%      -66%          -67%       -70%       -70%       -74%       -76%
grp_matchend    3.96          13%           13%            2%           --           -6%      -11%       -14%      -15%       -24%         -40%          -40%          -62%      -66%          -66%       -70%       -70%       -73%       -75%
whl_lenrindex   3.70          21%           21%            9%           7%            --       -5%        -8%       -9%       -18%         -35%          -35%          -60%      -63%          -64%       -67%       -67%       -72%       -73%
whl_match       3.53          27%           27%           15%          12%            5%        --        -3%       -5%       -14%         -32%          -32%          -58%      -61%          -62%       -66%       -66%       -70%       -72%
whl_substr      3.42          31%           30%           18%          16%            8%        3%         --       -2%       -12%         -30%          -30%          -57%      -60%          -61%       -65%       -65%       -69%       -71%
grp_match       3.36          33%           33%           20%          18%           10%        5%         2%        --       -10%         -29%          -29%          -56%      -59%          -60%       -64%       -64%       -69%       -71%
whl_rindex      3.02          48%           48%           34%          31%           22%       17%        13%       11%         --         -21%          -21%          -51%      -55%          -56%       -60%       -60%       -65%       -67%
for_matchend    2.40          87%           86%           69%          65%           55%       47%        43%       40%        26%           --           -0%          -38%      -43%          -44%       -50%       -50%       -56%       -59%
for_matchendz   2.39          87%           87%           69%          65%           55%       47%        43%       40%        26%           0%            --          -38%      -43%          -44%       -50%       -50%       -56%       -59%
for_lenrindex   1.49         201%          200%          172%         166%          149%      137%       130%      126%       103%          61%           61%            --       -8%          -10%       -19%       -19%       -29%       -33%
for_match       1.36         229%          227%          197%         191%          172%      159%       151%      146%       122%          76%           76%            9%        --           -2%       -11%       -12%       -23%       -27%
grp_lenrindex   1.33         237%          236%          204%         198%          178%      165%       157%      153%       127%          80%           80%           12%        2%            --        -9%        -9%       -21%       -26%
for_substr      1.21         271%          270%          235%         228%          207%      192%       184%      178%       150%          98%           98%           23%       13%           10%         --        -0%       -13%       -18%
for_rindex      1.20         272%          271%          236%         229%          208%      193%       184%      179%       151%          99%           99%           23%       13%           10%         0%         --       -13%       -18%
grp_substr      1.05         326%          325%          285%         277%          252%      235%       226%      220%       188%         128%          128%           41%       30%           27%        15%        15%         --        -6%
grp_rindex     0.990         352%          351%          309%         300%          274%      256%       246%      239%       205%         142%          142%           50%       38%           34%        22%        22%         6%         --

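I repeated the test several times and always got the order above.


Question 1.

As I expected, grep performs similarly to while/if/push, but the following surprised me:

Compare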

              s/iter
whl_matchend    4.54
grp_matchend    3.98

Here grep is only slightly faster than the equivalent while/if/push.

So why, for example, in the following:

whl_substr      3.23
grp_substr      1.05

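is grep 3 times faster than while/if/push? So with substr, grep is about 3 times faster than while/if/push, whereas with a regex match, grep {/regex/} is only slightly faster than while/if/push. The same can be seen for any "string operation".

In other words, why does grep {/regex/} give only a slight speedup over while/if/push with $str =~ /\.xml/, while grep {substr} gives a huge speedup?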
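
One way to dig into Question 1 is to time the loop constructs on their own, with no string test at all. This is only a sketch under assumptions: the list is much smaller than in the benchmark, the %visited counters are hypothetical helpers, and it measures loop overhead only, so it cannot by itself explain the numbers in the question.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumption: a much smaller list than in the question, to keep the run short.
my @list = map { "a$_." . ($_ % 2 ? "txt" : "xml") } 1 .. 200_000;

# Hypothetical counters, only used to confirm each loop visits every element.
my %visited;

cmpthese(10, {
    # while/each over the array, no condition and no push
    whl_empty => sub {
        my $n = 0;
        while (my ($i, $x) = each @list) { $n++ }
        $visited{whl_empty} = $n;
    },
    # plain for loop over the array
    for_empty => sub {
        my $n = 0;
        for my $x (@list) { $n++ }
        $visited{for_empty} = $n;
    },
    # grep with an always-true block; in scalar context it returns the count
    grp_empty => sub {
        my $n = grep { 1 } @list;
        $visited{grp_empty} = $n;
    },
});
```

If the while/each variant turns out noticeably slower here as well, that would suggest part of the gap comes from the loop construct rather than from the string test.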


Question 2

Another surprise (at least for me) is the following: why is $str =~ /\.xml/ faster than $str =~ /\.xml$/? I expected that specifying the $ anchor would speed up the regex, because the whole string would not need to be searched. That assumption is wrong, as the following test shows:

use 5.016;
use warnings;
use Benchmark qw(:all);

my $str = "a38877283.xml";
cmpthese(10, {
        'match'     => sub { $str =~ /\.xml/   for (1..5_000_000) },
        'matchend'  => sub { $str =~ /\.xml$/  for (1..5_000_000) },
        'matchendz' => sub { $str =~ /\.xml\z/ for (1..5_000_000) }, #updated the \z
});

Results with perl 5, version 20, subversion 0 (v5.20.0) built for darwin-2level (perlbrew):

          s/iter matchend matchendz match
matchend    2.32       --       -1%  -64%
matchendz   2.30       1%        --  -63%
match      0.844     175%      173%    --

With perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level (default OS X):

             Rate matchendz     match  matchend
matchendz 0.405/s        --      -69%      -70%
match      1.29/s      218%        --       -5%
matchend   1.36/s      235%        5%        --

The choice of perl makes a difference. ;)

OS:

Darwin jabko.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64

Last question

  • I still haven't tested the effect of a precompiled regex (qr). Any other ideas for what could be the fastest filter?
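
For the untested qr idea, a minimal sketch of how a precompiled pattern could be added to the grep benchmark. The list size and iteration count are assumptions chosen to keep the run short, and the names grp_literal and grp_qr are made up here; no timings are claimed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumption: a much smaller list than the 5_000_000 used in the question.
my $nmax = 100_000;
my @list = map { sprintf "a%s.%s", int(rand(100000000)), (int(rand(2)) ? "txt" : "xml") } 1 .. $nmax;

# Pattern compiled once, outside the benchmarked subs.
my $re = qr/\.xml\z/;

# Results are kept so the two variants can be checked against each other.
my (@literal, @precompiled);

cmpthese(20, {
    grp_literal => sub { @literal     = grep { /\.xml\z/ } @list },
    grp_qr      => sub { @precompiled = grep { $_ =~ $re } @list },
});

print "both variants selected the same elements\n"
    if "@literal" eq "@precompiled";
```

Both variants select the same elements, so any difference in the table comes from how the pattern is supplied, not from the filter itself.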

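1 Answer:

Answer 0: (score: 2)

Regarding the last question, try \z instead of $, because \z matches only at the very end of the string, whereas $ also allows for an optional trailing newline (perldoc perlre).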

use Benchmark qw(:all);
my $str = "a38877283.xml";
cmpthese(10, {
    'match'    => sub { $str =~ /\.xml/  for (1..5_000_000) },
    'matchend' => sub { $str =~ /\.xml$/ for (1..5_000_000) },
    'matchend2' => sub { $str =~ /\.xml\z/ for (1..5_000_000) },
});

Output

             Rate matchend2  matchend     match
matchend2 0.473/s        --      -58%      -59%
matchend   1.14/s      140%        --       -1%
match      1.15/s      143%        1%        --
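
To illustrate the difference between $ and \z described in this answer, here is a small sketch. The string with the trailing newline is a hypothetical input (for example a line that was not chomp-ed), and the %matches table is just a helper for printing.

```perl
use strict;
use warnings;

my $plain        = "a38877283.xml";
my $with_newline = "a38877283.xml\n";   # hypothetical input with a trailing newline

# 1 if the pattern matches, 0 otherwise, for each anchor and each input.
my %matches = (
    'dollar, plain'        => ($plain        =~ /\.xml$/  ? 1 : 0),
    'dollar, with newline' => ($with_newline =~ /\.xml$/  ? 1 : 0),
    'z, plain'             => ($plain        =~ /\.xml\z/ ? 1 : 0),
    'z, with newline'      => ($with_newline =~ /\.xml\z/ ? 1 : 0),
);

for my $case (sort keys %matches) {
    printf "%-22s %s\n", $case, $matches{$case} ? "match" : "no match";
}
```

Only the \z pattern rejects the string that ends in a newline; on strings without a trailing newline the two anchors accept the same inputs.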