为什么匹配运算符不匹配任何内容?

时间:2009-11-01 22:26:31

标签: html perl

我正在尝试解析此HTML块:

<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&amp;adtype=pyv&amp;event=ad&amp;usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">

捕获重定向链接:

/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&amp;adtype=pyv&amp;event=ad&amp;usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=

和视频标题:

The Valley Downs Chicago

当我使用这个简单的Perl代码时:

foreach $_ (@promotedVideos)
{
   if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/six)
   {
     print $1;
     print $2;
   }
}
什么都不打印。虽然我正在对此进行故障排除,但我认为如果您发现任何错误或有问题,我会问您专家。非常感谢您的帮助!

4 个答案:

答案 0 :(得分:7)

你的/ x正则表达式修饰符会混淆空格。删除它。

也就是说,它应该是

if (/\s<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si)

/ x使perl忽略正则表达式中的空格,使你的正则表达式相当于以下内容:

/\s<divclass="v120WrapperInner"><a href="([^"]*)"title="([^"]*)"><img/six

不匹配。

开头的那个也可能会制造东西。

这是我用于测试的代码:

use strict;


my $inp = '<div class="v120WrapperInner"><a href="/redirect?q=http%3A%2F%2Fwww.google.com%2Faclk%3Fsa%3DL%26ai%3DCKJh--O7tSsCVIKeyoQTwiYmRA5SnrIsB1szYhg2d2J_EAhABIJ7rxQ4oA1CLk676B2DJntmGyKOQGcgBAaoEFk_Qyu5ipY7edN5ETLuchKUCHbY4SA#0%26num%3D1%26sig%3DAGiWqtwtAf8NslosN7AuHb7qC7RviHVg7A%26q%3Dhttp%3A%2F%2Fwww.youtube.com%2Fwatch%253Fv%253D91sYT_8CN8Q%2526feature%253Dpyv%2526ad%253D3409309746%2526kw%253Dsusan%25252#0boyle&amp;adtype=pyv&amp;event=ad&amp;usg=bR7ErKA_3szWtQMGe2lt1dpxzHc=" title="The Valley Downs Chicago"><img class="vimg120" alt="The Valley Downs Chicago" src="http://i2.ytimg.com/vi/91sYT_8CN8Q/1.jpg">';

print "$inp\n";

if ( $inp =~ /<div class="v120WrapperInner"><a href="([^"]*)" title="([^"]*)"><img/si )
{
 print "m:\n$1\n$2\n";
}

答案 1 :(得分:3)

好的,这不完全是你问的问题,但我认为(基于这个和你的旧问题)你正在解析HTML。

让我告诉你:正则表达式不是解决方案。您应该使用HTML::TreeBuilder来解析HTML文档,因为HTML文档非常混乱。

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file(\*DATA);
foreach my $div ($root->find_by_tag_name('div')) {
    if ($div->attr('class') eq 'v120WrapperInner') {
        foreach (my $a = $div->find_by_tag_name('a')) {
            print "m:\n", $a->attr('href'), "\n", $a->attr('title'), "\n";
        }
    }
}

答案 2 :(得分:2)

你很有可能在perl中获得regex的经验,但是对于这类工作,你可以考虑使用像XML::DOM这样的DOM解析器。

答案 3 :(得分:0)

天儿真好,

如果您在理解正则表达式时遇到问题,我可以建议您阅读Dale Dougherty出色的书“sed&amp; awk”(sanitised Amazon link)中的正则表达式介绍。

绝对是regexp的最佳介绍之一。

HTH

欢呼声,