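Perl - Regex to extract only comma-separated strings

Time: 2013-04-25 11:01:28

Tags: regex perl split comma www-mechanize

I have a problem that I hope someone can help with...

I have a variable that holds the content of a web page (scraped using WWW::Mechanize).

The variable contains data like the following: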

$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"

The only parts of the examples above that I am interested in are:

@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger") 
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")

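The problem I am having:

I am trying to extract only the comma-separated strings from the variable and store them in an array for later use.

However, what is the best way to make sure I also capture the strings at the start (i.e. cat_dog) and at the end (i.e. chicken-pig) of the comma-separated animal list, given that they have no leading or trailing comma?

Also, since the variable will contain web page content, there will inevitably be cases where a comma is immediately followed by a space and then another word, because that is the correct way to use commas in paragraphs and sentences...

For example: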

Saturn was long thought to be the only ringed planet, however, this is now known not to be the case. 
                                                     ^        ^
                                                     |        |
                                    note the spaces here and here

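I am not interested in any case where a comma is followed by a space (as shown above).

I am only interested in cases where the comma is NOT followed by a space (i.e. cat_dog,horse,rabbit,chicken-pig)

I have tried a number of approaches but cannot work out the best way to build the regex.

4 Answers:

Answer 0: (score: 8)

How about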

[^,\s]+(,[^,\s]+)+

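This matches one or more characters that are neither whitespace nor a comma, [^,\s]+, followed one or more times by a comma and one or more characters that are neither whitespace nor a comma.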
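As a quick check of the pattern, here is a minimal sketch that runs it against one sample string from the question and against the Saturn sentence. The helper name find_list is my own and purely illustrative; I have also made the inner group non-capturing so that the whole list lands in $1, which is a small variation on the pattern as written above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): returns the first run of
# comma-separated items with no whitespace around the commas, or undef.
sub find_list {
    my ($text) = @_;
    return $text =~ /([^,\s]+(?:,[^,\s]+)+)/ ? $1 : undef;
}

my $list  = find_list("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig");
my $prose = find_list("Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.");

print defined $list  ? "list:  $list\n"  : "list:  no match\n";
print defined $prose ? "prose: $prose\n" : "prose: no match\n";
```

With these two inputs the sketch should report the full animal list for the first string and no match for the prose sentence, because every comma in the sentence is followed by a space.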

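Further to the comments

To match more than one sequence, add the g modifier for global matching. The following splits each match ($&) on commas and pushes the results onto @matches: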

my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;

while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}   

print join("\n",@matches),"\n";

Answer 1: (score: 1)

While you could build a single regex, a combination of regexes, split, grep and map works nicely:

my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;   # bare split works on $_

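Going from right to left:

  1. Split the line on whitespace (split)
  2. Keep only the elements that have no comma at either end but do have one inside (grep)
  3. Split each such element into its parts (map and split)

This way you can easily adjust the individual stages. For example, to reject elements containing two consecutive commas, add && !/,,/ inside the grep block.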
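The three stages can be sketched as a small runnable script. The wrapper name extract_items is an assumption made for this example, and the sketch passes the line explicitly instead of relying on $_; it also includes the optional double-comma check mentioned above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner (the name is illustrative).
sub extract_items {
    my ($line) = @_;
    return map  { split /,/ }                        # 3. split each kept token on commas
           grep { !/^,/ && !/,$/ && !/,,/ && /,/ }   # 2. keep tokens with inner commas only
           split ' ', $line;                         # 1. split the line on whitespace
}

my $var   = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf";
my @array = extract_items($var);
print join("|", @array), "\n";
```

Ordinary prose such as "planet, however, this" produces nothing here, since tokens like "planet," end in a comma and are dropped by the grep stage.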

Answer 2: (score: 1)

I hope this is clear and suits your needs:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
    "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf", 
     "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew", 
     "Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
     "Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");

    my $regex = qr/
                \s #From your examples, it seems as if every
                   #comma separated list is preceded by a space.
                (
                    (?:
                        [^,\s]+ #Now, not a comma or a space for the
                                 #terms of the list

                        ,        #followed by a comma
                    )+
                    [^,\s]+     #followed by one last term of the list
                )
                /x;

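    # Test the match result directly rather than $1, which can hold a
    # stale capture after a failed match. Each string yields one array
    # reference (empty when no list is found).
    my @matches = map {
                    $_ =~ /$regex/ ? [ split /,/, $1 ] : []
                } @strs;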

Answer 3: (score: 0)

$var =~ tr/ //s;    # squeeze runs of spaces into a single space
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
    push(@arr, $&);
}

The regex matches three cases:

(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,)      : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig
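
To see the three alternatives working together, here is a minimal self-contained sketch that wraps the loop above in a sub. The name items_by_lookaround is hypothetical, and the sub works on a copy of the input so the caller's string is not modified.

```perl
use strict;
use warnings;

# Hypothetical wrapper (the name is an assumption): collects every item
# matched by the three-alternative lookaround pattern from the answer.
sub items_by_lookaround {
    my ($var) = @_;
    $var =~ tr/ //s;    # squeeze runs of spaces, as in the answer
    my @arr;
    while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
        push @arr, $&;  # each match is a single list item
    }
    return @arr;
}

print join("|", items_by_lookaround("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig")), "\n";
```

Unlike the split-based answers, this pattern returns the items one at a time, so no separate split on commas is needed. Note that it relies on \b, so items that begin or end with a non-word character may not be picked up.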