Question

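I'm writing a Perl program (script?) that reads through a text file, identifies all names, and categorizes them as person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp., where there are multiple capitalized words in a row. I've been using: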

/([A-Z][a-z]+)+/

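to capture as many capitalized words in a row as there are on a given line. From what I understand, the + will match one or more instances of the pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeat [A-Z][a-z]+ twice, but then it doesn't find patterns with more than two capitalized words in a row. What am I doing wrong?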

PS Sorry if my use of vocabulary is off, I'm always bad with that.

Answer 1

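You're just missing the spacing between the words.

The following matches whitespace before every word except the first, so it covers the cases you describe: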

use strict;
use warnings;

while (<DATA>) {
    while (/(?=\w)((?:\s*[A-Z][a-z]+)+)/g) {
        print "$1\n";
    }
}

__DATA__
I'm doing a perl program (script?) that reads through a text file and identifies all names and categorizes them as either person, location, organization, or miscellaneous. I'm having trouble with things like New York or Pacific First Financial Corp. where there are multiple capitalized words in a row. I've been using:

to capture as many capitalized words in a row as there are on a given line. From what I understand the + will match 1 or more instances of such pattern, but it's only matching one (i.e. New in New York). For New York, I can just repeate the [A-Z][a-z]+ twice but it doesn't find patterns with more than 2 capitalized words in a row. What am I doing wrong?

PS Sorry if my use of vocabulary is off I'm always so bad with that.

Output:

New York
Pacific First Financial Corp
From
New
New York
For New York
What
Sorry

Answer 2

There is a CPAN module called Lingua::EN::NamedEntity that seems to do what you need. It might be worth a quick look.

Answer 3

How

The pattern in your question, /([A-Z][a-z]+)+/, matches capitalized words that follow each other with nothing in between, like this

This
ThisAndThat

but it doesn't match this as a single unit

Not This

It actually matches each word separately

Not
This

So we change the regex to /(?:[A-Z][a-z]+)(?:\s*[A-Z][a-z]+)*/. That's a bit of a mouthful, so let's break it down one piece at a time

(?: ... )      Groups like this don't capture, which is more efficient
[A-Z][a-z]+    Matches a capitalised word
\s*[A-Z][a-z]+ Matches a subsequent capitalised word, optionally starting with
               whitespace

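What - TL;DR

Putting it all together, we now have a regex that matches a capitalized word followed by any number of further capitalized words, with or without whitespace between them. So it matches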

This
ThisAndThat
Not This
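To see the difference concretely, here is a minimal sketch that runs the original pattern and the revised one over the three inputs above. The helper all_matches is a hypothetical name introduced only for this illustration; it returns the full text of every match rather than the captured groups.

```perl
use strict;
use warnings;

# Hypothetical helper: return the full text of every match of $re in $text.
# @- and @+ hold the start and end offsets of the last successful match.
sub all_matches {
    my ($re, $text) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, substr($text, $-[0], $+[0] - $-[0]);
    }
    return @found;
}

my $original = qr/([A-Z][a-z]+)+/;                       # pattern from the question
my $revised  = qr/(?:[A-Z][a-z]+)(?:\s*[A-Z][a-z]+)*/;   # pattern from this answer

for my $input ('This', 'ThisAndThat', 'Not This') {
    printf "%-12s original: %-12s revised: %s\n",
        $input,
        join('|', all_matches($original, $input)),
        join('|', all_matches($revised,  $input));
}
```

For 'Not This' the original pattern yields two separate matches (Not|This), while the revised pattern yields the single match 'Not This'.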

We can now factor this regex into named pieces to avoid repetition, and use it in code

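use strict;
use warnings;
use feature 'say';

my $CAPS_WORD = qr/[A-Z][a-z]+/;
my $FULL_RE   = qr/(?:$CAPS_WORD)(?:\s*$CAPS_WORD)*/;

my $string = 'I flew to New York';   # example input, not from the original answer
say $& if $string =~ /$FULL_RE/;     # $& holds the matched text: "New York"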

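Why

This answer offers an alternative to the already good one from @Miller. Both work fine, but this solution is considerably faster because it doesn't use a lookahead. In the timings below, the simple version (bench-simple.pl) is roughly 7 times faster than the lookahead version (bench-lookahead.pl).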

$ time ./bench-simple.pl
Running 100000 runs
800000 matches

real    0m2.869s
user    0m2.860s
sys     0m0.008s

$ time ./bench-lookahead.pl
Running 100000 runs
800000 matches

real    0m19.845s
user    0m19.831s
sys     0m0.012s
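The two benchmark scripts are not shown in the answer. A rough sketch of a comparable measurement, using the core Benchmark module, might look like the following. The sample text and the helper count_matches are assumptions made for illustration, so the ratio it reports will not necessarily match the figures above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed sample input; the original benchmark data is not given.
my $text = 'I moved from New York to Pacific First Financial Corp last year.';

my $simple    = qr/(?:[A-Z][a-z]+)(?:\s*[A-Z][a-z]+)*/;   # no lookahead
my $lookahead = qr/(?=\w)((?:\s*[A-Z][a-z]+)+)/;          # lookahead version

# Hypothetical helper: count how many times $re matches in $str.
sub count_matches {
    my ($re, $str) = @_;
    my $n = 0;
    $n++ while $str =~ /$re/g;
    return $n;
}

# Both patterns should find the same number of matches in the sample text.
printf "simple: %d matches, lookahead: %d matches\n",
    count_matches($simple, $text), count_matches($lookahead, $text);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    simple    => sub { count_matches($simple,    $text) },
    lookahead => sub { count_matches($lookahead, $text) },
});
```

Timings depend heavily on the input and the Perl version, so treat the output as indicative only.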

Perl: matching multiple capitalized words
