Question

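I have a string like this:

text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I want to extract the words in this text that start with a capital letter but do not follow a full stop. So [Takocok The New England Journal of Medicine] should be extracted, without [That's Allan].

I tried this regex, but it still extracts Allan and That's.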

t = re.findall(r"((?:[A-Z]\w+[ -]?)+)", text1)

Answer 1

Here is an option using re.findall:

import re

text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
matches = re.findall(r'(?:(?<=^)|(?<=[^.]))\s+([A-Z][a-z]+)', text1)
print(matches)

This prints:

['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']

Here is an explanation of the regex pattern:

(?:(?<=^)|(?<=[^.]))   assert that what precedes is either the start of the string,
                       or a non full stop character
\s+                    then match (but do not capture) one or more spaces
([A-Z][a-z]+)          then match AND capture a word starting with a capital letter
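
As a minimal sketch, the same pattern can be written with re.VERBOSE so that the explanation above sits next to each part. The variable names are only illustrative. The last line shows one consequence of [a-z]+ that may or may not matter for your data: a word with an inner capital letter is cut at that letter.

```python
import re

text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

# Same pattern as above, spread over several lines with comments (re.VERBOSE
# ignores whitespace and "#" comments inside the pattern).
pattern = re.compile(r"""
    (?:(?<=^)|(?<=[^.]))   # preceded by the start of the string or a non-full-stop character
    \s+                    # one or more whitespace characters, matched but not captured
    ([A-Z][a-z]+)          # capture a capital letter followed by lowercase letters
""", re.VERBOSE)

matches = pattern.findall(text1)
print(matches)

# [a-z]+ stops at the next capital letter, so only "Mc" is captured here.
inner_capital = pattern.findall(" met McDonald")
print(inner_capital)
```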

Answer 2

This should be the regex you are looking for:

(?<!\.)\s+([A-Z][A-Za-z]+)

See it on regex101 here: https://regex101.com/r/EoPqgw/1
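
For reference, here is a small sketch of how this pattern could be used from Python with the string from the question. The second half is not part of the answer: it shows a caveat I would expect with this kind of lookbehind (two spaces after a full stop let the match start at the second space), and the `stricter` pattern is only one possible way to guard against it.

```python
import re

text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

# Negative lookbehind: the whitespace before the word must not directly follow a full stop.
pattern = re.compile(r"(?<!\.)\s+([A-Z][A-Za-z]+)")
matches = pattern.findall(text1)
print(matches)

# Caveat: with two spaces after the full stop, the match can start at the
# second space, which is preceded by a space rather than a full stop.
two_spaces = pattern.findall("sedentary.  Allan Takocok.")
print(two_spaces)

# Hypothetical variant: also forbid whitespace in the lookbehind, so the match
# has to start at the beginning of the whitespace run.
stricter = re.compile(r"(?<![.\s])\s+([A-Z][A-Za-z]+)")
stricter_two_spaces = stricter.findall("sedentary.  Allan Takocok.")
print(stricter_two_spaces)
```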

Answer 3

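In this case it may be possible to find a regex, but it will get messy.

Instead, I suggest a two-step approach:

Split the text into tokens
Process these tokens to extract the interesting words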

tokens = [
    'sedentary',
    '.',
    ' ',
    'Allan',
    ' ',
    'Takocok',
    '.',
    ' ',
    'That\'s',
    …
]

This tokenization step is already complicated enough on its own.
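
A minimal sketch of such a tokenizer, assuming that a token is either a word (with optional inner apostrophes, as in That's), a run of whitespace, or any other single character. The pattern and the name `tokenize` are my own choices for illustration; real text will probably need more cases.

```python
import re

# Hypothetical token pattern: a word with optional inner apostrophes,
# a run of whitespace, or any other single character (punctuation, digits).
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\s+|[^A-Za-z\s]")

def tokenize(text):
    """Split text into word, whitespace and single-character tokens."""
    return TOKEN_RE.findall(text)

sample = "sedentary. Allan Takocok. That's the"
tokens = tokenize(sample)
print(tokens)

# Every character ends up in exactly one token, so joining restores the input.
print("".join(tokens) == sample)
```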

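With this token list, the actual requirement is easy to express, because you are now working with well-defined tokens instead of arbitrary character sequences.

I kept the spaces in the token list because you may want to distinguish "a.dotted.brand.name" or "www.example.org" from the dot at the end of a sentence.

With this token list, rules such as "must come right after a dot" can be expressed more easily than before.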

I expect your rules to become rather complicated over time, because you are processing natural language text. Hence the abstraction to tokens.
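
To make the second step concrete, here is a rough sketch of the rule from the question applied to a token list. The short token lists are written by hand for illustration. The function name and the decision to skip the very first token (treated as a sentence start) are assumptions of mine, not something the question specifies.

```python
def capitalized_not_after_full_stop(tokens):
    """Return capitalized words that do not directly follow '.' plus whitespace."""
    result = []
    for i, token in enumerate(tokens):
        if not token[:1].isupper():
            continue
        # Assumption: the first token starts a sentence, so it is skipped too.
        at_start = i == 0
        # A dot only ends a sentence here when whitespace follows it.
        after_full_stop = (i >= 2 and tokens[i - 1].isspace()
                           and tokens[i - 2] == '.')
        if not at_start and not after_full_stop:
            result.append(token)
    return result

# Hand-written sample tokens (a shortened sentence, for illustration only).
tokens = ['sedentary', '.', ' ', 'Allan', ' ', 'Takocok', '.', ' ',
          "That's", ' ', 'the', ' ', 'New', ' ', 'England', '.']
words = capitalized_not_after_full_stop(tokens)
print(words)

# Because whitespace is kept as a token, a dot inside a dotted name is not
# mistaken for the end of a sentence.
dotted = capitalized_not_after_full_stop(['www', '.', 'Example', '.', 'org'])
print(dotted)
```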
