Python - How to extract sentences containing a citation marker from a text file

Time: 2017-08-13 08:30:12

Tags: python regex nlp text-extraction citations

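For example, I have 3 sentences as shown below, and the middle one contains a citation (Warren and Pereira, 1982). The citation is always enclosed in parentheses in this format: (~string~ comma(,) ~space~ number~)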

  

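He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.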

I am using a regex to extract only the middle sentence, but it keeps printing all 3 sentences. The result should look like this:

  

The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).

2 answers:

Answer 0: (score: 1)

Setup: two texts that represent the cases of interest:

import re

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."

First, match the case where the citation is at the end of the sentence:

p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"

Then match the case where the citation is not at the end of the sentence:

p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"

Combine the two cases with the "|" regex operator:

p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")

Run:

>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]

>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]

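In both cases you get the sentence that contains the citation. Each result is a tuple with one slot per capture group, and the slot for the alternative that did not match is an empty string.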
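If plain strings are more convenient than tuples, the findall results can be flattened. The following is a minimal sketch, assuming the same two-alternative pattern as above; the helper name `cited_sentences` is hypothetical and not part of the original answer.

```python
import re

# Same two alternatives as p_main, written as raw strings.
p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")

def cited_sentences(text):
    """Hypothetical helper: return the matched sentences as plain strings."""
    # findall yields (group1, group2) tuples; exactly one group is non-empty
    # per match, so "first or second" picks the one that matched.
    return [first or second for first, second in p_main.findall(text)]

text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 "
        "(Warren and Pereira, 1982). CHAT-80 was a state of the art natural "
        "language system that was impressive on its own merits.")

for sentence in cited_sentences(text):
    print(sentence)
```

Note that both alternatives start with `\. `, so a citing sentence at the very beginning of the text would not be matched by this pattern.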

A good resource is the Python regular expression documentation and the accompanying Regular Expression HOWTO page.

Cheers

Answer 1: (score: 0)

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

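You can split the text into a list of sentences and then pick the ones that end with ")". This only covers citations placed at the very end of a sentence.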

sentences = text.split(".")[:-1]

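for sentence in sentences:
    # endswith() is safe on empty strings, unlike sentence[-1]
    if sentence.endswith(")"):
        # strip the leading space and restore the period removed by split(".")
        print(sentence.strip() + ".")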
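The split-and-filter idea can also cover citations that appear in the middle of a sentence by testing each sentence against a citation pattern instead of checking its last character. This is only a sketch: the pattern `CITATION` and the function name `sentences_with_citation` are assumptions based on the "(string, number)" format described in the question, and the naive split would still break on abbreviations such as "et al.".

```python
import re

# Assumed citation format from the question: "(some text, number)".
CITATION = re.compile(r"\([^()]+, [0-9]+\)")

def sentences_with_citation(text):
    """Hypothetical helper: return sentences that contain a citation marker."""
    # Split on whitespace that follows a period; the lookbehind keeps the
    # period attached to the sentence it ends.
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    return [s for s in sentences if CITATION.search(s)]

text = ("He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. "
        "The system, called BusTUC is built upon the classical system CHAT-80 "
        "(Warren and Pereira, 1982). CHAT-80 was a state of the art natural "
        "language system that was impressive on its own merits.")

for sentence in sentences_with_citation(text):
    print(sentence)
```

Compared with a single large regex, this keeps sentence splitting and citation detection separate, so either part can be swapped out, for example by using a proper sentence tokenizer.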