What would the regular expression be?

Time: 2015-04-04 11:22:35

Tags: python regex nltk

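I am using NLTK's punkt tokenizer to split a paragraph into sentences, but in some cases, such as the examples below, the tokenizer fails to detect the sentence boundary because the period is immediately followed by a number. I would like to use a regular expression to identify these cases and replace '.1,7,9' with '. 1,7,9', i.e. add a space between the period and the citation.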

Ex1. `This is a random sentence.1,7,9 This is a sentence followed by it.`
Ex2. `I love football.1,7,24 I also like cricket.`
Ex3. `ESD for undifferentiated cancers.[1][7] Cancers can be treatable.`

Expected output:

Ex1. This is a random sentence.
     1,7,9 This is a sentence followed by it.
Ex2. I love football.
     1,7,24 I also like cricket.
Ex3. ESD for undifferentiated cancers.
     [1][7] Cancers can be treatable.

Thanks.

2 answers:

Answer 0 (score: 1)

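The following regular expression replaces every period that is immediately followed by a non-whitespace character with a period plus a newline (`.` + `\n`), keeping the character that followed it: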

>>> import re
>>> s = "Ex1.  This is a random sentence.1,7,9 This is a sentence followed by it."
>>> print(re.sub(r'\.(\S)', r'.\n\1', s))
Ex1.  This is a random sentence.
1,7,9 This is a sentence followed by it.
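
Note that `\.(\S)` matches any period glued to a non-space character, so it would also break up decimals such as `3.14` or abbreviations such as `e.g.`. A narrower variant is sketched below. It inserts the space the question asks for, and only when the period follows a letter and is followed by a digit or a bracketed digit. The helper name `space_before_citation` and the assumption about what a citation looks like are illustrative, not part of the original answer.

```python
import re

# Assumption: a "citation" starts with a digit or "[digit" and is glued
# directly to a period that ends a word (the period follows a letter).
CITATION_AFTER_PERIOD = re.compile(r'(?<=[A-Za-z])\.(?=\[?\d)')

def space_before_citation(text):
    """Insert a space between a sentence-ending period and a glued citation."""
    return CITATION_AFTER_PERIOD.sub('. ', text)

examples = [
    "This is a random sentence.1,7,9 This is a sentence followed by it.",
    "ESD for undifferentiated cancers.[1][7] Cancers can be treatable.",
    "The value is 3.14 and e.g. this stays untouched.",
]
for line in examples:
    print(space_before_citation(line))
# This is a random sentence. 1,7,9 This is a sentence followed by it.
# ESD for undifferentiated cancers. [1][7] Cancers can be treatable.
# The value is 3.14 and e.g. this stays untouched.
```

The lookbehind and lookahead are zero-width, so only the period itself is replaced. A sentence that ends in a digit and is then followed by a citation (for example `in 2015.3,4`) would not be matched by this sketch.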


Answer 1 (score: 0)

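If the appended list of integers is a citation, it may be more useful to put the line break after the list of integers, so that the citation stays with the sentence it belongs to: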

>>> import re
>>> s = "Ex1.  This is a random sentence.1,7,9 This is a sentence followed by it."
>>> print(re.sub(r'(\.\S+\s)', r'\1\n', s))
Ex1.  This is a random sentence.1,7,9 
This is a sentence followed by it.
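
The same idea can be turned into a small splitter that returns a list of sentences with each citation attached to the sentence it follows. This is only a sketch: the function name `split_keeping_citations` is hypothetical, and it assumes a citation is a run of digits, commas, and square brackets glued to the period.

```python
import re

# Assumption: a citation consists only of digits, commas and square
# brackets, and follows the sentence-ending period without a space.
SENTENCE_END = re.compile(r'(\.[\d,\[\]]*)\s+')

def split_keeping_citations(text):
    """Split text after each period (plus any glued citation) followed by whitespace."""
    marked = SENTENCE_END.sub(r'\1\n', text)  # replace the whitespace with a newline
    return [part for part in marked.split('\n') if part]

print(split_keeping_citations(
    "This is a random sentence.1,7,9 This is a sentence followed by it."
))
# ['This is a random sentence.1,7,9', 'This is a sentence followed by it.']

print(split_keeping_citations(
    "ESD for undifferentiated cancers.[1][7] Cancers can be treatable."
))
# ['ESD for undifferentiated cancers.[1][7]', 'Cancers can be treatable.']
```

Unlike `\.\S+\s`, the character class keeps the match from swallowing ordinary words after a period, but the sketch is still naive: it also splits after abbreviations such as `e.g.` and after decimals followed by a space, so it is not a substitute for a real sentence tokenizer.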