Splitting text into sentences using regular expressions in Python

Date: 2018-02-20 07:14:50

Tags: python regex split

I am trying to split a sample text into a list of sentences with no delimiters and no whitespace at the end of each sentence.

Sample text:

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?

Into this (desired output):

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']

My code is currently:

import re

def sent_tokenize(text):
    # Split on sentence-ending punctuation, then trim surrounding spaces
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return sentences

However, this outputs (current output):

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']

Note the extra '' at the end.

Any ideas on how to remove the extra '' from the current output?

3 Answers:

Answer 0 (score: 9):

nltk's sent_tokenize

If you're in the business of NLP, I'd strongly recommend sent_tokenize from the nltk package.

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
['The first time you see The Second Renaissance it may look boring.',
 'Look at it at least twice and definitely watch part 2.',
 'It will change your view of the matrix.',
 'Are the human people the ones who started the war?',
 'Is AI a bad thing?']

It is a lot more robust than a regex and provides plenty of options to get the job done. More info can be found in the official documentation.

If you are picky about the trailing delimiters, you can use nltk.tokenize.RegexpTokenizer with a slightly different pattern:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[^.?!]+')
>>> list(map(str.strip, tokenizer.tokenize(text)))
['The first time you see The Second Renaissance it may look boring',
 'Look at it at least twice and definitely watch part 2',
 'It will change your view of the matrix',
 'Are the human people the ones who started the war',
 'Is AI a bad thing']

Regex-based re.split

If you must use re.split, then you need to modify your pattern by adding a negative lookahead:

>>> list(map(str.strip, re.split(r"[.!?](?!$)", text)))
['The first time you see The Second Renaissance it may look boring',
 'Look at it at least twice and definitely watch part 2',
 'It will change your view of the matrix',
 'Are the human people the ones who started the war',
 'Is AI a bad thing?']

The added (?!$) specifies that you split only if you have not yet reached the end of the line. Unfortunately, I am not sure the trailing delimiter on the last sentence can be reasonably removed without some extra post-processing.
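One possible cleanup, not part of the original answer: a minimal sketch that strips the leftover punctuation from the last element produced by the re.split approach (the rstrip step is a hypothetical post-processing choice).

import re

text = ("The first time you see The Second Renaissance it may look boring. "
        "Look at it at least twice and definitely watch part 2. "
        "It will change your view of the matrix. "
        "Are the human people the ones who started the war? "
        "Is AI a bad thing?")

# Split as in the answer above, then strip the delimiter left on the last sentence.
result = list(map(str.strip, re.split(r"[.!?](?!$)", text)))
result[-1] = result[-1].rstrip(".!?")  # hypothetical extra step, removes the trailing '?'
print(result)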

Answer 1 (score: 3):

You can use filter to remove the empty elements.

Example:

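A minimal sketch of the idea, assuming the goal is to drop the empty string from the question's sent_tokenize result (in Python 3, filter returns an iterator, so it is wrapped in list):

import re

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    # filter(None, ...) drops falsy items, including the trailing empty string
    return list(filter(None, sentences))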

Answer 2 (score: 0):

You can strip the paragraph before splitting it, or filter out the empty strings from the result.
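A minimal sketch of the first suggestion, assuming the trailing sentence-ending punctuation is removed before splitting so that no empty string is produced:

import re

def sent_tokenize(text):
    # Drop surrounding whitespace and the final delimiter before splitting
    text = text.strip().rstrip(".!?")
    return [sent.strip(" ") for sent in re.split(r"[.!?]", text)]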
