Identifying sentences within a text

Time: 2019-05-17 07:52:32

Tags: python regex python-3.x string

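I am having trouble correctly identifying sentences within a text for a few specific corner cases:

  1. If a dot, dot, dot (an ellipsis) is involved, it is not kept.
  2. If a " is involved.
  3. If a sentence accidentally starts with a lowercase letter.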

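This is how I have identified sentences in the text so far (source: Subtitles Reformat to end with complete sentence):

The re.findall part basically looks for a chunk of str that starts with a capital letter [A-Z], is followed by anything except the punctuation, and ends with a punctuation mark [\.?!].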

import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.

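Case 1: dot, dot, dot

The dot, dot, dot is not kept, because the pattern gives no instruction for what to do when three dots appear in a row. How can this be changed?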

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.

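Case 2: "

The " symbol is kept successfully inside a sentence, but, like the dots that follow the punctuation, it is dropped at the end of the sentence.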

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.

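Case 3: lowercase start of a sentence

If a sentence accidentally starts with a lowercase letter, the sentence is ignored. The aim would be to recognize that the previous sentence has ended (or that the text has just started) and that a new sentence therefore has to begin.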

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
  

We were able to respond to the first research question.

Thank you very much for any help!

Edit:

I tested:

import spacy
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

...but I get

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

4 answers:

Answer 0: (score: 2)

You can use an industrial-strength package for this. For example, spaCy has a very good sentence tokenizer.

from __future__ import unicode_literals, print_function
# spacy.en is the spaCy 1.x import path; from spaCy 2.0 on the class lives in spacy.lang.en
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

Your scenarios:

  1. Case result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']

  2. Case result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']

  3. Case result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']

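Answer 1: (score: 2)

You can modify your regex to match your corner cases.

First of all, you don't need to escape . inside [].

For the first corner case, you can greedily match the end-of-sentence token with [.!?]*.

For the second, you can match an optional " after [.!?].

For the last one, you can start with either an uppercase or a lowercase letter: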

import re

regex = r'([A-z][^.!?]*[.!?]*"?)'

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

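Explanation

  • [A-z], every match should start with a letter, either uppercase or lowercase. Note that the range A-z also covers the ASCII characters [ \ ] ^ _ ` that sit between Z and a; [A-Za-z] matches letters only.
  • [^.?!]*, greedily matches any character that is not ., ? or ! (the end-of-sentence characters).
  • [.?!]*, greedily matches the end-of-sentence characters, so something like ...??!!??? is matched as part of the sentence.
  • "?, finally matches an optional quote at the end of the sentence.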

Case 1:

We were able to respond to the first research question...
Next, we also determined the size of the population.

Case 2:

We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.

Case 3:

We were able to respond to the first research question.
next, we also determined the size of the population.
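
Two details of the pattern in this answer may be worth checking before reusing it. The range [A-z] also accepts a few non-letter characters, and [.!?]* accepts zero terminators, so a trailing fragment without any punctuation is still returned as a sentence. The sketch below illustrates both points; TIGHT and split_sentences are hypothetical names introduced here, not part of the answer, and whether unterminated fragments should be dropped is an assumption that depends on your data.

```python
import re

# Pattern from the answer above (capture group omitted, same matches).
LOOSE = re.compile(r'[A-z][^.!?]*[.!?]*"?')

# Hypothetical variant: only ASCII letters may start a sentence, and at
# least one terminator is required, so unterminated fragments are dropped.
TIGHT = re.compile(r'[A-Za-z][^.!?]*[.!?]+"?')


def split_sentences(text, pattern=TIGHT):
    """Return the non-overlapping matches of pattern, stripped of spaces."""
    return [match.strip() for match in pattern.findall(text)]


sample = "_draft one... next step? unfinished tail"
print(split_sentences(sample, LOOSE))
print(split_sentences(sample, TIGHT))
```

With LOOSE the leading underscore and the unterminated tail both end up in the result; with TIGHT neither does.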

Answer 2: (score: 1)

You can use nltk's sent_tokenize. It saves you a lot of hassle.

from nltk import sent_tokenize
# Corner Case 1: Dot, Dot, Dot
text_dot_dot_dot = "We were able to respond to the first research question... Next, we also determined the size of the population."
print("Corner Case 1: ", sent_tokenize(text_dot_dot_dot))
# Corner Case 2: "
text_ = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
print("Corner Case 2: ", sent_tokenize(text_))
# Corner Case 3: lower case
text_lower = "We were able to respond to the first research question. next, we also determined the size of the population."
print("Corner Case 3: ", sent_tokenize(text_lower))

Result:

Corner Case 1:  ['We were able to respond to the first research question... Next, we also determined the size of the population.']
Corner Case 2:  ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Corner Case 3:  ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
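
In the result above, the first corner case comes back as a single item. One possible post-processing step, sketched here with only the standard library, is to split each returned item again wherever an ellipsis is followed by whitespace and an uppercase letter. split_after_ellipsis is a hypothetical helper, and treating every such ellipsis as a sentence end is an assumption that will not hold for all texts.

```python
import re

# Assumption: "..." followed by whitespace and an uppercase letter
# marks a sentence boundary.
ELLIPSIS_BOUNDARY = re.compile(r'(?<=\.\.\.)\s+(?=[A-Z])')


def split_after_ellipsis(sentences):
    """Split each item at ellipsis boundaries and flatten the result."""
    result = []
    for sentence in sentences:
        result.extend(ELLIPSIS_BOUNDARY.split(sentence))
    return result


# The single item that was returned for the first corner case above.
tokenized = ["We were able to respond to the first research question... "
             "Next, we also determined the size of the population."]
print(split_after_ellipsis(tokenized))
```

An ellipsis followed by a lowercase word, as in "Wait... what?", is left untouched by this rule.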

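Answer 3: (score: 0)

Try the following regex: ([A-Z][^.!?]*[.!?]+["]?)

'+' means one or more

'?' means zero or one

This handles corner cases 1 and 2 from your question. Corner case 3 is still missed, because the pattern requires an uppercase first letter; widening [A-Z] to [A-Za-z] would cover it as well.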
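
A quick way to check a candidate pattern against the three corner cases from the question is to run it over each text and count the pieces it returns. The sketch below does that for the pattern in this answer and for a variant that also accepts a lowercase first letter. UPPER_ONLY, ANY_CASE and CASES are names introduced here for illustration only.

```python
import re

# Pattern from this answer, written without the capture group.
UPPER_ONLY = re.compile(r'[A-Z][^.!?]*[.!?]+"?')
# Hypothetical variant that also allows a lowercase first letter.
ANY_CASE = re.compile(r'[A-Za-z][^.!?]*[.!?]+"?')

# The three texts from the question.
CASES = {
    "ellipsis": "We were able to respond to the first research question... "
                "Next, we also determined the size of the population.",
    "quote": 'We were able to respond to the first "research" question: '
             '"What is this?" Next, we also determined the size of the population.',
    "lowercase": "We were able to respond to the first research question. "
                 "next, we also determined the size of the population.",
}

for name, text in CASES.items():
    # Case name, then the number of sentences found by each pattern.
    print(name, len(UPPER_ONLY.findall(text)), len(ANY_CASE.findall(text)))
```

Each printed line shows the case name followed by the number of sentences that each pattern finds; every text in the question contains two sentences.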