Question

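Should I use NLTK or regular expressions to split it?
How do I select on pronouns (he/she)? I want to select any sentence that contains a pronoun.

This is part of a larger project, and I am new to Python. Could you point me to any useful code?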

Answer 1

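I am working on an NLP project with similar requirements. I suggest you use NLTK, because it makes things very simple and gives you a lot of flexibility. Since you need to collect all sentences with pronouns, you can split the text into sentences and store them in a list. Then you can iterate over the list and look for sentences that contain a pronoun. Also, make sure to record each sentence's index (in the list), or build a new list of the matching sentences.

Sample code below: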

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []

for sentence in sentences:
    # Tag the whole sentence at once so the tagger can use the surrounding words as context.
    tagged_words = pos_tag(word_tokenize(sentence))
    # 'PRP' is the Penn Treebank tag for personal pronouns.
    if any(tag == 'PRP' for _, tag in tagged_words):
        sentences_with_pronouns.append(sentence)

print(sentences_with_pronouns)

Output:

['she also loves to play chess with him']
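The suggestion above to record each sentence's index could be sketched roughly as follows. This is only an illustration: the names `default_tagger` and `pronoun_sentence_indices` are made up for this example, and treating both 'PRP' (personal) and 'PRP$' (possessive) as pronouns is an assumption about what should count. NLTK is imported lazily inside the default tagger, so a different tagger can be passed in instead.

```python
# Assumption: these two Penn Treebank tags are what "pronoun" should mean here.
PRONOUN_TAGS = {'PRP', 'PRP$'}


def default_tagger(sentence):
    """Tag one sentence with NLTK (needs NLTK and its tokenizer/tagger data)."""
    from nltk import pos_tag, word_tokenize  # imported only when actually used
    return pos_tag(word_tokenize(sentence))


def pronoun_sentence_indices(sentences, tagger=default_tagger):
    """Return the list positions of the sentences that contain a pronoun.

    `tagger` is a hypothetical hook: any callable that maps a sentence
    to a list of (word, tag) pairs.
    """
    indices = []
    for index, sentence in enumerate(sentences):
        tags = {tag for _, tag in tagger(sentence)}
        if tags & PRONOUN_TAGS:  # at least one pronoun tag is present
            indices.append(index)
    return indices
```

Calling `pronoun_sentence_indices(sentences)` on the list from the sample code gives the positions of the matching sentences, which can then be used to look the sentences up in the original list.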

P.S. Also check out the pylucene and whoosh libraries; they are very useful Python packages for text indexing and search.

Answer 2

NLTK is your best bet. Given a string of sentences as input, you can get a list of the sentences that contain pronouns by doing the following:

from nltk import pos_tag, sent_tokenize, word_tokenize

paragraph = "This is a sentence with no pronouns. Take it or leave it."
# Keep each sentence whose set of POS tags contains 'PRP' (personal pronoun).
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])

Returns:

['Take it or leave it.']

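Basically, we split the string into a list of sentences, split each sentence into a list of words, and convert each sentence's word list into a set of part-of-speech tags. This matters: testing the tag set once per sentence, rather than looping over every tag, means a sentence with several pronouns is not added to the result more than once.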
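On the question of NLTK versus regular expressions, a regex-only version might look like the sketch below, shown for comparison and not as a replacement for the tagger-based approach. The pronoun list is an assumed, non-exhaustive one, the sentence splitter is deliberately naive (it breaks on abbreviations such as "Dr."), and the function names are invented for this example. It also joins the selected sentences back into a new paragraph, which is what the question asks for.

```python
import re

# Assumed, non-exhaustive list of English personal pronouns.
PRONOUNS = ("i", "me", "you", "he", "him", "she", "her",
            "it", "we", "us", "they", "them")
PRONOUN_RE = re.compile(r"\b(?:" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def sentences_with_pronouns(text):
    """Return the sentences of `text` that contain a listed pronoun."""
    sentences = [s for s in SENTENCE_SPLIT_RE.split(text.strip()) if s]
    return [s for s in sentences if PRONOUN_RE.search(s)]


def pronoun_paragraph(text):
    """Join the pronoun-bearing sentences into a new paragraph."""
    return " ".join(sentences_with_pronouns(text))


paragraph = "This is a sentence with no pronouns. Take it or leave it."
print(pronoun_paragraph(paragraph))  # prints: Take it or leave it.
```

A word list cannot tell parts of speech apart (for example, "her" can be an object pronoun or a possessive), which is the main reason a tagger such as the one in NLTK is usually the safer choice.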

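How can I go through several paragraphs of text, check whether any sentence contains a pronoun, and select all of those sentences to make a new paragraph?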