How to remove duplicate sentences from a paragraph using NLTK?

Asked: 2020-06-11 20:16:08

Tags: python-3.x nlp nltk

I have a huge document that contains many repeated sentences (footer text, hyperlinks with alphanumeric characters), and I need to get rid of those repeated hyperlinks or footer lines. I tried the code below but unfortunately could not get it to work. Please take a look and help.

corpus = "We use file handling methods in python to remove duplicate lines in python text file or function. The text file or function has to be in the same directory as the python program file. Following code is one way of removing duplicates in a text file bar.txt and the output is stored in foo.txt. These files should be in the same directory as the python script file, else it won’t work.Now, we should crop our big image to extract small images with amounts.In terms of topic modelling, the composites are documents and the parts are words and/or phrases (phrases n words in length are referred to as n-grams).We use file handling methods in python to remove duplicate lines in python text file or function.As an example I will use some image of a bill, saved in the pdf format. From this bill I want to extract some amounts.All our wrappers, except of textract, can’t work with the pdf format, so we should transform our pdf file to the image (jpg). We will use wand for this.Now, we should crop our big image to extract small images with amounts."

from nltk.tokenize import sent_tokenize
sentences_with_dups = []
for sentence in corpus:
    words = sentence.sent_tokenize(corpus)
    if len(set(words)) != len(words):
        sentences_with_dups.append(sentence)
        print(sentences_with_dups)
    else:
        print('No duplciates found')

Error message from the code above:

AttributeError: 'str' object has no attribute 'sent_tokenize'

Desired output:

Duplicates = ['We use file handling methods in python to remove duplicate lines in python text file or function.','Now, we should crop our big image to extract small images with amounts.']

Cleaned_corpus = {removed duplicates from corpus}

1 Answer:

Answer 0 (score: 1)

First of all, the sample you provided is mangled: many of the spaces between a closing full stop and the next sentence are missing, so I cleaned it up first.

Then you can do:

from nltk.tokenize import sent_tokenize  # sent_tokenize needs to be imported

corpus = "......"
sentences = sent_tokenize(corpus)

duplicates = list(set([s for s in sentences if sentences.count(s) > 1]))
cleaned = list(set(sentences))

The above will scramble the sentence order. If you care about the order, you can preserve it like this:

duplicates = []
cleaned = []
for s in sentences:
    if s in cleaned:
        # seen before: record it as a duplicate only once
        if s not in duplicates:
            duplicates.append(s)
    else:
        # first occurrence: keep it, preserving the original order
        cleaned.append(s)
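As a side note, on Python 3.7+ the order-preserving deduplication can be written more compactly with dict.fromkeys, since dict keys keep insertion order. The sketch below uses a hand-written list of sentences purely for illustration; in practice you would pass the output of sent_tokenize instead:

```python
# Illustrative stand-in for sent_tokenize(corpus)
sentences = [
    "We use file handling methods in python.",
    "Now, we should crop our big image.",
    "We use file handling methods in python.",
    "From this bill I want to extract some amounts.",
]

# dict keys preserve insertion order (Python 3.7+), so this removes
# duplicates while keeping the first occurrence of each sentence
cleaned = list(dict.fromkeys(sentences))

# Collect sentences that occurred more than once, in first-seen order
seen = set()
duplicates = []
for s in sentences:
    if s in seen and s not in duplicates:
        duplicates.append(s)
    seen.add(s)

# Rebuild the deduplicated paragraph
cleaned_corpus = " ".join(cleaned)
```

Here `cleaned` keeps one copy of each sentence in original order, `duplicates` lists only the repeated ones, and `cleaned_corpus` joins the result back into a paragraph, which matches the desired output in the question.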
