Question

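I have a pandas DataFrame containing articles scraped from a website, one article per row. I have about 100,000 such articles.

Here is a glimpse of my dataset.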

text
0   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
for those who werent as productive as they would have liked during the first half of 2018
28  for those who werent as productive as they would have liked during the first half of 2018
29  for those who werent as productive as they would have liked during the first half of 2018
30  for those who werent as productive as they would have liked during the first half of 2018
31  for those who werent as productive as they would have liked during the first half of 2018
32  for those who werent as productive as they would have liked during the first half of 2018

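These are the preambles of each text, and they are repeated. The main text comes after them.

Is there a method or function that can identify these repeated preambles and strip them out in a few lines of code?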

Answer 1

I think you could use difflib in some way, for example:

>>> import difflib
>>> a = "my mother always told me to mind my business" 
>>> b = "my mother always told me to be polite"
>>> s = difflib.SequenceMatcher(None,a,b)
>>> s.find_longest_match(0,len(a),0,len(b))

Output:

Match(a=0, b=0, size=28)

Here a=0 means the matching sequence starts at character 0 of string a, and b=0 means it starts at character 0 of string b.

Now, if you do this:

>>> b.replace(a[:28],"")

The result will be:

'be polite'

If you assign c = s.find_longest_match(0,len(a),0,len(b)), then c[0] = 0, c[1] = 0 and c[2] = 28.

You can read more here: https://docs.python.org/2/library/difflib.html
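One caveat with the snippet above: find_longest_match returns the longest common block anywhere in the two strings, and b.replace removes every occurrence of that block, not just the leading one. A minimal sketch that only strips when the match really is a shared prefix might look like this (the helper name strip_shared_prefix is made up for illustration):

```python
import difflib


def strip_shared_prefix(a, b):
    """Return b without the prefix it shares with a (hypothetical helper)."""
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which can hide matches in long strings.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # Only treat the match as a prefix when it starts at index 0 in both strings.
    if match.size > 0 and match.a == 0 and match.b == 0:
        # Slice instead of str.replace so later repeats of the text are kept.
        return b[match.size:]
    return b


a = "my mother always told me to mind my business"
b = "my mother always told me to be polite"
print(strip_shared_prefix(a, b))
```

If the longest common block sits somewhere other than the start, this sketch leaves b unchanged, even when a shorter shared prefix exists.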

Answer 2

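If you want to remove strings that are exactly identical, sort the DataFrame and then walk through it in order. (This is similar to what Nerdrigo mentioned in the comments.)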

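sents = ... # sorted dataframe
out = [] # stuff here will be unique
for ii in range(len(sents) - 1):
    # keep a sentence only when it differs from the one after it
    if sents[ii] != sents[ii + 1]:
        out.append(sents[ii])
# the last sentence has no successor to compare against, so keep it
if len(sents) > 0:
    out.append(sents[len(sents) - 1])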
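For exact duplicates, sorting is not strictly required. As a sketch, assuming the texts have been pulled out into a plain Python list (the sample values here are invented), dict.fromkeys drops repeats while keeping first-seen order; on the DataFrame itself, pandas' DataFrame.drop_duplicates serves the same purpose.

```python
# Invented sample data standing in for the scraped column.
texts = [
    "for those who werent as productive",
    "which brings not only warmer weather",
    "for those who werent as productive",
]

# dict keys are unique and preserve insertion order (Python 3.7+),
# so this keeps the first occurrence of every string.
unique_texts = list(dict.fromkeys(texts))
print(unique_texts)
```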

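If you want to remove sentences that are very similar but not identical, the problem becomes much trickier and there is no simple solution. You would need to look into locality-sensitive hashing or near-duplicate detection. The datasketch library may help.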
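To make "very similar" concrete, here is a small standard-library sketch of Jaccard similarity over word shingles, which is the quantity that MinHash-style methods estimate. It is not locality-sensitive hashing and does not use datasketch: it compares every text against every kept text, so it illustrates the idea but will not scale to 100,000 articles. The function names, the shingle size k=3 and the threshold are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def drop_near_duplicates(texts, threshold=0.8):
    """Keep a text only if it is below the threshold against every kept text."""
    kept = []
    for text in texts:
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept
```

A call such as drop_near_duplicates(list_of_texts, threshold=0.7) returns the texts that survive; the right threshold depends on the data.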

Based on your comments, I think I finally understand: you want to remove the common prefix. In that case, modify the code above as follows:

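sents = ... # sorted dataframe
out = [] # cleaned sentences go here
lml = -1 # last match length (-1 means no prefix has been found yet)
for ii in range(len(sents) - 1):
    # first check if the match from the last iteration still works
    if lml >= 0 and sents[ii][:lml] == sents[ii+1][:lml] and sents[ii][:lml + 1] != sents[ii+1][:lml + 1]:
        # old prefix still worked, chop and move on
        out.append(sents[ii][lml:])
        continue

    # if we're here, it means the prefix changed
    # note: the last row of a prefix group is compared with the first row
    # of the next group, so its prefix is left in place
    ml = 0 # match length
    # find the longest matching prefix, stopping at the end of the shorter string
    limit = min(len(sents[ii]), len(sents[ii+1]))
    while ml < limit and sents[ii][ml] == sents[ii+1][ml]:
        ml += 1

    # save the prefix length
    lml = ml
    # chop off the shared prefix
    out.append(sents[ii][ml:])

# the last sentence has no successor, so reuse the prefix length it shared with the previous one
if len(sents) > 0:
    out.append(sents[len(sents) - 1][max(lml, 0):])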
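A variation on the same idea, sketched under a few assumptions: the texts are in a plain Python list, and each row is compared with both of its sorted neighbours so that the last row of a prefix group is also cleaned. The function name strip_repeated_prefixes and the min_len cutoff are made up for illustration; os.path.commonprefix is used only because it compares strings character by character.

```python
import os


def strip_repeated_prefixes(texts, min_len=10):
    """Strip from each text the longest prefix it shares with a sorted neighbour."""
    # Sort indices rather than texts so results come back in the original order.
    order = sorted(range(len(texts)), key=lambda i: texts[i])
    cleaned = list(texts)
    for pos, idx in enumerate(order):
        best = 0
        # Texts sharing a prefix end up adjacent after sorting,
        # so only the previous and next neighbours need checking.
        for neighbour in (pos - 1, pos + 1):
            if 0 <= neighbour < len(order):
                shared = os.path.commonprefix([texts[idx], texts[order[neighbour]]])
                best = max(best, len(shared))
        # Ignore short accidental overlaps such as a shared first word.
        if best >= min_len:
            cleaned[idx] = texts[idx][best:]
    return cleaned
```

Note that if two main texts happen to begin with the same characters right after the preamble, those characters are stripped as well, and exact duplicate rows come back as empty strings.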

Remove multiple repeated texts from pandas rows
