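Python: How do I remove duplicate words from a string when they are not adjacent to each other?

Time: 2016-08-21 16:47:41

Tags: python string

In the example below, I need to remove only the third "animale" from the string. How can I do this?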

a = 'animale animale eau toilette animale'

Second "animale": do not remove

Third "animale": remove

4 answers:

Answer 0: (score: 1)

How about this:

from collections import defaultdict

def remove_no_adjacent_duplicates(string):
    position = defaultdict(list)
    words = string.split()
    for i,w in enumerate(words):
        position[w].append(i)
    for w,pos_list in position.items():
        adjacent = set()
        for i in range(1,len(pos_list)):
            if pos_list[i-1] +1 == pos_list[i]:
                adjacent.update( (pos_list[i-1],pos_list[i]) )
        if adjacent:
            position[w] = adjacent
        else:
            position[w] = pos_list[:1]
    return " ".join( w for i,w in enumerate(words) if i in position[w] )

print( remove_no_adjacent_duplicates('animale animale eau toilette animale') )
print( remove_no_adjacent_duplicates('animale animale eau toilette animale eau eau' ) )
print( remove_no_adjacent_duplicates('animale eau toilette animale eau eau' ) )
print( remove_no_adjacent_duplicates('animale eau toilette animale eau de eau de toilette' ) )

Output

animale animale eau toilette
animale animale toilette eau eau
animale toilette eau eau
animale eau toilette de

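Explanation

First I record the positions of every word in the position dictionary. Then, for each word, I check whether any of its positions are adjacent, and if so I save them in a set. If any adjacent positions were found, I replace that word's position list with the set of adjacent positions; otherwise I drop every saved position except the first. Finally, I rebuild the string.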
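To make the intermediate data concrete, here is a minimal sketch of the first two steps for the question's input. The helper names word_positions and adjacent_positions are hypothetical and exist only for this illustration; they are not part of the answer's code.

```python
from collections import defaultdict

def word_positions(text):
    """Map each word to the list of indexes where it occurs."""
    positions = defaultdict(list)
    for i, w in enumerate(text.split()):
        positions[w].append(i)
    return dict(positions)

def adjacent_positions(pos_list):
    """Return the indexes that sit right next to another occurrence of the same word."""
    adjacent = set()
    for prev, cur in zip(pos_list, pos_list[1:]):
        if prev + 1 == cur:
            adjacent.update((prev, cur))
    return adjacent

positions = word_positions('animale animale eau toilette animale')
print(positions)
# {'animale': [0, 1, 4], 'eau': [2], 'toilette': [3]}
print(adjacent_positions(positions['animale']))
# {0, 1}
print(adjacent_positions(positions['eau']))
# set()
```

For "animale" the adjacent set {0, 1} replaces the list [0, 1, 4], so index 4 is dropped when the string is rebuilt. For "eau" the set is empty, so only its first position is kept.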

Answer 1: (score: 0)

a = "animale animale eau toilette animale"

words = a.split()

cleaned_words = []
skip = False
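for i in range(len(words)):
    word = words[i]
    if skip:
        # second word of an adjacent pair: always keep it
        cleaned_words.append(word)
        skip = False
    # None marks "no next word", so the last word is still processed
    next_word = words[i+1] if i + 1 < len(words) else None
    if word == next_word:
        # first word of an adjacent pair: keep it and flag the next one
        cleaned_words.append(word)
        skip = True
        continue
    if word not in cleaned_words:
        cleaned_words.append(word)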

print(cleaned_words)

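A rather ugly, crude solution, but it gets the job done.

Answer 2: (score: 0)

If I understand your question correctly, you want to remove any words that are repeated but not adjacent. I think this solution works for that: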

from collections import defaultdict

def remove_duplicates(s):
    result = []
    word_counts = defaultdict(int)
    words = s.split()
    # count the frequency of each word
    for word in words:
        word_counts[word] += 1
    # loop through all words, and only add to result if either it occurs only once or occurs more than once and the next word is the same as the current word.
    for i in range(len(words)-1):
        curr_word = words[i]
        if word_counts[curr_word] > 1:
            if words[i+1] == curr_word:
                result.append(curr_word)
                result.append(curr_word)
                word_counts[curr_word] = -1    # mark as -1 so as not to add again
                # no manual index bump is needed: the -1 mark above makes the loop skip the next word
            # if there are only two occurrences of the word left but they aren't adjacent, add one and mark the counts so you don't add it again.
            elif word_counts[curr_word] < 3:
                result.append(curr_word)
                word_counts[curr_word] = -1    # mark as -1 so as not to add again
            # not adjacent but more than 2 occurrences left so decrement number of occurrences left
            else:
                word_counts[curr_word] -= 1 
        elif word_counts[curr_word] == 1:
            result.append(curr_word)
            word_counts[curr_word] = -1
    # Fix off by one error by checking last index
    if word_counts[words[-1]] == 1:
        result.append(words[-1]) 
    return ' '.join(result)

I think this works for any case where the duplicate words are not adjacent, including @Dartmouth's example 'animale animale eau toilette animale eau eau'.

Sample inputs and outputs:

 Inputs                                               Outputs
 =============================================       =========================================
'animale animale eau toilette animale'                  ---->     'animale animale eau toilette'
'animale animale eau toilette animale eau eau'          ---->     'animale animale toilette eau eau'
'animale eau toilette animale eau eau'                  ---->     'animale toilette eau eau' 
'animale eau toilette animale eau de eau de toilette'   ---->     'animale toilette eau de'
'animale animale eau toilette animale eau eau compte'   ---->     'animale animale toilette eau eau compte'
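One detail of the loop in remove_duplicates is worth spelling out. In a Python for loop over range(), reassigning the loop variable does not skip an iteration, so what keeps the second word of an adjacent pair from being added again is the -1 marker in word_counts. The function visited_indexes below is a hypothetical helper written only to show this behaviour.

```python
def visited_indexes(n):
    """Return the indexes a for loop visits even though the body increments i."""
    visited = []
    for i in range(n):
        visited.append(i)
        i += 1  # rebinds the local name only; range() still yields the next value
    return visited

print(visited_indexes(4))  # [0, 1, 2, 3]
```

If an index really had to be skipped, a while loop with a manually managed counter would be the usual alternative.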

Answer 3: (score: 0)

This one works for both:

'animale animale eau toilette animale'

'animale animale eau toilette animale eau eau'

Here is the code:

from collections import Counter


def cleanup(words):
    splitted = words.split()
    counter = Counter(splitted)
    more_than_one = [x for x in counter.keys() if counter[x] > 1]
    orphan_indexes = []

    before = True

    for i in range(len(splitted)):
        # True when the previous word differs (or there is no previous word)
        before = i == 0 or splitted[i] != splitted[i-1]
        # True when the next word differs (or there is no next word)
        after = i + 1 == len(splitted) or splitted[i] != splitted[i+1]
        if before and after:
            if splitted[i] in more_than_one:
                orphan_indexes.append(i)

    return ' '.join([
        item for i, item in enumerate(splitted)
        if i not in orphan_indexes
    ])


print(cleanup('animale animale eau toilette animale'))
print(cleanup('animale animale eau toilette animale eau eau'))

Result:

animale animale eau toilette
animale animale toilette eau eau
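For comparison, the same idea can be sketched with itertools.groupby, which groups consecutive equal words into runs. This is an alternative technique, not the code from any answer in this thread, and the function name keep_adjacent_runs is hypothetical. It assumes that a word with no adjacent pair anywhere should keep its first occurrence, which is how Answer 0 behaves.

```python
from itertools import groupby

def keep_adjacent_runs(text):
    # Collapse the word list into runs of identical neighbours: [(word, run_length), ...]
    runs = [(word, len(list(group))) for word, group in groupby(text.split())]
    # Words that occur at least once as an adjacent run (length > 1)
    has_adjacent = {word for word, length in runs if length > 1}
    result = []
    seen = set()
    for word, length in runs:
        if word in has_adjacent:
            # Keep only the adjacent runs; isolated copies are dropped
            if length > 1:
                result.extend([word] * length)
        elif word not in seen:
            # No adjacent run anywhere: keep the first occurrence only
            result.append(word)
            seen.add(word)
    return " ".join(result)

print(keep_adjacent_runs('animale animale eau toilette animale'))
# animale animale eau toilette
print(keep_adjacent_runs('animale animale eau toilette animale eau eau'))
# animale animale toilette eau eau
```

Working on runs rather than individual indexes avoids the boundary checks for the first and last word.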