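Finding duplicates in a list of tuples

Time: 2017-11-20 03:01:41

Tags: python algorithm list duplicates tuples

You are given information about the users of your website. The information includes a username, a phone number and/or an email address. Write a program that takes a list of tuples, where each tuple represents the information for a particular user, and returns a list of lists, where each sublist contains the indices of the tuples that hold information about the same person. For example: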

Input:
[("MLGuy42", "andrew@example.com", "123-4567"),
("CS229DungeonMaster", "123-4567", "ml@example.net"),
("Doomguy", "john@example.org", "carmack@example.com"),
("andrew26", "andrew@example.com", "mlguy@example.com")]

Output:
[[0, 1, 3], [2]]

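since "MLGuy42", "CS229DungeonMaster" and "andrew26" are all the same person.

Each sublist in the output should be sorted, and the outer list should be sorted by the first element of each sublist.

Below is the code I wrote for this problem. It seems to work fine, but I would like to know whether there is a better or more optimized solution. Any help would be appreciated. Thanks!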

def find_duplicates(user_info):
    results = list()
    seen = dict()
    for i, user in enumerate(user_info):
        first_seen = True
        key_info = None
        for info in user:
            if info in seen:
                first_seen = False
                key_info = info
                break
        if first_seen:
            results.append([i])
            pos = len(results) - 1
        else:
            index = seen[key_info]
            results[index].append(i)
            pos = index
        for info in user:
            seen[info] = pos
    return results

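2 answers:

Answer 0: (score: 1)

I think I have arrived at an optimized, working solution using a graph. Basically, I build a graph in which each node holds a user's information and its index. Then I traverse the graph with DFS to find the duplicates.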
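The answer gives no code, so the following is only a sketch of how a graph-plus-DFS approach might look, not the answerer's actual implementation. It assumes two records are adjacent whenever they share any identifier; the names `find_duplicates_dfs` and `owners` are made up for illustration.

```python
def find_duplicates_dfs(user_info):
    # Hypothetical helper map: identifier -> indices of the records that mention it.
    owners = {}
    for i, user in enumerate(user_info):
        for info in user:
            owners.setdefault(info, []).append(i)

    visited = set()
    groups = []
    for start in range(len(user_info)):
        if start in visited:
            continue
        # Iterative DFS over records connected through shared identifiers.
        visited.add(start)
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for info in user_info[i]:
                # Pop the identifier so each one is expanded only once.
                for j in owners.pop(info, ()):
                    if j not in visited:
                        visited.add(j)
                        stack.append(j)
        groups.append(sorted(component))
    # Start indices ascend, so groups are already ordered by their first element.
    return groups
```

Because every record and every identifier is expanded once, the traversal itself is linear in the total number of fields. It also handles a record that links two groups created earlier, since connectivity is resolved per component rather than in a single left-to-right pass.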

Answer 1: (score: 0)

I think we can simplify it by using sets:

from random import shuffle

def find_duplicates(user_info):

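    # Group indices by identifier set so that identical tuples do not overwrite each other
    reduced = {}
    for i, info in enumerate(user_info):
        reduced.setdefault(frozenset(info), []).append(i)
    unreduced = reduced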

    while reduced is unreduced or len(unreduced) > len(reduced):

        unreduced = dict(reduced)  # make a copy

        for identifiers_1, positions_1 in unreduced.items():

            for identifiers_2, positions_2 in unreduced.items():

                if identifiers_1 is identifiers_2:
                    continue

                if identifiers_1 & identifiers_2:
                    del reduced[identifiers_1], reduced[identifiers_2]
                    # Keep any positions already stored under the merged key
                    merged = identifiers_1 | identifiers_2
                    reduced[merged] = reduced.get(merged, []) + positions_1 + positions_2
                    break
            else:  # no break
                continue

            break

    return sorted(sorted(value) for value in reduced.values())

my_input = [
    ("CS229DungeonMaster", "123-4567", "ml@example.net"),
    ("Doomguy", "john@example.org", "carmack@example.com"),
    ("andrew26", "andrew@example.com", "mlguy@example.com"),
    ("MLGuy42", "andrew@example.com", "123-4567"),
]

shuffle(my_input)  # shuffle to prove order independence

print(my_input)
print(find_duplicates(my_input))

Output

> python3 test.py
[('CS229DungeonMaster', '123-4567', 'ml@example.net'), ('MLGuy42', 'andrew@example.com', '123-4567'), ('andrew26', 'andrew@example.com', 'mlguy@example.com'), ('Doomguy', 'john@example.org', 'carmack@example.com')]
[[0, 1, 2], [3]]
>
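The loop above rescans every pair of sets after each merge. If that becomes slow on larger inputs, a disjoint-set (union-find) structure is one alternative for tracking the same merges. This is a different technique from the set-merging code above, shown only as a sketch; the names `find_duplicates_union_find` and `first_owner` are made up for illustration.

```python
def find_duplicates_union_find(user_info):
    # parent[i] points toward the representative record of i's group.
    parent = list(range(len(user_info)))

    def find(i):
        # Walk up to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # identifier -> index of the first record that used it
    for i, user in enumerate(user_info):
        for info in user:
            if info in first_owner:
                root_a, root_b = find(i), find(first_owner[info])
                if root_a != root_b:
                    # Keep the smaller index as the root of the merged group.
                    parent[max(root_a, root_b)] = min(root_a, root_b)
            else:
                first_owner[info] = i

    # Collect indices by root; indices are visited in ascending order,
    # so each sublist comes out sorted.
    groups = {}
    for i in range(len(user_info)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each field is looked up once and each merge is a couple of short pointer walks, so no pass over all pairs of groups is needed. Calling it on the shuffled input printed above should give the same grouping, `[[0, 1, 2], [3]]`.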