Is there a Pythonic way to clean up this list of dictionaries?

Date: 2019-06-20 12:19:16

Tags: python dictionary duplicates

Hello, and thanks for your help. I have a list of dictionaries that looks like this:

list_balls = [{'id': '803371', 'is_used': False, 'source': 'store', 'air': 0.9},
{'id': '803371', 'is_used': False, 'source': 'donation', 'air': 0.20},
{'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75},
{'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}]

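I need to clean up this list so that only unique dictionaries remain. If two or more entries share the same id, I need to keep the one with the highest air value. If their air and id values are equal, I need to keep the one with source == 'store'. So the result in this case would be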

list_balls = [{'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75},
{'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}]

I tried marking the entries to be removed with keep = False using the code below, but it only works when there are exactly two duplicates:

for i in range(0, len(list_balls)):
    if len(list_balls) > 1:
        #print(list_balls[i])
        for j in range(1, len(list_balls)):
            if (list_balls[i]['id'] == list_balls[j]['id']):
                if (list_balls[i]['air'] > list_balls[j]['air']):
                    list_balls[i]['keep'] = True
                    list_balls[j]['keep'] = False
print(list_balls)

I also don't think this double for loop is the most efficient way to do this, so any other ideas are welcome. Thanks for your help.

6 answers:

Answer 0 (score: 1)

Use itertools.groupby.

For example:

from itertools import groupby
list_balls = [{'source': 'store', 'air': 0.9, 'id': '803371', 'is_used': False}, {'source': 'donation', 'air': 0.2, 'id': '803371', 'is_used': False}, {'source': 'donation', 'air': 0.75, 'id': '30042', 'is_used': False}, {'source': 'store', 'air': 1, 'id': '803371', 'is_used': False}]


#result = [max(list(v), key=lambda x: x["air"]) for k, v in groupby(sorted(list_balls, key=lambda x: x["id"]), lambda x: x["id"])]
result = [max(list(v), key=lambda x: (x["air"], x["source"] == "store")) for k, v in groupby(sorted(list_balls, key=lambda x: x["id"]), lambda x: x["id"])]
print(result)

Output:

[{'air': 0.75, 'id': '30042', 'is_used': False, 'source': 'donation'},
 {'air': 1, 'id': '803371', 'is_used': False, 'source': 'store'}]

Answer 1 (score: 1)

Simply do this:

list_balls = [{'id': '803371', 'is_used': False, 'source': 'store', 'air': 0.9},
{'id': '803371', 'is_used': False, 'source': 'donation', 'air': 0.20},
{'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75},
{'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}]

result = {}

for e in list_balls:
    if e['id'] not in result or (
          (e['air'], e['source'] == 'store') > 
          (result[e['id']]['air'], result[e['id']]['source'] =='store')
        ):
        result[e['id']] = e

result_list = list(result.values())

print(result_list)

This prints

[{'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}, {'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75}]

You can compare tuples directly to compare on multiple criteria. Note that True is always greater than False (1 > 0).
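
As a small illustration of that comparison (a sketch only; the two dictionaries `a` and `b` and the helper `rank` are made up for this example and are not part of the question's data), tuples are compared element by element, so the boolean only matters when the air values are equal:

```python
# Hypothetical sample entries with the same id and the same air value
a = {'id': '1', 'source': 'store', 'air': 1}
b = {'id': '1', 'source': 'donation', 'air': 1}

def rank(ball):
    # Tuples compare left to right: air first, then the "is store" flag
    return (ball['air'], ball['source'] == 'store')

print(rank(a) > rank(b))       # (1, True) > (1, False)
print(True > False)            # bool is a subclass of int, so this is 1 > 0

# The same key works with max(), regardless of the input order
best = max([b, a], key=rank)
print(best['source'])
```

Because air comes first in the tuple, an entry with a higher air value wins even when its source is not 'store'.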


It also runs faster than the groupby and defaultdict solutions:

import random
from collections import defaultdict
from itertools import groupby

list_balls = []
for _ in range(10000000):
    list_balls.append(
        {
            'source': random.choice(['store', 'donation']),
            'id': random.randint(0,10000),
            'air': random.randint(0,4)
        }
    )

def vanilla_filter_list(list_balls):
    result = {}

    for e in list_balls:
        if e['id'] not in result or (
              (e['air'], e['source'] == 'store') > 
              (result[e['id']]['air'], result[e['id']]['source'] =='store')
            ):
            result[e['id']] = e

    return list(result.values())

def groupby_filter_list(list_balls):
    return [max(list(v), 
                key=lambda x: (x["air"], x["source"] == "store")) for k, v in groupby(
        sorted(list_balls, key=lambda x: x["id"]),
        lambda x: x["id"])]

def collections_filter_list(list_balls):
    d = defaultdict(list)
    for ball in list_balls:
        d[ball["id"]].append(ball)

    return [
        max(group, key=lambda x: (x["air"], x["source"] == "store")) for group in d.values()
    ]

%%time
vanilla_filter_list(list_balls) # 5.52s

%%time
groupby_filter_list(list_balls) #14.3s

%%time
collections_filter_list(list_balls) #8.41s
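
The `%%time` lines above are IPython/Jupyter cell magics and will not run in a plain Python script. Outside a notebook, something like the following sketch with `timeit` could be used instead. The much smaller input size, the fixed seed, the function name `pick_best`, and the repeat count are assumptions chosen for illustration, so the timings will not match the ones quoted above.

```python
import random
import timeit

random.seed(0)  # assumption: fixed seed so repeated runs use the same data

# Far smaller input than the 10,000,000 entries used above (assumption)
balls = [
    {
        'source': random.choice(['store', 'donation']),
        'id': random.randint(0, 100),
        'air': random.randint(0, 4),
    }
    for _ in range(10000)
]

def pick_best(items):
    # Same single-pass dictionary approach as vanilla_filter_list
    result = {}
    for e in items:
        key = (e['air'], e['source'] == 'store')
        current = result.get(e['id'])
        if current is None or key > (current['air'], current['source'] == 'store'):
            result[e['id']] = e
    return list(result.values())

# Time 5 calls of the function on the same input
elapsed = timeit.timeit(lambda: pick_best(balls), number=5)
print(f"5 runs took {elapsed:.4f}s")
```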

Answer 2 (score: 0)

Try this:

all_id = set(i['id'] for i in list_balls)
new_list_ballls = []
for id_ in all_id:
    max_air = max(i['air'] for i in list_balls if i['id']==id_)
    max_air_count = sum(1 for i in list_balls if i['air']==max_air and i['id']==id_)
    if max_air_count==1:
        for i in list_balls:
            if i['id']==id_ and i['air']==max_air:
                new_list_ballls.append(i)
    else:
        for i in list_balls:
            if i['id']==id_ and i['air']==max_air and i['source'] == 'store':
                new_list_ballls.append(i)

Output

[{'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75}, 
{'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}]

Answer 3 (score: 0)

Here you go:

from collections import defaultdict

list_balls = [{'id': '803371', 'is_used': False, 'source': 'store', 'air': 0.9},
              {'id': '803371', 'is_used': False, 'source': 'donation', 'air': 0.20},
              {'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75},
              {'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}]

grouped_data = defaultdict(list)

for entry in list_balls:
    grouped_data[entry['id']].append(entry)

final_list = []

for k, v in grouped_data.items():
    if len(v) == 1:
        final_list.append(v[0])
    else:
        # sort by air
        x = sorted(v, key=lambda k1: k1['air'], reverse=True)
        if x[0]['air'] != x[1]['air']:
            final_list.append(x[0])
        else:
            # decide by source
            if x[0]['source'] == 'store':
                final_list.append(x[0])
            elif x[1]['source'] == 'store':
                final_list.append(x[1])

for entry in final_list:
    print(entry)

Output

{'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}
{'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75}

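Answer 4 (score: 0)

I would first group the entries by id with a defaultdict, then take the dictionary with the largest air from each group. If there is a tie on air within the same id, source == 'store' is used as a secondary key for max().

Demo: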

from collections import defaultdict

list_balls = [
    {"id": "803371", "is_used": False, "source": "store", "air": 0.9},
    {"id": "803371", "is_used": False, "source": "donation", "air": 0.20},
    {"id": "30042", "is_used": False, "source": "donation", "air": 0.75},
    {"id": "803371", "is_used": False, "source": "store", "air": 1},
    {"id": "803371", "is_used": False, "source": "donation", "air": 1},
]

d = defaultdict(list)
for ball in list_balls:
    d[ball["id"]].append(ball)

result = [
    max(group, key=lambda x: (x["air"], x["source"] == "store")) for group in d.values()
]

print(result)

Output:

[{'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}, {'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75}]

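Answer 5 (score: 0)

Nothing fancy, pretty much plain Python.
Sort the list of dictionaries by id, then by the negative of air so that the largest value comes first, then by source so that entries with store come first. After that, pick the first entry from each group of dictionaries that share an id.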

import pprint

list_balls = [
  {'id': '803371', 'is_used': False, 'source': 'store', 'air': 0.9},
  {'id': '803371', 'is_used': False, 'source': 'donation', 'air': 0.20},
  {'id': '30042', 'is_used': False, 'source': 'donation', 'air': 0.75},
  {'id': '803371', 'is_used': False, 'source': 'store', 'air': 1}
]
list_balls.sort(key=lambda k: (k['id'], -k['air'], 0 if k['source'] == 'store' else 1))
pprint.pprint([d for i, d in enumerate(list_balls) if i == 0 or list_balls[i - 1]['id'] != d['id']])

Output:

[{'air': 0.75, 'id': '30042', 'is_used': False, 'source': 'donation'},
 {'air': 1, 'id': '803371', 'is_used': False, 'source': 'store'}]