I have an RDD of key-value pairs in which each value is a list of (key, value) tuples:
rdd = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95), ('536368', 4.95), ('536369', 5.95)])]
I need to sum the values for each key within each record's list. I tried the following, but it fails because mapValues cannot be called on a list.
newRDD = rdd.groupByKey().map(lambda x : (x[0],list(x[1].mapValues(sum))))
My expected result is:
[('12583', ('536370', 11.25)),
('17850', ('536365', 8.69)),
('13047', ('536367', 3.79), ('536368', 9.9), ('536369', 5.95))]
Answer 0 (score: 2)
You can use collections.defaultdict.
Define a function that aggregates the list:
def agg_list(lst):
    from collections import defaultdict
    agg = defaultdict(lambda : 0)
    for k, v in lst:
        agg[k] += v
    return list(agg.items())
Then map it over the RDD:
rdd.map(lambda x: [x[0]] + agg_list(x[1])).collect()
# [['12583', ('536370', 11.25)],
# ['17850', ('536365', 8.69)],
# ['13047', ('536367', 3.79), ('536369', 5.95), ('536368', 9.9)]]
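To check the aggregation logic without a Spark cluster, here is a minimal plain-Python sketch that applies the same function to the sample data from the question. The name `records` and the list comprehension are illustrative stand-ins, not part of the original answer. On a real RDD, `rdd.mapValues(agg_list)` should give the same (key, list) shape, assuming the data was first loaded with something like `sc.parallelize(...)`.

```python
from collections import defaultdict


def agg_list(pairs):
    """Sum the values of (key, value) pairs that share the same key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


# Sample data copied from the question; `records` is a hypothetical name
# for the plain Python list that would be passed to sc.parallelize().
records = [
    ('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
    ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
    ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95),
               ('536368', 4.95), ('536369', 5.95)]),
]

# Plain-Python stand-in for rdd.mapValues(agg_list).collect():
# the outer key is kept and only the inner list is aggregated.
result = [(key, agg_list(pairs)) for key, pairs in records]

for row in result:
    print(row)
```

Unlike `[x[0]] + agg_list(x[1])`, this variant keeps each record as a two-element (key, aggregated list) pair, which may be easier to process further. The sums are floating-point, so printed values can differ from the rounded figures in the last decimal places.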