How to sum the values of key-value pairs in a list in PySpark

Date: 2017-09-20 13:37:31

Tags: apache-spark pyspark

I have an rdd whose values are lists of key-value pairs:

rdd = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]), 
       ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]), 
       ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95), ('536368', 4.95), ('536369', 5.95)])]
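
For reference, a minimal sketch of how this data could be materialized as an actual RDD, assuming an already running SparkContext named sc (the sc setup is not part of the question):

data = [('12583', [('536370', 3.75), ('536370', 3.75), ('536370', 3.75)]),
        ('17850', [('536365', 2.55), ('536365', 3.39), ('536365', 2.75)]),
        ('13047', [('536367', 1.69), ('536367', 2.1), ('536368', 4.95), ('536368', 4.95), ('536369', 5.95)])]
rdd = sc.parallelize(data)   # distribute the local list as a pair RDD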

I need to sum the values for each key inside each record's list. I tried the following, but it does not work, because mapValues cannot be applied to a list:

newRDD = rdd.groupByKey().map(lambda x : (x[0],list(x[1].mapValues(sum))))
# AttributeError: x[1] is a plain Python iterable here, not an RDD,
# so it has no mapValues method.

My expected result is:

[('12583', ('536370', 11.25)), 
 ('17850', ('536365', 8.69)), 
 ('13047', ('536367', 3.79), ('536368', 9.9), ('536369', 5.95))]

1 Answer:

Answer 0 (score: 2):

You can define a list aggregation function using collections.defaultdict:

def agg_list(lst):
    from collections import defaultdict
    agg = defaultdict(lambda: 0)    # unseen keys start at 0
    for k, v in lst:
        agg[k] += v                 # accumulate the total per key
    return list(agg.items())        # back to a list of (key, sum) pairs
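
As a quick sanity check, the function can be exercised on a single record's list in plain Python, without Spark (the numbers below come from the '17850' record in the question):

agg_list([('536365', 2.55), ('536365', 3.39), ('536365', 2.75)])
# [('536365', 8.69)]   (up to floating-point rounding)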

Then map it over the rdd:

rdd.map(lambda x: [x[0]] + agg_list(x[1])).collect()
# [['12583', ('536370', 11.25)], 
#  ['17850', ('536365', 8.69)], 
#  ['13047', ('536367', 3.79), ('536369', 5.95), ('536368', 9.9)]]
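
Since rdd is already a pair RDD, the same helper can also be applied with mapValues, which keeps each aggregated list paired with its key (same agg_list as above):

rdd.mapValues(agg_list).collect()
# [('12583', [('536370', 11.25)]),
#  ('17850', [('536365', 8.69)]),
#  ('13047', [('536367', 3.79), ('536369', 5.95), ('536368', 9.9)])]
# (the order of pairs inside each list may vary)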