Question

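I am completely new to Spark and to the big data domain. I have some code that defines a function which splits each line of a CSV file and returns two fields.

Then there is a map call, which I understand, but I am confused by the next part of the code (the operations on the totalsByAge variable), where mapValues and reduceByKey are applied. Please help me understand how reduceByKey and mapValues work here.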

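from pyspark import SparkConf, SparkContext

# Context setup is not shown in the original post; this local setup is an assumption.
sparkCont = SparkContext(conf=SparkConf().setMaster("local").setAppName("FriendsByAge"))

def parseLine(line):
    # Each CSV line becomes an (age, numFriends) key/value pair.
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

line = sparkCont.textFile("D:\\ResearchInMotion\\ml-100k\\fakefriends.csv")
rdd = line.map(parseLine)
# Pair each numFriends with a count of 1, then sum both components per age.
totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# Divide the summed numFriends by the count to get the average per age.
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
results = averagesByAge.collect()
for result in results:
    print(result)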

I need help with the processing done on the totalsByAge variable. It would also be great if you could explain the operation performed on averagesByAge.

Answer 1

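After the line rdd = line.map(parseLine), you have key/value pairs of the form (age, numFriends), such as (a_1, n_1), (a_2, n_2), ..., (a_m, n_m). rdd.mapValues(lambda x: (x, 1)) leaves each key untouched and transforms only the value, so you get (a_1, (n_1, 1)), (a_2, (n_2, 1)), ..., (a_m, (n_m, 1)).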
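To make the mapValues step concrete, here is a minimal plain-Python sketch that imitates it without Spark. The helper map_values and the sample records are hypothetical and only stand in for the RDD and the CSV data; they are not part of the PySpark API.

```python
# Plain-Python stand-in for RDD.mapValues: apply f to each value, keep the key as-is.
def map_values(pairs, f):
    return [(key, f(value)) for key, value in pairs]

# Hypothetical (age, numFriends) records, shaped like the output of parseLine.
records = [(33, 385), (26, 2), (33, 100), (26, 10)]

# Same lambda as in the question: attach a count of 1 to every numFriends.
with_counts = map_values(records, lambda x: (x, 1))
print(with_counts)
# [(33, (385, 1)), (26, (2, 1)), (33, (100, 1)), (26, (10, 1))]
```

The keys (ages) pass through unchanged; only the values are rewritten.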

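Conceptually, reduceByKey first groups by key, meaning that all records with the same age are put into one group, giving something like (a_i, iterator over the pairs (n_j, 1) whose records all have age a_i), and then applies the reduce function within each group. Here the reduce function adds the numFriends values together and adds the 1s together for each age; the sum of the 1s is the number of items in the group.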
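The grouping and reduction can be imitated in plain Python as follows. This is only a sketch of the semantics: reduce_by_key is a hypothetical helper, and real Spark merges values per partition and in no guaranteed order, which is why the reduce function should be associative and commutative.

```python
# Plain-Python stand-in for RDD.reduceByKey: merge all values that share a key
# by repeatedly applying func(accumulated_value, next_value).
def reduce_by_key(pairs, func):
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return list(merged.items())

# Hypothetical output of the mapValues step: (age, (numFriends, 1)).
with_counts = [(33, (385, 1)), (26, (2, 1)), (33, (100, 1)), (26, (10, 1))]

# Same lambda as in the question: add the numFriends and add the counts.
totals_by_age = reduce_by_key(with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(totals_by_age)
# [(33, (485, 2)), (26, (12, 2))]
```

In the lambda, x and y are both (numFriends, count) pairs for the same age, never keys.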

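So after reduceByKey we have (a_i, (sum of all numFriends in the group, number of items in the group)). In other words, the first element of the outer pair is the age, and the second element is an inner pair whose first element is the sum of all numFriends and whose second element is the item count. Therefore totalsByAge.mapValues(lambda x: x[0] / x[1]) gives the average numFriends for each age.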
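The final averaging step can be sketched the same way in plain Python. The totals below are hypothetical sample values, not results from fakefriends.csv.

```python
# Hypothetical output of the reduceByKey step: (age, (sum of numFriends, count)).
totals_by_age = [(33, (485, 2)), (26, (12, 2))]

# Equivalent of mapValues(lambda x: x[0] / x[1]): divide the sum by the count.
averages_by_age = [(age, total / count) for age, (total, count) in totals_by_age]

for result in averages_by_age:
    print(result)
# (33, 242.5)
# (26, 6.0)
```

Note that under Python 2 the `/` operator on two ints performs integer division, so the averages would be truncated there; under Python 3 the result is a float.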
