Question

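I have two RDDs (in PySpark) of the form rdd1 = (id1, value1) and rdd2 = (id2, value2), where the ids are unique (that is, no id1 is equal to any id2).

I have a third RDD of the form resultRDD = ((id1, id2), value3). I want to filter it so that only the elements with value3 > (value1 + value2) are kept.

When I access rdd1 and rdd2 while doing so, I get the following exception: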

pickle.PicklingError: Could not serialize object: Exception: It appears that you
are attempting to broadcast an RDD or reference an RDD from an action or
transformation. RDD transformations and actions can only be invoked by the
driver, not inside of other transformations; for example,
rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values
transformation and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063.

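So what is the best strategy for accessing rdd1 and rdd2 in order to filter resultRDD?

solution1:

Broadcasting rdd1 and rdd2 works, but I don't think it is an optimal solution because rdd1 and rdd2 are large.

solution2:

Instead of broadcasting rdd1 and rdd2, I can collect them and then do the filtering. What would be an efficient solution in my case?

My function is as follows: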

def filterResultRDD(resultRDD, rdd1, rdd2):


    source = rdd1.collect()
    target = rdd2.collect()
    f = resultRDD.filter(lambda t: t[1] >= getElement(source, t[0][0])+ getElement(target, t[0][1])).cache()
    return f

def getElement(mydata, key):
    return [item[1] for item in mydata if item[0] == key][0]

Answer 1

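First, about your suggested solutions:
solution2:
You should never collect an RDD.
If you collect an RDD, either your solution will not scale or you did not need an RDD in the first place.
solution1:
The same reasoning as for solution2 applies. There are some exceptions, but your case is not one of them.

As mentioned, the "Spark" way to do this is to use join.
And of course there is no need to convert to a Spark DataFrame.

Here is a solution: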
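
For contrast, here is a rough sketch of what such an exception might look like. The answer does not name one, so this is an assumption: the lookup data is small enough to fit comfortably in memory on the driver and on every executor. In that situation the pair RDDs could be collected as dicts and broadcast. The names passes_threshold and filter_with_broadcast are hypothetical, and sc is assumed to be an existing SparkContext.

```python
def passes_threshold(record, values1, values2):
    """Return True when value3 > value1 + value2 for one ((id1, id2), value3) record."""
    (id1, id2), value3 = record
    # Ids missing from either lookup are dropped, mirroring an inner join.
    if id1 not in values1 or id2 not in values2:
        return False
    return value3 > values1[id1] + values2[id2]


def filter_with_broadcast(sc, result_rdd, rdd1, rdd2):
    """Filter result_rdd against two SMALL pair RDDs shipped to the executors as dicts."""
    # collectAsMap() pulls a whole pair RDD onto the driver, so this only suits small data.
    lookup1 = sc.broadcast(rdd1.collectAsMap())
    lookup2 = sc.broadcast(rdd2.collectAsMap())
    return result_rdd.filter(
        lambda record: passes_threshold(record, lookup1.value, lookup2.value)
    )
```

Dict lookups also avoid the linear list scan that getElement performs for every record. With large rdd1 and rdd2, as in the question, this approach does not apply and a join is the better fit.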

# Assumes an existing SparkContext named sc (for example, the one in the pyspark shell).
rdd1 = sc.parallelize([('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)])
rdd2 = sc.parallelize([('aa', 1), ('bb', 2), ('cc', 3), ('dd', 4), ('ee', 5)])
rdd3 = sc.parallelize([(('a', 'aa'), 1), (('b', 'dd'), 8), (('e', 'aa'), 34), (('c', 'ab'), 23)])

# Re-key by id1 and join rdd1, re-key by id2 and join rdd2,
# keep rows where value3 > value1 + value2, then restore ((id1, id2), value3).
print(rdd3.map(lambda x: (x[0][0], (x[0][1], x[1])))
      .join(rdd1)
      .map(lambda x: (x[1][0][0], (x[0], x[1][0][1], x[1][1])))
      .join(rdd2)
      .filter(lambda x: x[1][0][1] > x[1][0][2] + x[1][1])
      .map(lambda x: ((x[1][0][0], x[0]), x[1][0][1]))
      .collect())

--> [(('b', 'dd'), 8), (('e', 'aa'), 34)]
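
The nested indexes in those lambdas are hard to read, so here is a sketch of the same join pipeline with each step written as a named function that unpacks the record shape. The helper names (key_by_id1, key_by_id2, exceeds_sum, restore_shape, filter_result_rdd) are made up for illustration and are not part of PySpark. The three arguments are assumed to be pair RDDs shaped as in the question.

```python
def key_by_id1(record):
    # ((id1, id2), value3) -> (id1, (id2, value3)), ready to join with rdd1
    (id1, id2), value3 = record
    return id1, (id2, value3)


def key_by_id2(record):
    # (id1, ((id2, value3), value1)) -> (id2, (id1, value3, value1)), ready to join with rdd2
    id1, ((id2, value3), value1) = record
    return id2, (id1, value3, value1)


def exceeds_sum(record):
    # (id2, ((id1, value3, value1), value2)) -> keep when value3 > value1 + value2
    _id2, ((_id1, value3, value1), value2) = record
    return value3 > value1 + value2


def restore_shape(record):
    # (id2, ((id1, value3, value1), value2)) -> ((id1, id2), value3)
    id2, ((id1, value3, _value1), _value2) = record
    return (id1, id2), value3


def filter_result_rdd(result_rdd, rdd1, rdd2):
    """Keep records of result_rdd where value3 > value1 + value2, using two joins."""
    return (result_rdd.map(key_by_id1)
            .join(rdd1)
            .map(key_by_id2)
            .join(rdd2)
            .filter(exceeds_sum)
            .map(restore_shape))
```

Calling filter_result_rdd(rdd3, rdd1, rdd2).collect() on the sample data should give the same two records, though the order of a collected join result is not guaranteed. Because join is an inner join, a record whose id1 or id2 is missing from rdd1 or rdd2 (such as (('c', 'ab'), 23)) is dropped.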
