Question

Nested parallelization?

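Suppose I am trying to do the equivalent of a "nested for loop" in Spark. As in a regular language, say I have a routine in the inner loop that estimates Pi the way the Pi Average Spark example does (see Estimating Pi).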

int iLimit = 1000;
int jLimit = 1000000;  // 10^6
double counter = 0.0;

for (int i = 0; i < iLimit; i++)
    for (int j = 0; j < jLimit; j++)
        counter += PiEstimator();

double estimateOfAllAverages = counter / iLimit;

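Can I nest parallelize calls in Spark? I am trying, and I have not worked out the kinks yet. I would be happy to post errors and code, but I think I am asking a more conceptual question about whether this is the right approach in Spark.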

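I can already parallelize a single Spark example / Pi estimate. Now I want to do that 1000 times to see whether it converges on Pi. (This relates to a larger problem we are trying to solve; if something closer to an MVCE is needed, I would be happy to add it.)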

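Bottom-line question that I just need someone to answer directly: is using nested parallelize calls the right approach? If not, please advise something specific, thanks! Here is a pseudocode version of what I think the right approach is: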

// use accumulator to keep track of each Pi Estimate result

sparkContext.parallelize(arrayOf1000, slices).map { // function call

    sparkContext.parallelize(arrayOf10^6, slices).map {
        // do the 10^6 thing here and update accumulator with each result
    }
}

// take average of accumulator to see if all 1000 Pi estimates converge on Pi

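Background: I had asked this question and got a general answer, but it did not lead to a solution. After some waffling, I decided to post a new question with a different characterization. I also tried to ask this on the Spark User mailing list, but no dice there either. Thanks in advance for any help.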

Answer 1

This is not even possible, because SparkContext is not serializable. If you want a nested for loop, your best option is to use cartesian:

val nestedForRDD = rdd1.cartesian(rdd2)
nestedForRDD.map { case (rdd1TypeVal, rdd2TypeVal) =>
  // Do your inner-nested evaluation code here
}

Bear in mind that, just like a double for loop, this comes at a size cost: the result holds one element for every pair, so rdd1.count * rdd2.count elements in total.
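To make the cartesian idea concrete, here is a minimal sketch that emulates a small double for loop (summing i * j over all pairs). The object name CartesianSketch, the helper sumOfProducts, and the local[*] master are assumptions made for illustration; they do not come from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianSketch {
  // Emulates: for (i <- outer) for (j <- inner) total += i * j
  // (hypothetical helper, not part of the original answer)
  def sumOfProducts(sc: SparkContext, outer: Seq[Int], inner: Seq[Int]): Long = {
    // Every (i, j) pair becomes one element of the resulting RDD.
    val pairs = sc.parallelize(outer).cartesian(sc.parallelize(inner))
    // Pattern-match on the tuple, then add up the per-pair results.
    pairs.map { case (i, j) => i.toLong * j }.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all cores; change the master for a real cluster.
    val conf = new SparkConf().setAppName("CartesianSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sumOfProducts(sc, 1 to 3, 1 to 4)) // (1+2+3) * (1+2+3+4) = 60
    } finally {
      sc.stop()
    }
  }
}
```

With the sizes from the question (1000 by 10^6) the pair RDD would have 10^9 elements, so this shape is probably only practical when the product of the two sizes is modest.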

Answer 2

No. You can't.

SparkContext is only accessible from the Spark driver node. The inner parallelize() calls would try to use SparkContext from the worker nodes, which do not have access to SparkContext.
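One way to get the 1000 by 10^6 experiment from the question without touching SparkContext on the workers is to parallelize only the outer loop and run the inner loop as ordinary local code inside each task. The following is a sketch under stated assumptions, not a definitive implementation: the object name NestedPiSketch, the helper estimatePi, the per-trial seeding, and the local[*] master are all hypothetical choices made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object NestedPiSketch {
  // Plain local Monte Carlo estimate of Pi; runs entirely inside one task.
  // (hypothetical helper; the seed only makes each trial reproducible)
  def estimatePi(samples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    var inside = 0
    var k = 0
    while (k < samples) {
      // Draw a point in the square [-1, 1] x [-1, 1].
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      if (x * x + y * y <= 1) inside += 1
      k += 1
    }
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all cores; change the master for a real cluster.
    val conf = new SparkConf().setAppName("NestedPiSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val trials = 1000             // outer loop, distributed by Spark
      val samplesPerTrial = 1000000 // inner loop, local to each task
      // Only the driver calls parallelize; the closure uses no SparkContext.
      val estimates = sc.parallelize(1 to trials)
        .map(trial => estimatePi(samplesPerTrial, trial.toLong))
      println(s"mean of $trials estimates: ${estimates.mean()}")
    } finally {
      sc.stop()
    }
  }
}
```

Here the results come back as an RDD of estimates and are averaged with mean(), so no accumulator is needed. Using the trial index as the seed is a simplification; a real experiment would likely want a more careful seeding scheme.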

Nested parallelization in Spark? What is the right approach?
