Question

TL;DR

"What is the best way to join thousands of Spark dataframes? Can this join be parallelized? Neither of the approaches I tried works for me."

I am trying to join thousands of single-column dataframes (joined on a PK column) and then persist the resulting DF to Snowflake.

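On a standalone 32-core / 350 GB Spark cluster, joining about 400 such dataframes (5M x 2) in a loop takes more than 3 hours to complete. I assumed the number of joins should not matter much because of pushdown. After all, shouldn't Spark build a DAG for lazy evaluation?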

Here is my Spark config:

spark = SparkSession \
    .builder \
    .appName("JoinTest")\
    .config("spark.master","spark://localhost:7077")\
    .config("spark.ui.port", 8050)\
    .config("spark.jars", "../drivers/spark-snowflake_2.11-2.5.2-spark_2.4.jar,../drivers/snowflake-jdbc-3.9.1.jar")\
    .config("spark.driver.memory", "100g")\
    .config("spark.driver.maxResultSize", 0)\
    .config("spark.executor.memory", "64g")\
    .config("spark.executor.instances", "6")\
    .config("spark.executor.cores","4") \
    .config("spark.cores.max", "32")\
    .getOrCreate()

And the join loop:

def combine_spark_results(results, joinKey):
    # Extract first to get going
    # TODO: validations
    resultsDF = results[0]
    i = len(results)

    print("Joining Spark DFs..")
    for result in results[1:]:
        print(i, end=" ", flush=True)
        i -= 1
        resultsDF = resultsDF.join(result, joinKey, 'outer')

    return resultsDF

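I thought about parallelizing the join in merge-sort fashion using starmap_async(), but the problem is that a Spark DF cannot be returned from another thread. I also considered broadcasting the main dataframe from which all the joinable single-column dataframes were created,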

spark.sparkContext.broadcast(data)

but that raises the same error as trying to return a joined DF from another thread, namely:

PicklingError: Could not serialize broadcast: Py4JError: An error occurred while calling o3897.__getstate__. Trace: py4j.Py4JException: Method __getstate__([]) does not exist
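
For reference, here is a minimal sketch of the merge-sort style pairing described above, done entirely on the driver with no threads or pickling involved. The names tree_combine and outer_join are hypothetical, and plain dicts stand in for the dataframes so the snippet runs without a cluster. Whether a balanced join tree is actually faster than the sequential loop on this workload is an assumption that would need measuring.

```python
def tree_combine(items, combine):
    """Combine items pairwise, level by level, like the merge phase of merge sort.

    Produces a balanced tree of combine() calls (depth about log2(n))
    instead of the left-deep chain (depth n - 1) built by a simple loop.
    """
    level = list(items)
    if not level:
        raise ValueError("need at least one item to combine")
    while len(level) > 1:
        next_level = []
        # Pair up neighbours: (0, 1), (2, 3), ...
        for i in range(0, len(level) - 1, 2):
            next_level.append(combine(level[i], level[i + 1]))
        # With an odd count, the last item is carried over unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    return level[0]


def outer_join(left, right):
    """Hypothetical stand-in for a full outer join on a key.

    Each "frame" is a dict mapping key -> {column: value}.
    """
    keys = left.keys() | right.keys()
    return {k: {**left.get(k, {}), **right.get(k, {})} for k in keys}


# Shape of the resulting call tree for four inputs.
shape = tree_combine(["a", "b", "c", "d"], lambda x, y: f"({x}+{y})")

# Three tiny single-column "frames" keyed by PK.
frames = [
    {1: {"a": 10}, 2: {"a": 20}},
    {2: {"b": 5}},
    {3: {"c": 7}},
]
combined = tree_combine(frames, outer_join)
```

With real Spark dataframes the combine argument would be something like lambda a, b: a.join(b, joinKey, 'outer'). Because join is a lazy transformation, calling it in this pattern only builds the plan on the driver, so no DF has to cross a thread or process boundary.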

How can I solve this?

Please feel free to ask if you need any more information. Thanks in advance.

How to efficiently join hundreds of Spark dataframes?

0 answers: