Question

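I am trying to improve the performance of a join between two large tables using Spark SQL. From various sources, I gather that the RDDs need to be partitioned.

Source: https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by

However, when you load files directly from Parquet, as shown below, I am not sure how to turn them into a pair RDD!

With Spark 2.0.1, using "cluster by" has no effect.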

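Whether or not I use "cluster by" on the key, I still see the same query plan generated by Spark. How can I create a pair RDD in Spark SQL so that the join can take advantage of partitioned tables?

Without proper partitioning, a lot of shuffling takes place, which causes long delays.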
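
For reference, a minimal sketch of how the two plans can be compared. The local session, the toy data and the view names `t1`, `t2`, `c1`, `c2` are assumptions for illustration only, not our real setup; broadcast joins are switched off so that the small sample tables are planned like large ones.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session with toy data stands in for the real cluster and Parquet files.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cluster-by-plan-check")
  .getOrCreate()
import spark.implicits._

// Disable broadcast joins so the small sample tables are planned like large ones.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical sample tables that share the join column `key`.
Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "v1").createOrReplaceTempView("t1")
Seq((1, "x"), (2, "y"), (4, "z")).toDF("key", "v2").createOrReplaceTempView("t2")

// Join directly on the raw views.
val plain = spark.sql(
  "select a.key, a.v1, b.v2 from t1 a inner join t2 b on a.key = b.key")

// Join on views that were first clustered by the join key.
spark.sql("select * from t1 cluster by key").createOrReplaceTempView("c1")
spark.sql("select * from t2 cluster by key").createOrReplaceTempView("c2")
val clustered = spark.sql(
  "select a.key, a.v1, b.v2 from c1 a inner join c2 b on a.key = b.key")

// Print both physical plans; every Exchange operator in the output is a shuffle.
plain.explain()
clustered.explain()
```

Comparing the Exchange operators in the two printed plans shows whether "cluster by" changed where the shuffles happen.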

Our configuration: 5 worker nodes, 1 executor (5 cores per executor), each with 32 cores and 128 GB of RAM.

// Load the three Parquet inputs and expose them as temporary views.
val rawDf1 = spark.read.parquet("file in hdfs")
rawDf1.createOrReplaceTempView("rawdf1")

val rawDf2 = spark.read.parquet("file in hdfs")
rawDf2.createOrReplaceTempView("rawdf2")

val rawDf3 = spark.read.parquet("file in hdfs")
rawDf3.createOrReplaceTempView("rawdf3")

// Cluster each table by the join key before joining.
val df1 = spark.sql("select * from rawdf1 cluster by key")
df1.createOrReplaceTempView("df1")

val df2 = spark.sql("select * from rawdf2 cluster by key")
df2.createOrReplaceTempView("df2")

val df3 = spark.sql("select * from rawdf3 cluster by key")
df3.createOrReplaceTempView("df3")

// Join all three clustered views on the same key.
val resultDf = spark.sql("select * from df1 a inner join df2 b on a.key = b.key inner join df3 c on a.key = c.key")

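Adding more information: I am joining multiple tables in the same select, using the same key in all of the tables. So it is not possible to first create DataFrames just to repartition them by the key. I know I can do this with the DataFrame API, but my question is how to achieve the same thing with plain Spark SQL.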
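
For comparison, this is roughly the DataFrame API route I am referring to, as a sketch only. The toy data and the names `d1`, `d2`, `d3` are hypothetical stand-ins for the Parquet-backed tables; `repartition(col("key"))` hash-partitions each DataFrame on the join column before the join.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session with toy data instead of the real Parquet files.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("repartition-before-join")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the three tables, each repartitioned by the join key.
val d1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key", "v1").repartition(col("key"))
val d2 = Seq((1, "x"), (2, "y"), (4, "z")).toDF("key", "v2").repartition(col("key"))
val d3 = Seq((2, "p"), (3, "q"), (4, "r")).toDF("key", "v3").repartition(col("key"))

// Inner join of all three tables on the shared key column.
val joined = d1.join(d2, Seq("key")).join(d3, Seq("key"))

// Inspect the physical plan and the joined rows.
joined.explain()
joined.show()
```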

How do I join large tables efficiently in Spark SQL?

0 answers: