Spark: set maximum partition size when joining

Date: 2018-12-03 13:21:33

Tags: apache-spark

When doing a join in Spark, or shuffle operations more generally, I can set the number of partitions that Spark should use to execute the operation.

As per documentation:

spark.sql.shuffle.partitions (default: 200): Configures the number of partitions to use when shuffling data for joins or aggregations.

If I want to lower the amount of work that has to be done in each task, I would have to estimate the total size of the data and adjust this parameter accordingly (more partitions mean less work done in a single task, but more tasks).
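For context, that manual tuning looks roughly like the sketch below (Scala). It assumes an existing SparkSession named spark and two illustrative DataFrames, ordersDf and customersDf; the size estimate and the per-task target are made-up numbers, not recommendations.

    // Rough sketch of manual shuffle tuning; `spark`, `ordersDf` and
    // `customersDf` are assumed to exist, and the byte figures are
    // illustrative only.
    val estimatedShuffleBytes = 64L * 1024 * 1024 * 1024  // guessed size of the joined data
    val targetBytesPerTask    = 256L * 1024 * 1024         // desired work per shuffle task
    val numPartitions = math.max(1, (estimatedShuffleBytes / targetBytesPerTask).toInt)

    // More partitions -> less work per task, but more tasks overall.
    spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)

    val joined = ordersDf.join(customersDf, Seq("customer_id"))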

I am wondering: can I tell Spark to simply adjust the number of partitions based on the amount of data, i.e. set a maximum partition size for join operations?

An additional question: how does Spark know the total size of the datasets to be processed when it repartitions them into 200 roughly equal partitions?

Thanks in advance!

1 Answer:

Answer 0 (score: 1)

AFAIK, there is no option that targets shuffle partitions at a specific output size, so this tuning is left to you.

In some cases the problem can be handled on the downstream read path instead. Suppose you join data and write the output to HDFS as parquet. You can repartition the query result to 1 (or to a small number of partitions). Think of it as a funnel: do the join with 200 shuffle partitions, then further reduce the parallelism for the aggregated data, which should involve less IO. Say your target is a 256 MB block size; the output will land either around it, below it, or above it. In the first two cases you have essentially hit the target and avoided excessive fragmentation (for HDFS, too many blocks tracked by the namenode).

If the output exceeds the target block size, however, which would clearly affect the execution time of downstream jobs, you can use spark.sql.files.maxPartitionBytes to control the number of partitions used when reading that data back. So even with 2 GB of output, setting this parameter to 128 MB yields 16 partitions on the read path.
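A minimal sketch of that funnel, assuming a SparkSession named spark and hypothetical DataFrames and HDFS paths:

    // The join itself runs with spark.sql.shuffle.partitions tasks (e.g. 200);
    // `bigDf` and `otherDf` are placeholder DataFrames.
    val joined = bigDf.join(otherDf, Seq("id"))

    // Funnel: collapse the result into a few output files so HDFS does not
    // accumulate many small parquet blocks.
    joined.repartition(4)
      .write
      .parquet("hdfs:///data/joined")

    // Read path: cap the bytes packed into a single input partition, so even
    // a 2 GB output is read back as roughly 16 partitions of at most 128 MB.
    spark.conf.set("spark.sql.files.maxPartitionBytes", (128L * 1024 * 1024).toString)
    val downstream = spark.read.parquet("hdfs:///data/joined")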

As for the second question, Spark simply uses a hash partitioner and computes the hash over the join columns. You can of course influence the partitioning with DISTRIBUTE BY.
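For example (table and column names are placeholders), the partitioning can be requested explicitly either on the DataFrame or in SQL:

    import org.apache.spark.sql.functions.col

    // DataFrame equivalent of DISTRIBUTE BY: hash-partition on the join key
    // into an explicit number of partitions.
    val byKey = ordersDf.repartition(200, col("customer_id"))

    // Or in SQL, assuming an `orders` view is registered:
    val distributed = spark.sql("SELECT * FROM orders DISTRIBUTE BY customer_id")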
