When doing a join in Spark, or generally for shuffle operations, I can set the number of partitions in which I want Spark to execute the operation.
As per documentation:
spark.sql.shuffle.partitions (default: 200) — Configures the number of partitions to use when shuffling data for joins or aggregations.
If I want to lower the amount of work done in each task, I would have to estimate the total size of the data and adjust this parameter accordingly (more partitions means less work per task, but more tasks).
I am wondering: can I tell Spark to simply adjust the number of partitions based on the amount of data, i.e. set a maximum partition size for join operations?
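The manual tuning described above can be sketched in a few lines. This is plain Python arithmetic, not a Spark API; the helper name and the 128 MB per-task target are illustrative assumptions:

```python
# A minimal sketch of the manual tuning: estimate total shuffle data size,
# pick a target size per task, and derive the partition count to set via
# spark.sql.shuffle.partitions. Not a Spark API -- just the arithmetic.
import math

def shuffle_partitions(total_bytes: int, target_bytes_per_task: int = 128 * 1024 * 1024) -> int:
    """Return a partition count so each task handles roughly target_bytes_per_task."""
    return max(1, math.ceil(total_bytes / target_bytes_per_task))

# e.g. ~10 GB of shuffle data with a 128 MB per-task target:
n = shuffle_partitions(10 * 1024**3)
# spark.conf.set("spark.sql.shuffle.partitions", n)  # the actual Spark setting
print(n)  # 80
```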
Additional question: how does Spark know the total size of the datasets to be processed when repartitioning into 200 roughly equal partitions?
Thanks in advance!
Answer (score: 1)
AFAIK, there is no such option to target shuffle partitions at a specific output size, so this tuning is left to you...
In some cases, the problem can be addressed on the downstream read path. Say you join data and write the output to HDFS as Parquet. You can repartition the query result down to 1 (or a small number of) partitions. Think of it as a funnel: do the join with 200 partitions for some aggregations, then further reduce the parallelism for the aggregated data (which should involve less IO). Say your target is a 256 MB block size. The output will land around it, below it, or above it. In the first two cases you have basically hit the target and avoided excessive data fragmentation (for HDFS, too many blocks in the namenode).
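The "funnel" step above boils down to choosing an output partition count so each written file lands near the target block size. A hedged sketch, where the helper, the 256 MB target, and the `joined` DataFrame are illustrative assumptions:

```python
# Sketch of the funnel: after the 200-partition join, pick a small number of
# output partitions so each written file is roughly one HDFS block.
import math

TARGET_BLOCK = 256 * 1024 * 1024  # assumed 256 MB target block size

def output_partitions(estimated_output_bytes: int, target: int = TARGET_BLOCK) -> int:
    return max(1, math.ceil(estimated_output_bytes / target))

# e.g. an estimated 1 GB of joined output -> 4 files of ~256 MB each
n = output_partitions(1024**3)
# joined.repartition(n).write.parquet("hdfs://...")  # the actual Spark write
print(n)  # 4
```

`repartition(n)` triggers a full shuffle; `coalesce(n)` avoids one when only reducing the partition count, at the cost of possibly uneven partitions.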
However, if the output exceeds the target block size, which would clearly affect the execution time of downstream jobs, you can use spark.sql.files.maxPartitionBytes to control the number of partitions when reading that data back. So even if you have 2 GB of output, setting this parameter to 128 MB yields 16 partitions on the read path.
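Checking the arithmetic behind that claim (plain Python; the `spark.conf.set` line shows where the real setting would go):

```python
# With spark.sql.files.maxPartitionBytes = 128 MB, a 2 GB output is split
# into 2048 / 128 = 16 read partitions.
max_partition_bytes = 128 * 1024 * 1024   # 128 MB
output_bytes = 2 * 1024**3                # 2 GB of written data
read_partitions = output_bytes // max_partition_bytes
print(read_partitions)  # 16
# spark.conf.set("spark.sql.files.maxPartitionBytes", max_partition_bytes)
```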
Regarding the second question: Spark simply uses a hash partitioner, computing hashes over the join columns. You can of course influence the partitioning with DISTRIBUTE BY.
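The hash-partitioning idea can be illustrated in plain Python. This is a simplification, not Spark's internal hash function: each key maps deterministically to one of N shuffle partitions, so matching keys from both sides of a join land in the same partition.

```python
# Illustration only: map each join key to a shuffle partition by hashing.
# Spark uses its own hash function internally; the principle is the same.
NUM_PARTITIONS = 200

def partition_for(key) -> int:
    return hash(key) % NUM_PARTITIONS

left_keys = ["user_1", "user_2", "user_3"]
right_keys = ["user_2", "user_3", "user_4"]

# A key present on both sides always hashes to the same partition,
# which is what makes the shuffle join work:
assert partition_for("user_2") == partition_for("user_2")
buckets = {k: partition_for(k) for k in set(left_keys + right_keys)}
```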