When doing a join in Spark, or generally for shuffle operations, I can set the number of partitions in which I want Spark to execute the operation.
As per documentation:
spark.sql.shuffle.partitions (default: 200) — Configures the number of partitions to use when shuffling data for joins or aggregations.
If I want to lower the amount of work done in each task, I would have to estimate the total size of the data and adjust this parameter accordingly (more partitions means less work per task, but more tasks).
I am wondering: can I tell Spark to simply adjust the number of partitions based on the amount of data, i.e. set a maximum partition size for join operations?
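The manual tuning described above can be sketched in a few lines. This is plain Python arithmetic, not a Spark API; the helper name and the 128 MB per-task target are illustrative assumptions:

```python
# A minimal sketch of the manual tuning: estimate total shuffle data size,
# pick a target size per task, and derive the partition count to set via
# spark.sql.shuffle.partitions. Not a Spark API -- just the arithmetic.
import math

def shuffle_partitions(total_bytes: int, target_bytes_per_task: int = 128 * 1024 * 1024) -> int:
    """Return a partition count so each task handles roughly target_bytes_per_task."""
    return max(1, math.ceil(total_bytes / target_bytes_per_task))

# e.g. ~10 GB of shuffle data with a 128 MB per-task target:
n = shuffle_partitions(10 * 1024**3)
# spark.conf.set("spark.sql.shuffle.partitions", n)  # the actual Spark setting
print(n)  # 80
```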
Additional question: how does Spark know the total size of the datasets to be processed when repartitioning into 200 roughly equal partitions?
Thanks in advance!
Answer (score: 1)
AFAIK, there is no such option to target shuffle partitions at a specific output size, so this tuning is left to you...
In some cases, the problem can be addressed on the downstream read path. Say you join data and write the output to HDFS as Parquet. You can repartition the query result down to 1 (or a small number of) partitions. Think of it as a funnel: do the join with 200 partitions for some aggregations, then further reduce the parallelism for the aggregated data (which should involve less IO). Say your target is a 256 MB block size. The output will land around it, below it, or above it. In the first two cases you have basically hit the target and avoided excessive data fragmentation (for HDFS, too many blocks in the namenode).
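The "funnel" step above boils down to choosing an output partition count so each written file lands near the target block size. A hedged sketch, where the helper, the 256 MB target, and the `joined` DataFrame are illustrative assumptions:

```python
# Sketch of the funnel: after the 200-partition join, pick a small number of
# output partitions so each written file is roughly one HDFS block.
import math

TARGET_BLOCK = 256 * 1024 * 1024  # assumed 256 MB target block size

def output_partitions(estimated_output_bytes: int, target: int = TARGET_BLOCK) -> int:
    return max(1, math.ceil(estimated_output_bytes / target))

# e.g. an estimated 1 GB of joined output -> 4 files of ~256 MB each
n = output_partitions(1024**3)
# joined.repartition(n).write.parquet("hdfs://...")  # the actual Spark write
print(n)  # 4
```

`repartition(n)` triggers a full shuffle; `coalesce(n)` avoids one when only reducing the partition count, at the cost of possibly uneven partitions.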
However, if the output exceeds the target block size, which would clearly affect the execution time of downstream jobs, you can use spark.sql.files.maxPartitionBytes to control the number of partitions when reading that data back. So even if you have 2 GB of output, setting this parameter to 128 MB yields 16 partitions on the read path.
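Checking the arithmetic behind that claim (plain Python; the `spark.conf.set` line shows where the real setting would go):

```python
# With spark.sql.files.maxPartitionBytes = 128 MB, a 2 GB output is split
# into 2048 / 128 = 16 read partitions.
max_partition_bytes = 128 * 1024 * 1024   # 128 MB
output_bytes = 2 * 1024**3                # 2 GB of written data
read_partitions = output_bytes // max_partition_bytes
print(read_partitions)  # 16
# spark.conf.set("spark.sql.files.maxPartitionBytes", max_partition_bytes)
```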
Regarding the second question: Spark simply uses a hash partitioner, computing hashes over the join columns. You can of course influence the partitioning with DISTRIBUTE BY.
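The hash-partitioning idea can be illustrated in plain Python. This is a simplification, not Spark's internal hash function: each key maps deterministically to one of N shuffle partitions, so matching keys from both sides of a join land in the same partition.

```python
# Illustration only: map each join key to a shuffle partition by hashing.
# Spark uses its own hash function internally; the principle is the same.
NUM_PARTITIONS = 200

def partition_for(key) -> int:
    return hash(key) % NUM_PARTITIONS

left_keys = ["user_1", "user_2", "user_3"]
right_keys = ["user_2", "user_3", "user_4"]

# A key present on both sides always hashes to the same partition,
# which is what makes the shuffle join work:
assert partition_for("user_2") == partition_for("user_2")
buckets = {k: partition_for(k) for k in set(left_keys + right_keys)}
```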