How to use Spark's repartitionAndSortWithinPartitions?

Date: 2016-05-14 13:31:20

Tags: scala apache-spark

I'm trying to build a minimal working example of repartitionAndSortWithinPartitions in order to understand what it does. Here is what I have so far (it doesn't work; the distinct throws the values around so they end up out of order):

// tag each element with the id of the partition it currently lives in
def partval(partID: Int, iter: Iterator[Int]): Iterator[Tuple2[Int, Int]] = {
  iter.map( x => (partID, x)).toList.iterator
}

val part20to3_chaos = sc.parallelize(1 to 20, 3).distinct
val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(2)
part20to2_sorted.mapPartitionsWithIndex(partval).collect

but I get the error

Name: Compile Error
Message: <console>:22: error: value repartitionAndSortWithinPartitions is not a member of org.apache.spark.rdd.RDD[Int]
             val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(2)

I tried using the scaladoc, but couldn't find which class provides repartitionAndSortWithinPartitions. (Btw: this scaladoc is not impressive: why is MapPartitionsRDD missing? How do I search for a method?)

Realising I need a partitioner object, I next tried

val rangePartitioner = new org.apache.spark.RangePartitioner(2, part20to3_chaos)
val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(rangePartitioner)
part20to2_sorted.mapPartitionsWithIndex(partval).collect

but got

Name: Compile Error
Message: <console>:22: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[Int]
 required: org.apache.spark.rdd.RDD[_ <: Product2[?,?]]
Error occurred in an application involving default arguments.
         val rPartitioner = new org.apache.spark.RangePartitioner(2, part20to3_chaos)

How do I get this to compile? Could I get a working example, please?

2 Answers:

Answer 0 (score: 8)

Your problem is that part20to3_chaos is an RDD[Int], while OrderedRDDFunctions.repartitionAndSortWithinPartitions is a method that operates on an RDD[(K, V)], where K is the key and V is the value.

repartitionAndSortWithinPartitions will first repartition the data according to the provided partitioner, and then sort by the keys:

/**
 * Repartition the RDD according to the given partitioner and,
 * within each resulting partition, sort records by their keys.
 *
 * This is more efficient than calling `repartition` and then sorting within each partition
 * because it can push the sorting down into the shuffle machinery.
 */
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}

So it looks like it isn't exactly what you're looking for.

If you want a plain old sort, you can use sortBy, since it doesn't require a key:

scala> val toTwenty = sc.parallelize(1 to 20, 3).distinct
toTwenty: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at distinct at <console>:33

scala> val sorted = toTwenty.sortBy(identity, true, 3).collect
sorted: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

Here you pass sortBy the ordering (ascending or descending) and the number of partitions you want to create.
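
If the goal is still to see repartitionAndSortWithinPartitions itself in action, a minimal sketch (not from the original answer; it assumes a Spark shell with sc available, and the variable names are illustrative) is to key the RDD[Int] first, e.g. with keyBy(identity), so that both RangePartitioner and repartitionAndSortWithinPartitions operate on an RDD[(Int, Int)]:

// Sketch only: key the elements so OrderedRDDFunctions becomes available.
val part20to3_keyed = sc.parallelize(1 to 20, 3).distinct.keyBy(identity)   // RDD[(Int, Int)]

val rangePartitioner = new org.apache.spark.RangePartitioner(2, part20to3_keyed)
val part20to2_sorted = part20to3_keyed.repartitionAndSortWithinPartitions(rangePartitioner)

// Each partition now holds a contiguous, key-sorted range, roughly 1..10 and 11..20
// (the exact cut point depends on RangePartitioner's sampling).
part20to2_sorted.mapPartitionsWithIndex((partID, iter) => iter.map { case (k, _) => (partID, k) }).collect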

Answer 1 (score: 2)

Let me try to explain repartitionAndSortWithinPartitions using pyspark.

Suppose you have a dataset of pairs:

pairs  = sc.parallelize([["a",1], ["b",2], ["c",3], ["d",3]])

pairs.collect() 
# Output [['a', 1], ['b', 2], ['c', 3], ['d', 3]]
pairs.repartitionAndSortWithinPartitions(2).glom().collect() 
# Output [[('a', 1), ('c', 3)], [('b', 2), ('d', 3)]]

With repartitionAndSortWithinPartitions() we ask for the data to be reshuffled into 2 partitions, and that's exactly what we get: 'a' and 'c' in one, 'b' and 'd' in the other. Within each partition the keys are sorted.

We can also repartition based on some condition, for example

pairs.repartitionAndSortWithinPartitions(2, 
                                         partitionFunc=lambda x: x == 'a').glom().collect()
# Output [[('b', 2), ('c', 3), ('d', 3)], [('a', 1)]]

As expected, we have two partitions: one with three sorted key pairs and the other with ('a', 1). To learn more about glom, refer to this link.
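
For readers following the rest of this thread in Scala, roughly the same placement can be sketched with a custom Partitioner (the class name below is made up for illustration, and a Spark shell with sc is assumed):

import org.apache.spark.Partitioner

// Mirrors the pyspark partitionFunc above: key "a" goes to partition 1, everything else to 0.
class APartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = if (key == "a") 1 else 0
}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 3)))
pairs.repartitionAndSortWithinPartitions(new APartitioner).glom().collect()
// Expected, by analogy with the pyspark output: Array(Array((b,2), (c,3), (d,3)), Array((a,1)))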