How to use Spark's repartitionAndSortWithinPartitions?

Date: 2016-05-14 13:31:20

Tags: scala apache-spark

I'm trying to build a minimal working example of repartitionAndSortWithinPartitions in order to understand what it does. Here is what I have so far (it doesn't work; the distinct throws the values around so they end up out of order):

// tag each element with the id of the partition it currently lives in
def partval(partID: Int, iter: Iterator[Int]): Iterator[Tuple2[Int, Int]] = {
  iter.map( x => (partID, x)).toList.iterator
}

val part20to3_chaos = sc.parallelize(1 to 20, 3).distinct
val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(2)
part20to2_sorted.mapPartitionsWithIndex(partval).collect

but I get the error

Name: Compile Error
Message: <console>:22: error: value repartitionAndSortWithinPartitions is not a member of org.apache.spark.rdd.RDD[Int]
             val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(2)

I tried using the scaladoc, but couldn't find which class provides repartitionAndSortWithinPartitions. (Btw: this scaladoc is not impressive: why is MapPartitionsRDD missing? How do I search for a method?)

Realising I need a partitioner object, I next tried

val rangePartitioner = new org.apache.spark.RangePartitioner(2, part20to3_chaos)
val part20to2_sorted = part20to3_chaos.repartitionAndSortWithinPartitions(rangePartitioner)
part20to2_sorted.mapPartitionsWithIndex(partval).collect

but got

Name: Compile Error
Message: <console>:22: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[Int]
 required: org.apache.spark.rdd.RDD[_ <: Product2[?,?]]
Error occurred in an application involving default arguments.
         val rPartitioner = new org.apache.spark.RangePartitioner(2, part20to3_chaos)

How do I get this to compile? Could I get a working example, please?

2 Answers:

Answer 0 (score: 8)

Your problem is that part20to3_chaos is an RDD[Int], while OrderedRDDFunctions.repartitionAndSortWithinPartitions is a method that operates on an RDD[(K, V)], where K is the key and V is the value.

repartitionAndSortWithinPartitions will first repartition the data according to the provided partitioner, and then sort by the keys:

/**
 * Repartition the RDD according to the given partitioner and,
 * within each resulting partition, sort records by their keys.
 *
 * This is more efficient than calling `repartition` and then sorting within each partition
 * because it can push the sorting down into the shuffle machinery.
 */
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
}

So it looks like it isn't exactly what you're looking for.

If you want a plain old sort, you can use sortBy, since it doesn't require a key:

scala> val toTwenty = sc.parallelize(1 to 20, 3).distinct
toTwenty: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at distinct at <console>:33

scala> val sorted = toTwenty.sortBy(identity, true, 3).collect
sorted: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

Here you pass sortBy the ordering (ascending or descending) and the number of partitions you want to create.
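
If the goal is still to see repartitionAndSortWithinPartitions itself in action, a minimal sketch (not from the original answer; it assumes a Spark shell with sc available, and the variable names are illustrative) is to key the RDD[Int] first, e.g. with keyBy(identity), so that both RangePartitioner and repartitionAndSortWithinPartitions operate on an RDD[(Int, Int)]:

// Sketch only: key the elements so OrderedRDDFunctions becomes available.
val part20to3_keyed = sc.parallelize(1 to 20, 3).distinct.keyBy(identity)   // RDD[(Int, Int)]

val rangePartitioner = new org.apache.spark.RangePartitioner(2, part20to3_keyed)
val part20to2_sorted = part20to3_keyed.repartitionAndSortWithinPartitions(rangePartitioner)

// Each partition now holds a contiguous, key-sorted range, roughly 1..10 and 11..20
// (the exact cut point depends on RangePartitioner's sampling).
part20to2_sorted.mapPartitionsWithIndex((partID, iter) => iter.map { case (k, _) => (partID, k) }).collect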

Answer 1 (score: 2)

Let me try to explain repartitionAndSortWithinPartitions using pyspark.

Suppose you have a dataset of pairs:

pairs  = sc.parallelize([["a",1], ["b",2], ["c",3], ["d",3]])

pairs.collect() 
# Output [['a', 1], ['b', 2], ['c', 3], ['d', 3]]
pairs.repartitionAndSortWithinPartitions(2).glom().collect() 
# Output [[('a', 1), ('c', 3)], [('b', 2), ('d', 3)]]

With repartitionAndSortWithinPartitions() we ask for the data to be reshuffled into 2 partitions, and that's exactly what we get: 'a' and 'c' in one, 'b' and 'd' in the other. Within each partition the keys are sorted.

We can also repartition based on some condition, for example

pairs.repartitionAndSortWithinPartitions(2, 
                                         partitionFunc=lambda x: x == 'a').glom().collect()
# Output [[('b', 2), ('c', 3), ('d', 3)], [('a', 1)]]

As expected, we have two partitions: one with three sorted key pairs and the other with ('a', 1). To learn more about glom, refer to this link.
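
For readers following the rest of this thread in Scala, roughly the same placement can be sketched with a custom Partitioner (the class name below is made up for illustration, and a Spark shell with sc is assumed):

import org.apache.spark.Partitioner

// Mirrors the pyspark partitionFunc above: key "a" goes to partition 1, everything else to 0.
class APartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = if (key == "a") 1 else 0
}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 3)))
pairs.repartitionAndSortWithinPartitions(new APartitioner).glom().collect()
// Expected, by analogy with the pyspark output: Array(Array((b,2), (c,3), (d,3)), Array((a,1)))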