Question

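I'm trying to find an optimized way to generate a list of unique co-location pairings. I have looked at doing this with a series of flatMap and distinct calls, but I've found that flatMap does not perform well when run over millions of records. Any help with optimizing this would be greatly appreciated.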

The dataset is (geohash, id), and I'm running this on a 30-node cluster.

val rdd = sc.parallelize(Seq(("gh5", "id1"), ("gh4", "id1"), ("gh5", "id2"), ("gh5", "id3")))

val uniquePairings = rdd.groupByKey().flatMap { case (_, ids) =>
  // every 2-element combination of the sorted ids that share a geohash
  ids.toList.sorted.combinations(2)
    .collect { case Seq(x, y) if x != y => (x, y) }
}.distinct()

Expected output: Array(("id1","id2"), ("id1","id3"), ("id2","id3"))

Answer 1

A simple join should be more than enough. For example, with DataFrames:

val df = rdd.toDF
df.as("df1").join(df.as("df2"),
  ($"df1._1" === $"df2._1") && 
  ($"df1._2" < $"df2._2")
).select($"df1._2", $"df2._2")

Or with Datasets:

val ds = rdd.toDS
ds.as("ds1").joinWith(ds.as("ds2"),
  ($"ds1._1" === $"ds2._1") && 
  ($"ds1._2" < $"ds2._2")
).map{ case ((_, x), (_, y)) => (x, y)}
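
For anyone running this outside the shell, here is a minimal self-contained sketch of the DataFrame variant. It assumes Spark 2.x or later and a local SparkSession. The object name `SelfJoinPairs`, the helper `uniquePairs`, and the column names `geohash` and `id` are illustrative and not part of the answer above. It also appends a `distinct()`, on the assumption that two ids can share more than one geohash and would otherwise be emitted once per shared geohash.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SelfJoinPairs {
  // Hypothetical helper: expects columns named "geohash" and "id" (assumed names).
  def uniquePairs(df: DataFrame): DataFrame =
    df.as("a")
      .join(df.as("b"),
        (col("a.geohash") === col("b.geohash")) && // same location
          (col("a.id") < col("b.id")))             // keep each unordered pair once, drop self-pairs
      .select(col("a.id").as("id1"), col("b.id").as("id2"))
      .distinct()                                  // collapse pairs that co-occur in several geohashes

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-pairs")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("gh5", "id1"), ("gh4", "id1"), ("gh5", "id2"), ("gh5", "id3"))
      .toDF("geohash", "id")

    uniquePairs(df).show()
    spark.stop()
  }
}
```

Because the join condition includes an equality on the geohash, Spark can plan this as an equi-join rather than comparing every row with every other row.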

Answer 2

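Take a look at the cartesian function. It produces an RDD containing every possible combination of elements from the input RDDs. Note that this is an expensive operation (the resulting RDD has size N^2).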
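
A rough sketch of how cartesian could be applied to the question's data, assuming Spark 2.x or later and a local SparkSession. The object name `CartesianPairs` and the helper `coLocatedPairs` are made up for illustration. The filter keeps only pairs in the same geohash with the ids in ascending order, which is one possible way to get unique pairings out of the full product.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CartesianPairs {
  // Hypothetical helper: input is (geohash, id), output is unique (id, id) pairs.
  def coLocatedPairs(rdd: RDD[(String, String)]): RDD[(String, String)] =
    rdd.cartesian(rdd)                                                 // all N^2 combinations
      .filter { case ((g1, a), (g2, b)) => g1 == g2 && a < b }         // same geohash, ordered ids
      .map { case ((_, a), (_, b)) => (a, b) }
      .distinct()                                                      // drop repeats across geohashes

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cartesian-pairs")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(
      Seq(("gh5", "id1"), ("gh4", "id1"), ("gh5", "id2"), ("gh5", "id3")))

    coLocatedPairs(rdd).collect().sorted.foreach(println)
    spark.stop()
  }
}
```

Since the full product is materialized before filtering, this is likely to be much slower than a keyed join on millions of records; it is shown only to make the suggestion concrete.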

