Optimizing Spark code

Time: 2016-10-24 09:11:18

Tags: scala apache-spark

I'm trying to improve my Spark code:

import scala.util.control.Breaks._ // needed for breakable/break

val lst = disOneRDDM.filter(x => x._2._1 == 1).keys.collect
val disTwoRDDM = disOneRDDM.map(x => {
  var b: Boolean = false
  breakable {
    for (str <- x._2._2)
      if (lst.contains(str)) {
        b = true
        break
      }
  }
  if (b)
    (x._1, (Math.min(2, x._2._1), x._2._2))
  else
    x
}).cache

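I have an RDD of the form (String, (Int, List[String])). Each element of the List[String] has its own entry in this RDD, where it serves as the key. A sample input is shown below (this is disOneRDDM in my code):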

("abc",(10,List("hij","efg","klm")))
("efg",(1,List("jhg","Beethan","abc","ert")))
("Beethan",(0,List("efg","vcx","zse")))
("vcx",(1,List("czx","Beethan","abc")))
("zse",(1,List("efg","Beethan","nbh")))
("hij",(10,List("vcx","klm","zse")))
("jhg",(10,List("ghb","cdz","awq","swq")))
...

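My goal is to find, in each List[String], the elements whose Int value is 1, and to change the entry's own Int to min(2, current_Int_value). For example, in the sample input, the entry "abc" has a list containing "efg", whose Int value is 1, and the entry "hij" has "vcx". So I expect output of the form: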

("abc",(2,List("hij","efg","klm")))
("efg",(1,List("jhg","Beethan","abc","ert")))
("Beethan",(0,List("efg","vcx","zse")))
("vcx",(1,List("czx","Beethan","abc")))
("zse",(1,List("efg","Beethan","nbh")))
("hij",(2,List("vcx","klm","zse")))
("jhg",(10,List("ghb","cdz","awq","swq")))
...

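The RDD is large, and my approach works but is very slow. In the code above, I filter the RDD for entries with Int value 1 and collect their keys into the list lst. Then, to find the entries that should get Int value 2, I iterate over each entry's list and check whether lst contains the item. If it does, I break out of the loop and assign the appropriate Int value. Is there a faster way to do this, for example without collecting a huge RDD into a list?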

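2 Answers:

Answer 0: (score: 2)

As @a-spoty-spot commented, if lst does not contain too many unique values, your best approach is to change it to a Set (which removes duplicates) and use a broadcast.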
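The Set-plus-broadcast suggestion could look roughly like the sketch below. It is not part of the original answer: the object name BroadcastLookup, the helper relabel, and the SparkSession parameter are illustrative assumptions, and it presumes the set of keys with Int value 1 fits comfortably in driver and executor memory.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper object; all names here are illustrative, not from the thread.
object BroadcastLookup {
  type Entry = (String, (Int, List[String]))

  // Per-record rule: cap the Int at 2 when any list element is in `ones`.
  def relabel(entry: Entry, ones: Set[String]): Entry = {
    val (key, (value, neighbours)) = entry
    if (neighbours.exists(ones.contains)) (key, (math.min(2, value), neighbours))
    else entry
  }

  def relabelAll(spark: SparkSession, rdd: RDD[Entry]): RDD[Entry] = {
    // Keys whose Int is 1, collected once into a hash set
    // (assumption: this set is small enough to hold in memory).
    val ones: Set[String] = rdd.filter(_._2._1 == 1).keys.collect().toSet
    // Ship the set to each executor once instead of with every task.
    val onesBc = spark.sparkContext.broadcast(ones)
    rdd.map(entry => relabel(entry, onesBc.value))
  }
}
```

A hash Set makes each membership check roughly constant time, whereas contains on a collected Array scans it linearly for every list element.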

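Otherwise (if the list of unique keys is still large), here is a solution that does not use collect at all, which means it can handle data of any size. However, because it increases the size of the RDD with flatMap and performs a join (which requires a shuffle), I'm not sure it will be faster; that depends on the specifics of your data and your cluster.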

import org.apache.spark.rdd.RDD

// create the lookup "map" (the int values are actually irrelevant, we just need the keys)
val lookup: RDD[(String, Int)] = disOneRDDM.cache().filter(_._2._1 == 1).map(t => (t._1, 1))

val result = disOneRDDM
  .flatMap { // break up each record into individual records for join
    case (k, (i, list)) => list.map(s => (s, (k, i)))
  }
  .leftOuterJoin(lookup).map { // left join with lookup and change int values if we found a match
    case (item, ((k, i), Some(_))) => (k, (Math.min(2, i), item))
    case (item, ((k, i), _)) => (k, (i, item))
  }
  .groupByKey().map { // group by key to merge back to lists, while mapping to the desired structure
    case (k, iter) =>
      val l = iter.toList
      (k, (l.map(_._1).min, l.map(_._2)))
  }

result.foreach(println)
// (Beethan,(0,List(zse, efg, vcx)))
// (jhg,(10,List(cdz, swq, ghb, awq)))
// (hij,(2,List(klm, zse, vcx)))
// (zse,(1,List(Beethan, nbh, efg)))
// (efg,(1,List(Beethan, jhg, abc, ert)))
// (vcx,(1,List(Beethan, czx, abc)))
// (abc,(2,List(klm, hij, efg)))
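A possible variant of the join approach, sketched under assumptions rather than taken from the answer: it uses the join only to work out which keys need updating, then joins that small result back onto the original RDD. This way the lists keep their original order and entries with an empty list are not dropped by the flatMap. The names FlaggedKeysJoin, ones, and flagged are hypothetical. It still shuffles several times, so it is not guaranteed to be faster.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper object; names are illustrative only.
object FlaggedKeysJoin {
  type Entry = (String, (Int, List[String]))

  def relabelAll(rdd: RDD[Entry]): RDD[Entry] = {
    // Keys whose Int is 1; the Unit value is a placeholder, only the key matters.
    val ones: RDD[(String, Unit)] =
      rdd.filter(_._2._1 == 1).map { case (k, _) => (k, ()) }

    // Owners that have at least one list element with Int value 1.
    val flagged: RDD[(String, Unit)] = rdd
      .flatMap { case (owner, (_, list)) => list.map(item => (item, owner)) }
      .join(ones)
      .map { case (_, (owner, _)) => owner }
      .distinct()
      .map(owner => (owner, ()))

    // Left join keeps every original entry, flagged or not.
    rdd.leftOuterJoin(flagged).map {
      case (k, ((i, list), Some(_))) => (k, (math.min(2, i), list))
      case (k, ((i, list), None))    => (k, (i, list))
    }
  }
}
```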

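Answer 1: (score: 1)

If you are willing to use the DataFrames API instead of RDDs, here is another option that might simplify the code somewhat (and improve performance):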

import scala.collection.mutable
import org.apache.spark.sql.functions.{first, udf}
import spark.implicits._ // assumes a SparkSession named `spark`; enables toDF and the $"..." syntax

// UDF to check if string contained in array - will be used for the join
val arrayContains = udf { (a: mutable.WrappedArray[String], s: String) => a.contains(s) }

// create Dataframe from RDD and create the filtered lookupDF
val df = disOneRDDM.map { case (k, (v, l)) => (k, v, l) }.toDF("key", "val", "list").cache()
val lookupDf = df.filter($"val" === 1).select($"key" as "match")

// join, groupBy to remove the duplicates while collecting non-null matches, and perform transformation on "val"
val resultDF = df
  .join(lookupDf, arrayContains($"list", $"match"), "leftouter")
  .groupBy($"key").agg(
    first("val") as "val",
    first("list") as "list",
    first("match", ignoreNulls = true) as "match")
  .selectExpr("key", "IF(match IS NULL OR val < 2, val, 2) as val", "list")

resultDF.show()
// +-------+---+--------------------+
// |    key|val|                list|
// +-------+---+--------------------+
// |    zse|  1| [efg, Beethan, nbh]|
// |    efg|  1|[jhg, Beethan, ab...|
// |    hij|  2|     [vcx, klm, zse]|
// |Beethan|  0|     [efg, vcx, zse]|
// |    vcx|  1| [czx, Beethan, abc]|
// |    abc|  2|     [hij, efg, klm]|
// |    jhg| 10|[ghb, cdz, awq, swq]|
// +-------+---+--------------------+
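A join whose condition is a UDF is not an equi-join, so Spark typically cannot use a hash or sort-merge join for it. One possible alternative, offered as a sketch and not as the answerer's method, is to explode the list and join on equality instead. The object name ExplodeJoin and the column names item and hit are assumptions introduced for illustration; the input is expected to have the key, val, and list columns used above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, least, lit, when}

// Hypothetical helper object; names are illustrative only.
object ExplodeJoin {
  // Expects columns key: String, val: Int, list: Array[String].
  def relabel(df: DataFrame): DataFrame = {
    // Keys whose val is 1, renamed so they can be matched against list items.
    val ones = df.filter(col("val") === 1).select(col("key").as("item"))

    // One row per (key, list item), equi-joined with `ones`,
    // reduced to the distinct keys that had at least one match.
    val flagged = df
      .select(col("key"), explode(col("list")).as("item"))
      .join(ones, Seq("item"))
      .select("key")
      .distinct()
      .withColumn("hit", lit(true))

    // Left join keeps every row; cap val at 2 only where a match was found.
    df.join(flagged, Seq("key"), "left_outer")
      .withColumn("val",
        when(col("hit").isNotNull, least(col("val"), lit(2))).otherwise(col("val")))
      .drop("hit")
  }
}
```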