Is there a function like a foreachGroup method in Spark?

Time: 2016-03-23 02:27:36

Tags: apache-spark

I am doing some computations with Apache Spark. I run a query like this:

SELECT country, school, subject, avg(score) FROM table GROUP BY country, school, subject

The result looks like this:

USA, school1, math, 99
USA, school1, sport, 98
USA, school2, math, 90
ENG, school1, science, 100

Now, for each school (identified by country + school_id), we need to get the top three subjects by score.

I am considering two approaches.

1. If there were some method called foreachGroup, then I would run code like:

result.foreachGroup(get_top_3)


2. I know there is a method called repartition. Then I guess I could do something like this (a rough sketch of this idea follows below):

result.repartition(country, school)  # repartition by country and school
foreachPartition(get_top_3)
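
For illustration, here is a minimal Scala sketch of how the idea behind approach 2 could be written against the real RDD API, with groupByKey standing in for the imagined foreachGroup. It assumes `result` is the DataFrame produced by the query above, with columns (country, school, subject, avg_score); top3PerSchool is just an illustrative name:

val top3PerSchool = result.rdd
  .map(r => ((r.getString(0), r.getString(1)),   // key: (country, school)
             (r.getString(2), r.getDouble(3))))  // value: (subject, avg score)
  .groupByKey()                                  // one group per school
  .flatMapValues(_.toSeq.sortBy(-_._2).take(3))  // keep the three highest scores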

I am not familiar with Apache Spark, so I am not sure which of these is feasible or better. Please give some suggestions, and if you have a better approach than these, please advise as well.

1 Answer:

Answer 0 (score: 0)

After setting up some test data:

case class Rec(country: String, school: String, subject: String, score: Double)  // assumed schema

import sqlContext.implicits._                        // needed for .toDF
import org.apache.spark.sql.functions.collect_list   // needed for the aggregation below

val df = sc.parallelize(Array(
  Rec("USA", "school1", "math", 98.0),
  Rec("USA", "school1", "lit",  96.0),
  Rec("USA", "school1", "trig", 92.0),
  Rec("USA", "school1", "eng",  94.0)
)).toDF

You do a groupBy() with collect_list(), then explode the top three per group:

val top3bySchool = df.groupBy($"country", $"school")
  .agg(collect_list($"subject") as "subjectList", collect_list($"score") as "scoreList")
  .explode($"subjectList", $"scoreList") { r =>
    // Pair each subject with its score, sort by score descending,
    // and keep at most the top three per (country, school) group.
    val pairs = r.getSeq[String](0).zip(r.getSeq[Double](1)).sortWith(_._2 > _._2)
    pairs.take(3)
  }
  .select($"country", $"school", $"_1" as "subject", $"_2" as "score")