Is there a function like a foreachGroup method in Spark?

Time: 2016-03-23 02:27:36

Tags: apache-spark

I am doing some computations with Apache Spark. I run a query like this:

SELECT country, school, subject, avg(score) FROM table GROUP BY country, school, subject

The result looks like this:

USA, school1, math, 99
USA, school1, sport, 98
USA, school2, math, 90
ENG, school1, science, 100

Now, for each school (identified by country + school_id), we need to get the top three subjects by score.

I am considering two approaches.

1. If there were some method called foreachGroup, then I would run code like:

result.foreachGroup(get_top_3)


2. I know there is a method called repartition. Then I guess I could do something like this (a rough sketch of this idea follows below):

result.repartition(country, school)  # repartition by country and school
foreachPartition(get_top_3)
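
For illustration, here is a minimal Scala sketch of how the idea behind approach 2 could be written against the real RDD API, with groupByKey standing in for the imagined foreachGroup. It assumes `result` is the DataFrame produced by the query above, with columns (country, school, subject, avg_score); top3PerSchool is just an illustrative name:

val top3PerSchool = result.rdd
  .map(r => ((r.getString(0), r.getString(1)),   // key: (country, school)
             (r.getString(2), r.getDouble(3))))  // value: (subject, avg score)
  .groupByKey()                                  // one group per school
  .flatMapValues(_.toSeq.sortBy(-_._2).take(3))  // keep the three highest scores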

I am not familiar with Apache Spark, so I am not sure which of these is feasible or better. Please give some suggestions, and if you have a better approach than these, please advise as well.

1 Answer:

Answer 0 (score: 0)

After setting up some test data:

case class Rec(country: String, school: String, subject: String, score: Double)  // assumed schema

import sqlContext.implicits._                        // needed for .toDF
import org.apache.spark.sql.functions.collect_list   // needed for the aggregation below

val df = sc.parallelize(Array(
  Rec("USA", "school1", "math", 98.0),
  Rec("USA", "school1", "lit",  96.0),
  Rec("USA", "school1", "trig", 92.0),
  Rec("USA", "school1", "eng",  94.0)
)).toDF

You do a groupBy() with collect_list(), then explode the top three per group:

val top3bySchool = df.groupBy($"country", $"school")
  .agg(collect_list($"subject") as "subjectList", collect_list($"score") as "scoreList")
  .explode($"subjectList", $"scoreList") { r =>
    // Pair each subject with its score, sort by score descending,
    // and keep at most the top three per (country, school) group.
    val pairs = r.getSeq[String](0).zip(r.getSeq[Double](1)).sortWith(_._2 > _._2)
    pairs.take(3)
  }
  .select($"country", $"school", $"_1" as "subject", $"_2" as "score")