I am using Apache Spark for some computations, and I run queries like this:
SELECT country, school, subject, avg(score) FROM table GROUP BY country, school, subject
The result looks like this:
USA, school1, math, 99
USA, school1, sport, 98
USA, school2, math, 90
ENG, school1, science, 100
Now, for each school (identified by country + school_id), we need to get the top three subjects by score.
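To make the requirement concrete, here is a minimal sketch in plain Scala collections (no Spark) of what "top three per school" means for the rows above. The names `Avg`, `TopPerSchool`, and `topN` are hypothetical and exist only for illustration; the scores are written as doubles on the assumption that avg(score) yields a floating-point value.

```scala
// Plain Scala collections only; no Spark involved.
// `Avg`, `TopPerSchool`, and `topN` are hypothetical names used for illustration.
final case class Avg(country: String, school: String, subject: String, score: Double)

object TopPerSchool {
  // The aggregated rows from the query result above.
  val rows: Seq[Avg] = Seq(
    Avg("USA", "school1", "math", 99.0),
    Avg("USA", "school1", "sport", 98.0),
    Avg("USA", "school2", "math", 90.0),
    Avg("ENG", "school1", "science", 100.0)
  )

  // Group by (country, school), sort each group by score descending,
  // and keep at most n rows per group.
  def topN(data: Seq[Avg], n: Int): Map[(String, String), Seq[Avg]] =
    data
      .groupBy(r => (r.country, r.school))
      .map { case (key, group) => key -> group.sortBy(r => -r.score).take(n) }

  def main(args: Array[String]): Unit =
    topN(rows, 3).foreach { case ((country, school), top) =>
      println(s"$country/$school: " + top.map(r => s"${r.subject}=${r.score}").mkString(", "))
    }
}
```

A school with fewer than three subjects simply keeps all of its rows, which is the behavior wanted from the Spark version as well.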
I am considering two approaches.
1. If there is some method called foreachGROUP, then I will run code like
result.foreachGROUP(get_top_3)
2. I know there is a method called repartition. Then I guess I can do something like:
result.repartition( country,school ) # repartition by country and school
foreachPartition(get_top_3)
I am not familiar with Apache Spark, so I am not sure which approach is feasible or better. Please offer some suggestions, and if you know a better way than these, please advise.
Answer 0 (score: 0)
After setting up the test data:
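// Assumes spark-shell, where `sc` and the implicits needed for toDF are already in scope.
// Rec was not shown in the original; its fields are inferred from the column names used later.
case class Rec(country: String, school: String, subject: String, score: Double)

val df = sc.parallelize(Array(
  Rec("USA", "school1", "math", 98.0),
  Rec("USA", "school1", "lit", 96.0),
  Rec("USA", "school1", "trig", 92.0),
  Rec("USA", "school1", "eng", 94.0)
)).toDF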
You do a groupBy() with collect_list(), then explode to keep the top three:
import org.apache.spark.sql.functions.collect_list

val top3bySchool = df.groupBy($"country", $"school")
  // Collect each school's subjects and scores into two parallel lists.
  .agg(collect_list($"subject") as "subjectList", collect_list($"score") as "scoreList")
  // Note: DataFrame.explode is deprecated since Spark 2.0.
  .explode($"subjectList", $"scoreList") { r =>
    // Pair each subject with its score, sort by score descending, keep at most three.
    r.getSeq[String](0).zip(r.getSeq[Double](1))
      .sortWith((a, b) => a._2 > b._2)
      .take(3)
  }
  .select($"country", $"school", $"_1" as "subject", $"_2" as "score")
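As an alternative to the collect_list and explode approach above, here is a hedged sketch using a window function, which is a common way to express "top N per group" in Spark SQL. It is not part of the original answer. It assumes Spark 2.x or later with a `SparkSession`; the object name `TopThreeWindow`, the helper `top3`, and the `local[*]` master are assumptions chosen for a quick local run.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// `TopThreeWindow` and `top3` are hypothetical names used for illustration.
object TopThreeWindow {
  // Number the rows inside each (country, school) partition by descending score,
  // then keep only rows numbered 1 to 3.
  def top3(df: DataFrame): DataFrame = {
    val bySchool = Window
      .partitionBy(col("country"), col("school"))
      .orderBy(desc("score"))
    df.withColumn("rn", row_number().over(bySchool))
      .where(col("rn") <= 3)
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine.
    val spark = SparkSession.builder().master("local[*]").appName("top3").getOrCreate()
    import spark.implicits._

    // Same test data as in the answer above.
    val df = Seq(
      ("USA", "school1", "math", 98.0),
      ("USA", "school1", "lit", 96.0),
      ("USA", "school1", "trig", 92.0),
      ("USA", "school1", "eng", 94.0)
    ).toDF("country", "school", "subject", "score")

    top3(df).show()
    spark.stop()
  }
}
```

Note that row_number() keeps exactly three rows per school even when scores tie, picking among tied rows arbitrarily. If tied rows should all be kept, rank() or dense_rank() could be used in its place.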