Here is a sample of the CSV file I am working with:
life id,policy id,benefit id,date of commencment,status
xx_0,0,0,11/11/2017,active
xx_0,0,0,12/12/2017,active
axb_0,1,0,10/01/2015,active
axb_0,1,0,11/10/2014,active
fxa_2,0,1,01/02/203,active
What I want to do is group the data by (life id + policy id + benefit id), sort each group by date, and then take the most recent (last) element of each group to run some checks. What is the best way to do this in Spark?
Answer 0 (score: 1)
The best way to do this in Spark is probably with DataFrames (see How to select the first row of each group?). But I read that you want to avoid them. A pure-RDD solution can be written as follows:
import java.text.SimpleDateFormat

val rdd = sc.parallelize(Seq(
  "xx_0,0,0,11/11/2017,active",
  "xx_0,0,0,12/12/2017,active",
  "axb_0,1,0,10/01/2015,active",
  "axb_0,1,0,11/10/2014,active",
  "fxa_2,0,1,01/02/203,active"))

rdd
  .map(_.split(","))
  // key: "life id,policy id,benefit id"; value: (timestamp, status)
  .map(x => x.slice(0, 3).mkString(",") ->
    (new SimpleDateFormat("dd/MM/yyyy").parse(x(3)).getTime, x(4)))
  // keep the record with the most recent date for each key
  .reduceByKey((a, b) => if (a._1 > b._1) a else b)
  .map { case (key, (time, status)) => s"$key,$time,$status" }
  .collect.foreach(println)
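For completeness, the DataFrame route mentioned above can be sketched with a window function. This is a minimal sketch, not the answer's code: the file name `policies.csv` and the `spark` SparkSession are assumptions, and the column names are taken from the CSV header in the question.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, to_date}

// Load the CSV using its header row, parsing the date column.
val df = spark.read.option("header", "true").csv("policies.csv")
  .withColumn("commencement",
    to_date(col("date of commencment"), "dd/MM/yyyy"))

// Rank rows within each (life id, policy id, benefit id) group, newest date first.
val w = Window
  .partitionBy("life id", "policy id", "benefit id")
  .orderBy(col("commencement").desc)

// Keep only the most recent row of each group.
val latest = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

latest.show()
```

The window approach avoids a shuffle-heavy `groupBy` followed by sorting in user code, and lets Catalyst optimize the whole pipeline.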