Pyspark: find the first occurrence of the maximum value

Date: 2018-05-15 08:52:03

Tags: apache-spark pyspark pyspark-sql

My dataset contains the speed recorded for multiple vehicles as a function of time. Each vehicle has a specific ID.

The data looks like this:

+-----------------+-----------+------+
|        timestamp|         ID| speed|
+-----------------+-----------+------+
|1.485320164625E12|-2140210972|139.25|
| 1.48532016475E12|-2140210972| 139.5|
|1.485320164875E12|-2140210972| 140.0|
|   1.485320165E12|-2140210972| 141.5|
|1.485320165125E12|-2140210972| 142.0|
| 1.48532016525E12|-2140210972|141.75|
|1.485320165375E12|-2140210972|141.25|
|  1.4853201655E12|-2140210972| 142.5|
|1.485320165625E12|-2140210972|142.75|
| 1.48532016575E12|-2140210972| 143.0|
|1.485320165875E12|-2140210972|143.75|
|   1.485320166E12|-2140210972| 144.5|
|1.485320166125E12|-2140210972| 144.0|
| 1.48532016625E12|-2140210972|144.75|
|1.485320166375E12|-2140210972| 144.5|
|  1.4853201665E12|-2140210972| 145.5|
|1.485320166625E12|-2140210972|145.75|
| 1.48532016675E12|-2140210972|144.25|
|1.485320166875E12|-2140210972|145.25|
|   1.485320167E12|-2140210972| 144.5|
+-----------------+-----------+------+
only showing top 20 rows

I want to find the maximum speed and get the first timestamp at which that maximum occurs.

I tried the following:

from pyspark.sql import functions as F
df.groupBy("ID").agg(F.first(F.max("speed"))).show()

But I get the following error:

'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query'
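Going by the error message, the inner aggregate apparently has to be computed in a separate step. A minimal sketch of a nesting-free version (using the column names from the sample above) would be:

from pyspark.sql import functions as F

# compute the per-ID maximum speed in its own aggregation...
max_df = df.groupBy("ID").agg(F.max("speed").alias("speed"))

# ...then join back on (ID, speed) to keep only rows at the max speed,
# and take the earliest timestamp among them
result = (df.join(max_df, on=["ID", "speed"])
            .groupBy("ID", "speed")
            .agg(F.min("timestamp").alias("timestamp")))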

I also thought about doing something like this:

from pyspark.sql import Window, functions as F

rank_win = Window.partitionBy("ID", "speed").orderBy("timestamp")
max_win = Window.partitionBy("ID")
result = (df.withColumn("rank", F.rank().over(rank_win))
            .withColumn("max_speed", F.max("speed").over(max_win))
            .filter((F.col("speed") == F.col("max_speed")) & (F.col("rank") == 1)))

But this seems overly complicated for such a simple operation, doesn't it?
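For reference, a more compact alternative is possible; this is a sketch that assumes the timestamp is numeric (as in the sample above), so it can be negated, and it relies on Spark comparing structs lexicographically:

from pyspark.sql import functions as F

# structs compare field by field, so the max struct has the highest speed;
# negating the timestamp makes the earliest occurrence win among ties
best = F.max(F.struct(F.col("speed"), (-F.col("timestamp")).alias("neg_ts")))
result = (df.groupBy("ID")
            .agg(best.alias("best"))
            .select("ID",
                    F.col("best.speed").alias("speed"),
                    (-F.col("best.neg_ts")).alias("timestamp")))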

0 Answers:

No answers yet.