Pyspark: find the first occurrence of the maximum value

Date: 2018-05-15 08:52:03

Tags: apache-spark pyspark pyspark-sql

My dataset contains the speed recorded for multiple vehicles as a function of time. Each vehicle has a specific ID.

The data looks like this:

+-----------------+-----------+------+
|        timestamp|         ID| speed|
+-----------------+-----------+------+
|1.485320164625E12|-2140210972|139.25|
| 1.48532016475E12|-2140210972| 139.5|
|1.485320164875E12|-2140210972| 140.0|
|   1.485320165E12|-2140210972| 141.5|
|1.485320165125E12|-2140210972| 142.0|
| 1.48532016525E12|-2140210972|141.75|
|1.485320165375E12|-2140210972|141.25|
|  1.4853201655E12|-2140210972| 142.5|
|1.485320165625E12|-2140210972|142.75|
| 1.48532016575E12|-2140210972| 143.0|
|1.485320165875E12|-2140210972|143.75|
|   1.485320166E12|-2140210972| 144.5|
|1.485320166125E12|-2140210972| 144.0|
| 1.48532016625E12|-2140210972|144.75|
|1.485320166375E12|-2140210972| 144.5|
|  1.4853201665E12|-2140210972| 145.5|
|1.485320166625E12|-2140210972|145.75|
| 1.48532016675E12|-2140210972|144.25|
|1.485320166875E12|-2140210972|145.25|
|   1.485320167E12|-2140210972| 144.5|
+-----------------+-----------+------+
only showing top 20 rows

I want to find the maximum speed and get the first timestamp at which that maximum occurs.

I tried the following:

from pyspark.sql import functions as F
df.groupBy("ID").agg(F.first(F.max("speed"))).show()

But I get the following error:

'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query'
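Going by the error message, the inner aggregate apparently has to be computed in a separate step. A minimal sketch of a nesting-free version (using the column names from the sample above) would be:

from pyspark.sql import functions as F

# compute the per-ID maximum speed in its own aggregation...
max_df = df.groupBy("ID").agg(F.max("speed").alias("speed"))

# ...then join back on (ID, speed) to keep only rows at the max speed,
# and take the earliest timestamp among them
result = (df.join(max_df, on=["ID", "speed"])
            .groupBy("ID", "speed")
            .agg(F.min("timestamp").alias("timestamp")))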

I also thought about doing something like this:

from pyspark.sql import Window, functions as F

rank_win = Window.partitionBy("ID", "speed").orderBy("timestamp")
max_win = Window.partitionBy("ID")
result = (df.withColumn("rank", F.rank().over(rank_win))
            .withColumn("max_speed", F.max("speed").over(max_win))
            .filter((F.col("speed") == F.col("max_speed")) & (F.col("rank") == 1)))

But this seems overly complicated for such a simple operation, doesn't it?
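For reference, a more compact alternative is possible; this is a sketch that assumes the timestamp is numeric (as in the sample above), so it can be negated, and it relies on Spark comparing structs lexicographically:

from pyspark.sql import functions as F

# structs compare field by field, so the max struct has the highest speed;
# negating the timestamp makes the earliest occurrence win among ties
best = F.max(F.struct(F.col("speed"), (-F.col("timestamp")).alias("neg_ts")))
result = (df.groupBy("ID")
            .agg(best.alias("best"))
            .select("ID",
                    F.col("best.speed").alias("speed"),
                    (-F.col("best.neg_ts")).alias("timestamp")))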

0 Answers:

No answers yet.