How to apply a groupBy condition and get all the columns in the result?

Time: 2018-01-31 13:05:58

Tags: apache-spark pyspark pyspark-sql

My DataFrame looks like this:

+------+-------+-----+----+-----+
| Title| Status|Suite|ID  |Time |
+------+-------+-----+----+-----+
|KIM   | Passed|ABC  |123 |20   |
|KJT   | Passed|ABC  |123 |10   |
|ZXD   | Passed|CDF  |123 |15   |
|XCV   | Passed|GHY  |113 |36   |
|KJM   | Passed|RTH  |456 |45   |
|KIM   | Passed|ABC  |115 |47   |
|JY    | Passed|JHJK |8963|74   |
|KJH   | Passed|SNMP |256 |47   |
|KJH   | Passed|ABC  |123 |78   |
|LOK   | Passed|GHY  |456 |96   |
|LIM   | Passed|RTH  |113 |78   |
|MKN   | Passed|ABC  |115 |74   |
|KJM   | Passed|GHY  |8963|74   |
+------+-------+-----+----+-----+

It can be created with:

df = sqlCtx.createDataFrame(
[
    ('KIM', 'Passed', 'ABC', '123',20),
    ('KJT', 'Passed', 'ABC', '123',10),
    ('ZXD', 'Passed', 'CDF', '123',15),
    ('XCV', 'Passed', 'GHY', '113',36),
    ('KJM', 'Passed', 'RTH', '456',45),
    ('KIM', 'Passed', 'ABC', '115',47),
    ('JY', 'Passed', 'JHJK', '8963',74),
    ('KJH', 'Passed', 'SNMP', '256',47),
    ('KJH', 'Passed', 'ABC', '123',78),
    ('LOK', 'Passed', 'GHY', '456',96),
    ('LIM', 'Passed', 'RTH', '113',78),
    ('MKN', 'Passed', 'ABC', '115',74),
    ('KJM', 'Passed', 'GHY', '8963',74),     
],('Title', 'Status', 'Suite', 'ID','Time')
)

I need to group by ID and aggregate on Time, and in the result I also need Title, Status, and Suite along with ID.

My output should be:

+------+-------+-----+----+-----+
| Title| Status|Suite|  ID|Time |
+------+-------+-----+----+-----+
|KIM   | Passed|ABC  |123 |30.75|
|XCV   | Passed|GHY  |113 |57   |
|KJM   | Passed|RTH  |456 |70.5 | 
|KIM   | Passed|ABC  |115 |60.5 |
|JY    | Passed|JHJK |8963|74   |
|KJH   | Passed|SNMP |256 |47   |
+------+-------+-----+----+-----+

I tried the following code, but it only gives me the ID and Time columns in the result:

from pyspark.sql.functions import mean

df.groupBy("ID").agg(mean("Time").alias("Time"))
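This behavior is expected: agg returns only the grouping key plus the aggregated expressions, so any other column you still need must itself be aggregated (for example, taken from the first row of each group). A plain-Python sketch of that per-group semantics (an illustration of the logic, not Spark code):

```python
from collections import defaultdict

# Rows mirroring the DataFrame above: (Title, Status, Suite, ID, Time)
rows = [
    ('KIM', 'Passed', 'ABC', '123', 20),
    ('KJT', 'Passed', 'ABC', '123', 10),
    ('ZXD', 'Passed', 'CDF', '123', 15),
    ('XCV', 'Passed', 'GHY', '113', 36),
    ('KJM', 'Passed', 'RTH', '456', 45),
    ('KIM', 'Passed', 'ABC', '115', 47),
    ('JY', 'Passed', 'JHJK', '8963', 74),
    ('KJH', 'Passed', 'SNMP', '256', 47),
    ('KJH', 'Passed', 'ABC', '123', 78),
    ('LOK', 'Passed', 'GHY', '456', 96),
    ('LIM', 'Passed', 'RTH', '113', 78),
    ('MKN', 'Passed', 'ABC', '115', 74),
    ('KJM', 'Passed', 'GHY', '8963', 74),
]

# Group rows by ID.
groups = defaultdict(list)
for row in rows:
    groups[row[3]].append(row)

# Keep Title/Status/Suite from the first row of each group; average Time.
result = {
    gid: (g[0][0], g[0][1], g[0][2], sum(r[4] for r in g) / len(g))
    for gid, g in groups.items()
}

print(result['123'])  # ('KIM', 'Passed', 'ABC', 30.75)
```

This reproduces the numbers in the expected output (e.g. ID 123 averages 20, 10, 15, 78 to 30.75, while keeping the first row's Title/Status/Suite).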

1 Answer:

Answer 0 (score: 2)

With the modified expected output, you can get arbitrary values with first:

from pyspark.sql.functions import avg, first

df.groupBy("id").agg(
    first("Title"), first("Status"), first("Suite"), avg("Time")
).toDF("id", "Title", "Status", "Suite", "Time").show()

# +----+-----+------+-----+-----+
# |  id|Title|Status|Suite| Time|
# +----+-----+------+-----+-----+
# | 113|  XCV|Passed|  GHY| 57.0|
# | 256|  KJH|Passed| SNMP| 47.0|
# | 456|  KJM|Passed|  RTH| 70.5|
# | 115|  KIM|Passed|  ABC| 60.5|
# |8963|   JY|Passed| JHJK| 74.0|
# | 123|  KIM|Passed|  ABC|30.75|
# +----+-----+------+-----+-----+

Original answer

It looks like you want drop_duplicates.

If you want to pick a specific row per group instead, see Find maximum row per group in Spark DataFrame.
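That linked approach keeps one whole row per group rather than aggregating. A plain-Python sketch of "maximum row per group" (an illustration of the logic, not Spark code), keeping the full row with the largest Time per ID and letting the first row win ties:

```python
# Rows mirroring the DataFrame above: (Title, Status, Suite, ID, Time)
rows = [
    ('KIM', 'Passed', 'ABC', '123', 20),
    ('KJT', 'Passed', 'ABC', '123', 10),
    ('ZXD', 'Passed', 'CDF', '123', 15),
    ('XCV', 'Passed', 'GHY', '113', 36),
    ('KJM', 'Passed', 'RTH', '456', 45),
    ('KIM', 'Passed', 'ABC', '115', 47),
    ('JY', 'Passed', 'JHJK', '8963', 74),
    ('KJH', 'Passed', 'SNMP', '256', 47),
    ('KJH', 'Passed', 'ABC', '123', 78),
    ('LOK', 'Passed', 'GHY', '456', 96),
    ('LIM', 'Passed', 'RTH', '113', 78),
    ('MKN', 'Passed', 'ABC', '115', 74),
    ('KJM', 'Passed', 'GHY', '8963', 74),
]

# For each ID, keep the entire row whose Time is largest.
best = {}
for row in rows:
    gid = row[3]
    if gid not in best or row[4] > best[gid][4]:
        best[gid] = row

print(best['123'])  # ('KJH', 'Passed', 'ABC', '123', 78)
```

Unlike the first/avg approach above, every column in the result comes from the same source row, which matters when Title, Status, and Suite must stay consistent with each other.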