New column with the values of the previous rows

Time: 2019-05-10 10:25:12

Tags: python dataframe pyspark pyspark-sql

I am working with PySpark and have a DataFrame like this:

+---+-----+
| id|value|
+---+-----+
|  1|   65|
|  2|   66|
|  3|   65|
|  4|   68|
|  5|   71|
+---+-----+

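I would like to use PySpark to produce a DataFrame like the one below, where prev_value holds all values from the earlier rows (ordered by id), most recent first and separated by commas: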

+---+-----+-------------+
| id|value| prev_value  |
+---+-----+-------------+
| 1 | 65  | null        |
| 2 | 66  | 65          |
| 3 | 65  | 66,65       |
| 4 | 68  | 65,66,65    |
| 5 | 71  | 68,65,66,65 |
+---+-----+-------------+

1 Answer:

Answer 0 (score: 0)

Here is one way to do it:

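from pyspark.sql import functions as f
from pyspark.sql.window import Window

# window over the whole DataFrame (no partitioning), ordered by id
win = Window.partitionBy().orderBy(f.col('id'))

# step 1: value of the previous row (null on the first row)
df = df.withColumn('prev_value', f.lag('value').over(win))
# step 2: running list of the lagged values; collect_list skips nulls.
# Done in two steps so that no window function is nested inside another.
df = df.withColumn('prev_value', f.collect_list('prev_value').over(win))

# now define a udf that joins each list, most recent value first;
# an empty list becomes the literal string 'null'
concat = f.udf(lambda x: 'null' if len(x)==0 else ','.join([str(elt) for elt in x[::-1]]))
df = df.withColumn('prev_value', concat('prev_value'))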
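
Since the function wrapped by f.udf is ordinary Python, the joining step can be checked without a Spark session. The following is only a sketch: join_reversed is a hypothetical name for the same logic as the lambda above, and lagged_lists is an assumed stand-in for the lists that collect_list over the lagged values should produce for the five sample rows.

```python
def join_reversed(values):
    """Same logic as the lambda passed to f.udf above (hypothetical name)."""
    if len(values) == 0:
        return 'null'  # literal string, as in the answer, not a real null
    return ','.join(str(elt) for elt in values[::-1])


# Assumed per-row lists for ids 1..5: all earlier values in id order.
# The first row has no previous value, so its list is empty.
lagged_lists = [[], [65], [65, 66], [65, 66, 65], [65, 66, 65, 68]]

for row_id, values in enumerate(lagged_lists, start=1):
    print(row_id, join_reversed(values))
```

Note that the first row ends up holding the string 'null' rather than a real null value. If a real null is needed, returning None from the function instead should give that.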