Calculating the difference between dates of different event types in pyspark

Posted: 2016-06-06 21:00:12

Tags: count pyspark pyspark-sql

I am trying to compute a date difference and a count difference between consecutive events in pyspark.

The data looks like this:

deviceid  techid name count   load_date
m1          1     a    30    23-01-2016
m2          1     b    40    23-01-2016
m1          1     a    45    29-01-2016
m1          2     a    50    30-01-2016

I want it to look like this:

deviceid  techid name count   load_date   datediff  countdiff
m1          1     a    30    23-01-2016    NA         NA
m2          1     b    40    23-01-2016    NA         NA 
m1          1     a    45    29-01-2016    6          15
m1          2     a    50    30-01-2016    NA         NA

How can I create columns with these values in pyspark, where the dates are taken relative to the previous event whenever the event keys (deviceid, techid) match?

1 Answer:

Answer 0 (score: 0)

This can be solved using window functions.

(1) Create a DataFrame with the sample test data

df = spark.createDataFrame(
    [('m1', 1, 'a', 30, '23-01-2016'),
     ('m2', 1, 'b', 40, '23-01-2016'),
     ('m1', 1, 'a', 45, '29-01-2016'),
     ('m1', 2, 'a', 50, '30-01-2016')],
    ['deviceid', 'techid', 'name', 'count', 'load_date'])

df1 = df.selectExpr("deviceid","techid","name","count","to_timestamp(load_date, 'dd-MM-yyyy') AS load_date")

(2) Define the window, and use the lag function to build previous-count and previous-load-date columns

from pyspark.sql.window import Window
from pyspark.sql.functions import lag

windowSpec = Window.partitionBy('deviceid','techid').orderBy('load_date')
prev_count = lag('count').over(windowSpec).alias('prev_count')
prev_load_date = lag('load_date').over(windowSpec).alias('prev_load_date')

df2 = df1.withColumn("prev_count", prev_count) \
    .withColumn("prev_load_date", prev_load_date)

(3) Subtract the previous-value columns from the current columns to compute the differences.

df2.selectExpr("deviceid",
               "techid",
               "name",
               "count",
               "load_date",
               "datediff(load_date,prev_load_date) AS datediff",
               "(count - prev_count) AS countdiff")\
    .show()

#+--------+------+----+-----+-------------------+--------+---------+
#|deviceid|techid|name|count|          load_date|datediff|countdiff|
#+--------+------+----+-----+-------------------+--------+---------+
#|      m1|     1|   a|   30|2016-01-23 00:00:00|    null|     null|
#|      m1|     1|   a|   45|2016-01-29 00:00:00|       6|       15|
#|      m1|     2|   a|   50|2016-01-30 00:00:00|    null|     null|
#|      m2|     1|   b|   40|2016-01-23 00:00:00|    null|     null|
#+--------+------+----+-----+-------------------+--------+---------+