Implementing a lag function by updating the same column

Time: 2019-03-01 07:39:16

Tags: tsql pyspark pyspark-sql lag lead

I have to update barcode with the lagged value (offset = 1) of barcode:

case 
  when ( lag(barcode,1) over (order by barcode ) 
        and  Datediff(SS, eventdate,lag(next_eventdate,1) over (order by barcode)) < 3*3600 ) 
  THEN 1 
  ELSE 0 
END as FLAG 

I have implemented this in PySpark, but it gives me an error:

from pyspark.sql.functions import col, unix_timestamp
timeDiff = unix_timestamp('eventdate', format="ss")- unix_timestamp(F.lag('next_eventdate', 1), format="ss")
ww= Window.orderBy("barcode") 
Tgt_df_tos = Tgt_df_7.withColumn('FLAG',F.when((F.lag('barcode', 1)) & ( timeDiff <= 10800),"1").otherwise('0'))   

The error I get is:

AnalysisException: "cannot resolve '(lag(`barcode`, 1, NULL) AND ((unix_timestamp(`eventdate`, 'ss') - unix_timestamp(lag(`next_eventdate`, 1, NULL), 'ss')) <= CAST(10800 AS BIGINT)))' due to data type mismatch: differing types in '(lag(`barcode`, 1, NULL) AND ((unix_timestamp(`eventdate`, 'ss') - unix_timestamp(lag(`next_eventdate`, 1, NULL), 'ss')) <= CAST(10800 AS BIGINT)))' (int and boolean).

1 answer:

Answer 0 (score: 1)

I'm not familiar with PySpark, but it looks to me like the problem is in the CASE statement.

CASE WHEN (
        LAG(barcode,1) OVER (ORDER BY barcode ) 
    AND
        DATEDIFF(SS, eventdate, LAG(next_eventdate, 1) OVER(ORDER BY barcode)) < 3*3600
)

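There are two expressions here. The first, "LAG(barcode, 1) OVER (ORDER BY barcode)", evaluates to an integer.

The second, "DATEDIFF(SS, eventdate, LAG(next_eventdate, 1) OVER (ORDER BY barcode)) < 3*3600", evaluates to a boolean (because of the inequality).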

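These two expressions are combined with the AND operator, which is normally used to combine two boolean expressions. I believe that is what causes the error.

LAG(barcode, 1) OVER (ORDER BY barcode) evaluates to an INTEGER, not a boolean.

So the expression ends up looking like this: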

CASE WHEN (324857 AND True) THEN 1 ELSE 0 END as FLAG

AnalysisException: "cannot resolve .... (int and boolean).
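
To make the point concrete, here is a minimal plain-Python sketch of the same logic with both sides of the AND turned into booleans. It is not the asker's actual code: the sample rows, the helper name add_flag, the secondary sort on eventdate, and above all the guess that the intended test is "the previous row's barcode equals the current one" are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical sample rows; the column names follow the question.
rows = [
    {"barcode": 100, "eventdate": datetime(2019, 3, 1, 8, 0),
     "next_eventdate": datetime(2019, 3, 1, 9, 0)},
    {"barcode": 100, "eventdate": datetime(2019, 3, 1, 9, 0),
     "next_eventdate": datetime(2019, 3, 1, 15, 0)},
    {"barcode": 200, "eventdate": datetime(2019, 3, 1, 10, 0),
     "next_eventdate": datetime(2019, 3, 1, 11, 0)},
]


def add_flag(rows, max_gap_seconds=3 * 3600):
    """Emulate LAG(..., 1) over sorted rows and compute FLAG as 1 or 0."""
    # Assumption: eventdate is added as a tie-breaker so the order is deterministic.
    ordered = sorted(rows, key=lambda r: (r["barcode"], r["eventdate"]))
    flagged = []
    prev = None  # plays the role of the lagged row; None mirrors LAG's NULL
    for row in ordered:
        if prev is None:
            flag = 0  # no previous row, so the condition cannot hold
        else:
            # Boolean operand 1 (assumed intent): previous barcode matches.
            same_barcode = prev["barcode"] == row["barcode"]
            # Boolean operand 2: DATEDIFF(SS, eventdate, lagged next_eventdate) < 3h.
            gap = (prev["next_eventdate"] - row["eventdate"]).total_seconds()
            flag = 1 if (same_barcode and gap < max_gap_seconds) else 0
        flagged.append({**row, "FLAG": flag})
        prev = row
    return flagged


flags = [r["FLAG"] for r in add_flag(rows)]
print(flags)
```

In PySpark the first operand would presumably become a comparison such as F.lag('barcode', 1).over(ww) == F.col('barcode'), or a null check with .isNotNull() if the intent is only "a previous row exists". Note also that each F.lag(...) call needs .over(ww) attached; the question's code defines the window ww but never applies it.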