Replace a pyspark column based on another column

Time: 2019-03-21 14:27:43

Tags: pandas pyspark apache-spark-sql

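In the "data" DataFrame I have two columns, "time_stamp" and "hour". Wherever the "time_stamp" value is missing, I want to insert the value from the "hour" column. I don't want to create a new column, only fill in the missing values in "time_stamp".

What I want to do is convert this pandas code to pyspark: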

data['time_stamp'] = data.apply(lambda x: x['hour'] if pd.isna(x['time_stamp']) else x['time_stamp'], axis=1) 

1 answer:

Answer 0: (score: 1)

Something like this should work:

from pyspark.sql import functions as f

# Keep time_stamp where it is present; otherwise fall back to hour.
df = (df.withColumn('time_stamp',
 f.expr('case when time_stamp is null then hour else time_stamp end')))

Or, if you don't like SQL:

df = df.withColumn('time_stamp', f.when(f.col('time_stamp').isNull(), f.col('hour')).otherwise(f.col('time_stamp')))