Question

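In the DataFrame "data" I have two columns, "time_stamp" and "hour". Wherever the "time_stamp" value is missing, I want to insert the value from the "hour" column. I don't want to create a new column; I want to fill in the missing values in "time_stamp" itself.

What I'm trying to do is replace this pandas code with PySpark code: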

data['time_stamp'] = data.apply(lambda x: x['hour'] if pd.isna(x['time_stamp']) else x['time_stamp'], axis=1)

Answer 1

Something like this should work:

from pyspark.sql import functions as f

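# Keep time_stamp where present; otherwise fall back to hour.
# A SQL CASE expression must be closed with END, and the column is named time_stamp.
df = df.withColumn('time_stamp',
    f.expr('case when time_stamp is null then hour else time_stamp end'))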

Or, if you don't like SQL:

# .otherwise() must be chained on the f.when(...) column expression, inside withColumn
df = df.withColumn('time_stamp', f.when(f.col('time_stamp').isNull(), f.col('hour')).otherwise(f.col('time_stamp')))

Replace a PySpark column based on another column
