Spark DataFrame: add a new column based on other columns

Asked: 2020-07-09 10:01:24

Tags: python apache-spark

I want to add a new column new_col: if the value of column a is in yes_list, then new_col should be 1, otherwise 0.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([{"a": 'y'}, {"a": 'y', "b": 2}, {"a": 'n', "c": 3}])
rdd_df = spark.createDataFrame(rdd)  # infer the schema from the dicts

yes_list = ['y']

Something like this:

rdd_df.withColumn("new_col", [1 if val in yes_list else 0 for val in rdd_df["a"]])

But the above is not valid and raises an error:

TypeError: Column is not iterable

How can I achieve this?

1 Answer:

Answer 0 (score: 0)

You can use the when and isin functions of the Spark SQL API. It will look like this:

from pyspark.sql import functions
rdd_df.withColumn("new_col", functions.when(rdd_df['a'].isin(yes_list), 1).otherwise(0)).show()
+---+----+----+-------+                                                         
|  a|   b|   c|new_col|
+---+----+----+-------+
|  y|null|null|      1|
|  y|   2|null|      1|
|  n|null|   3|      0|
+---+----+----+-------+