Spark DataFrame: add a new column based on other columns

Asked: 2020-07-09 10:01:24

Tags: python apache-spark

I want to add a new column new_col: if the value of column a is in yes_list, then new_col should be 1, otherwise 0.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([{"a": 'y'}, {"a": 'y', "b": 2}, {"a": 'n', "c": 3}])
rdd_df = spark.createDataFrame(rdd)  # infer the schema from the dicts

yes_list = ['y']

Something like this:

rdd_df.withColumn("new_col", [1 if val in yes_list else 0 for val in rdd_df["a"]])

But the above is not valid and raises an error:

TypeError: Column is not iterable

How can I achieve this?

1 Answer:

Answer 0 (score: 0)

You can use the when and isin functions of the Spark SQL API. It will look like this:

from pyspark.sql import functions
rdd_df.withColumn("new_col", functions.when(rdd_df['a'].isin(yes_list), 1).otherwise(0)).show()
+---+----+----+-------+                                                         
|  a|   b|   c|new_col|
+---+----+----+-------+
|  y|null|null|      1|
|  y|   2|null|      1|
|  n|null|   3|      0|
+---+----+----+-------+