Question

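I created a UDF, but I need to call another function from inside it. It currently returns null values. Can someone explain why I am getting this result?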

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

a = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

def get_number(num):
    return range(num)

def cate(label):
    if label == 20:
        counting_list = get_number(4)
        return counting_list
    else:
        return [0]

udf_score = udf(cate, ArrayType(FloatType()))
a.withColumn("category_list", udf_score(a["distances"])).show(10)

Output:

+------+---------+--------------------+
|Letter|distances|       category_list|
+------+---------+--------------------+
|     A|       20|[null, null, null...|
|     B|       30|              [null]|
|     D|       80|              [null]|
+------+---------+--------------------+

Answer 1

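The return type declared for the UDF is wrong: cate returns an array of integers, not floats, so Spark fills the array with nulls. Change this: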

udf_score=udf(cate, ArrayType(FloatType()))

to:

udf_score=udf(cate, ArrayType(IntegerType()))

Hope this helps!

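Edit: This assumes Python 2.x for range. As @Shane Halloran mentioned in the comments, range behaves differently in Python 3.x: it returns a lazy range object rather than a list, so wrap the result in list(...) before returning it from the UDF.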
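For Python 3.x, the two points above (matching the declared element type and materializing the range) could be combined roughly as follows. This is only a sketch: add_category_list is a hypothetical helper name that is not part of the original question, and it assumes the DataFrame has an integer "distances" column as in the example.

```python
def get_number(num):
    # In Python 3, range() is lazy; list() materializes it so the UDF
    # returns a real list of ints.
    return list(range(num))


def cate(label):
    # Same logic as in the question: four entries when the distance is 20,
    # a single 0 otherwise.
    if label == 20:
        return get_number(4)
    return [0]


def add_category_list(df):
    # Hypothetical helper (not from the original post). pyspark is imported
    # here so the two plain functions above can be checked without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    # The declared element type must match what cate() actually returns.
    udf_score = udf(cate, ArrayType(IntegerType()))
    return df.withColumn("category_list", udf_score(df["distances"]))
```

With the DataFrame from the question, add_category_list(a).show(10) should then display [0, 1, 2, 3] for row A and [0] for rows B and D. If floats are what you actually want, the other way to make the types agree would be to keep ArrayType(FloatType()) and return Python floats, for example [float(x) for x in range(num)].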
