Question

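I am trying to build a kd-tree using PySpark. To do this, I am using a UDF that recursively builds the kd-tree from a list of 2D float points.

Here is the code snippet I am trying: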

    from pyspark.sql import SparkSession
    from pyspark.sql import Row
    from pyspark.sql.functions import udf
    from pyspark.sql.types import *

    spark = SparkSession.builder.appName("SRDD").getOrCreate()
    sc = spark.sparkContext

    # Some sequence of floats
    abc = [[0.0769,0.2982],[0.0863,0.30052],[0.0690,0.33337],[0.11975,0.2984],[0.07224,0.3467],[0.1316,0.2999]]

    def build_kdtree(points,depth=0):
       n=points.count()
       if n<=0:
          return None
       axis=depth%2
       sorted_points=sorted(points,key=lambda point:point[axis])
       return{
         'point': sorted_points[n/2],
         'left':build_kdtree(sorted_points[:n/2],depth+1),
         'right':build_kdtree(sorted_points[n/2 + 1:],depth+1)
        }
    #This is how I'm trying to specify the return type of the function
    kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',StructType(),nullable=True),StructField('right',StructType(),nullable=True)])
    kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',kdtree_schema,nullable=True),StructField('right',kdtree_schema,nullable=True)])
    #UDF registration
    buildkdtree_udf=udf(build_kdtree, kdtree_schema)

    #Function call
    pointskdtree=buildkdtree_udf(abc)

However, this raises TypeError: Invalid argument, not a string or column.
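For context, this TypeError is what PySpark raises when a function wrapped with udf is called with a plain Python list: the wrapped function expects Column objects or column names as arguments. Independently of Spark, the recursion itself has two problems on Python 3: list.count requires an argument, so len is needed, and n/2 yields a float that cannot be used as an index. A minimal plain-Python sketch of the intended recursion, assuming nested dicts are an acceptable tree representation (the name mid is introduced here for illustration):

```python
def build_kdtree(points, depth=0):
    """Recursively build a 2-d kd-tree as nested dicts (plain Python, no Spark)."""
    n = len(points)  # len(), since list.count() needs a value to count
    if n == 0:
        return None
    axis = depth % 2  # alternate between x (0) and y (1) at each level
    sorted_points = sorted(points, key=lambda point: point[axis])
    mid = n // 2  # integer division so the result is a valid list index
    return {
        'point': sorted_points[mid],
        'left': build_kdtree(sorted_points[:mid], depth + 1),
        'right': build_kdtree(sorted_points[mid + 1:], depth + 1),
    }


abc = [[0.0769, 0.2982], [0.0863, 0.30052], [0.0690, 0.33337],
       [0.11975, 0.2984], [0.07224, 0.3467], [0.1316, 0.2999]]

tree = build_kdtree(abc)
print(tree['point'])  # the median point along the x axis becomes the root
```

This only checks the recursion locally; it does not address how the function is invoked from Spark.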

I have two main questions:

Is my approach to building the kd-tree recursively in Spark correct?
Is specifying the UDF's return type as kdtree_schema the right way to do it?
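On the second question, a Spark SQL schema has to be a fixed, finite type, and a StructType cannot refer to itself, so the kdtree_schema above only describes a tree of a fixed depth rather than an arbitrarily deep one. One possible workaround, offered here as an assumption and not as an established recipe, is to serialize the tree to a JSON string and declare the UDF's return type as StringType. The sketch below shows only the plain-Python part; build_kdtree_json is a hypothetical helper name.

```python
import json


def build_kdtree(points, depth=0):
    """Recursively build a 2-d kd-tree as nested dicts."""
    n = len(points)
    if n == 0:
        return None
    axis = depth % 2
    sorted_points = sorted(points, key=lambda point: point[axis])
    mid = n // 2
    return {
        'point': sorted_points[mid],
        'left': build_kdtree(sorted_points[:mid], depth + 1),
        'right': build_kdtree(sorted_points[mid + 1:], depth + 1),
    }


def build_kdtree_json(points):
    # Hypothetical helper: flattens the recursive structure into one string,
    # so the declared return type would be a plain string, not a nested struct.
    return json.dumps(build_kdtree(points))


points = [[0.0769, 0.2982], [0.0863, 0.30052], [0.0690, 0.33337]]
encoded = build_kdtree_json(points)
decoded = json.loads(encoded)  # the nested tree is recovered on the reader's side
print(decoded['point'])
```

Under that assumption, the helper could be wrapped as udf(build_kdtree_json, StringType()) and applied to an array column of a DataFrame instead of a Python list. Note that each UDF call processes a single row on one executor, so the tree construction itself would not be distributed by Spark.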

kd-tree implementation in PySpark

0 answers: