Question

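I have a DataFrame with two columns: BrandWatchErwaehnungID and word_counts. The word_counts column is the output of CountVectorizer (a sparse vector). After dropping the empty rows, I created two new columns, one holding the indices of the sparse vector and the other holding its values.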

help0 = countedwords_text['BrandWatchErwaehnungID','word_counts'].rdd\
    .filter(lambda x : x[1].indices.size!=0)\
    .map(lambda x : (x[0],x[1],DenseVector(x[1].indices) , DenseVector(x[1].values))).toDF()\
    .withColumnRenamed("_1", "BrandWatchErwaenungID").withColumnRenamed("_2", "word_counts")\
    .withColumnRenamed("_3", "word_indices").withColumnRenamed("_4", "single_word_counts")

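I need to convert them to dense vectors before adding them to my DataFrame, because Spark does not accept numpy.ndarray. My problem is that I now want to explode the DataFrame on the word_indices column, but the explode function from pyspark.sql.functions only supports arrays or maps as input.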

I tried:

help1 = help0.withColumn('b' , explode(help0.word_indices))

and got the following error:

cannot resolve 'explode(`word_indices`)' due to data type mismatch: input to function explode should be array or map type

After that I tried:

help1 = help0.withColumn('b' , explode(help0.word_indices.toArray()))

which didn't work either. Any suggestions?

Answer 1

You have to use a udf:

from pyspark.sql.functions import udf, explode
from pyspark.ml.linalg import DenseVector, SparseVector

@udf("array<integer>")
def indices(v):
   if isinstance(v, DenseVector):
      return list(range(len(v)))
   if isinstance(v, SparseVector):
      return v.indices.tolist()

df = spark.createDataFrame([
   (1, DenseVector([1, 2, 3])), (2, SparseVector(5, {4: 42}))], 
   ("id", "v"))

df.select("id", explode(indices("v"))).show()

# +---+---+
# | id|col|
# +---+---+
# |  1|  0|
# |  1|  1|
# |  1|  2|
# |  2|  4|
# +---+---+

Explode column with dense vectors into multiple rows
