将稀疏特征向量分解为单独的列

时间:2018-01-30 14:43:15

标签: scala apache-spark apache-spark-sql apache-spark-mllib apache-spark-ml

在我的spark DataFrame中,我有一个包含CountVectoriser转换输出的列 - 它是稀疏矢量格式。我想要做的是将这个列再次“爆炸”成一个密集的向量,然后是它的组件行(这样它就可以用于外部模型的评分)。

我知道该列中有40个功能,因此按照this示例,我尝试过:

import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.Vector

// convert sparse vector to a dense vector, and then to array<double> 
val vecToSeq = udf((v: Vector) => v.toArray)

// Prepare a list of columns to create
val exprs = (0 until 39).map(i => $"_tmp".getItem(i).alias(s"exploded_col$i"))
testDF.select(vecToSeq($"features").alias("_tmp")).select(exprs:_*)

然而,我得到了奇怪的错误(见下面的完整错误):

data type mismatch: argument 1 requires vector type, however, 'features' is of vector type.;

现在似乎可能是CountVectoriser创建了一个'ml.linalg.Vector'类型的向量,所以我还是尝试了导入:

import org.apache.spark.ml.linalg.{Vector, DenseVector, SparseVector}

然后我收到错误引起:

Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.sql.Row

我还尝试通过将UDF更改为:

来转换ml矢量
val vecToSeq = udf((v: Vector) =>  org.apache.spark.mllib.linalg.Vectors.fromML(v.toDense).toArray )

并收到类似的cannot be cast to org.apache.spark.sql.Row错误。谁能告诉我为什么这不起作用?有没有更简单的方法将DataFrame中的稀疏矢量分解为sperate列?我花了好几个小时就想不出来了。

编辑:架构将功能列显示为向量:

  |-- features: vector (nullable = true)

完整错误跟踪:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features)' due to data type mismatch: argument 1 requires vector type, however, 'features' is of vector type.;;
Project [UDF(features#325) AS _tmp#463]
. . . 
org.apache.spark.sql.cassandra.CassandraSourceRelation@47eae91d

        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
        at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
        at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
        at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
        at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2872)
        at org.apache.spark.sql.Dataset.select(Dataset.scala:1153)
        at uk.nominet.renewals.prediction_test$.prediction_test(prediction_test.scala:292)
        at 

2 个答案:

答案 0 :(得分:3)

在处理此类案件时,我经常会逐步分解,以了解问题的来源。

首先,让我们设置一个数据帧:

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.linalg.Vector
val df=sc.parallelize(Seq((1L, Seq("word1", "word2")))).toDF("id", "words")
val countModel = new CountVectorizer().setInputCol("words").setOutputCol("feature").fit(df)
val testDF = countModel.transform(df)
testDF.show

+---+--------------+-------------------+
| id|         words|            feature|
+---+--------------+-------------------+
|  1|[word1, word2]|(2,[0,1],[1.0,1.0])|
+---+--------------+-------------------+

现在,我想要选择,比如第一列特征,也就是说,提取feature向量的第一个坐标。

可以这样写:v(0)。 现在我希望我的数据框有一个包含v(0)的列,v列是feature列的内容。我可以使用用户定义函数:

val firstColumnExtractor = udf((v: Vector) => v(0))

我尝试将此列添加到testDF

testDF.withColumn("feature_0", firstColumnExtractor($"feature")).show
+---+--------------+-------------------+---------+                              
| id|         words|            feature|feature_0|
+---+--------------+-------------------+---------+
|  1|[word1, word2]|(2,[0,1],[1.0,1.0])|      1.0|
+---+--------------+-------------------+---------+

请注意,我也可以这样做(这只是风格问题,据我所知):

testDF.select(firstColumnExtractor($"feature").as("feature_0")).show

这样可行,但这需要重复很多工作。让我们自动化。 首先,我可以将提取函数推广到任何索引。让我们创建一个更高阶的函数(一个创建函数的函数)

def columnExtractor(idx: Int) = udf((v: Vector) => v(idx))

现在,我可以重写上一个例子:

testDF.withColumn("feature_0", columnExtractor(0)($"feature")).show

好的,现在我可以这样做:

testDF.withColumn("feature_0", columnExtractor(0)($"feature"))
      .withColumn("feature_1", columnExtractor(1)($"feature"))

适用于1,但39个维度呢?好吧,让我们再自动化一些。以上内容实际上是每个维度的foldLeft操作:

(0 to 39).foldLeft(testDF)((df, idx) => df.withColumn("feature_"+idx, columnExtractor(idx)($"feature")))

这是使用多个选择编写函数的另一种方法

val featureCols = (0 to 1).map(idx => columnExtractor(idx)($"feature").as("feature_"+idx))
testDF.select((col("*") +: featureCols):_*).show
+---+--------------+-------------------+---------+---------+
| id|         words|            feature|feature_0|feature_1|
+---+--------------+-------------------+---------+---------+
|  1|[word1, word2]|(2,[0,1],[1.0,1.0])|      1.0|      1.0|
+---+--------------+-------------------+---------+---------+

现在,出于性能原因,您可能希望将基础Vector转换为坐标数组(或DenseVector)。随意这样做。我觉得DenseVectorArray可能会在表现方面非常接近,所以我会这样写:

// A function to densify the feature vector
val toDense = udf((v:Vector) => v.toDense)
// Replase testDF's feature column with its dense equivalent
val denseDF = testDF.withColumn("feature", toDense($"feature"))
// Work on denseDF as we did on testDF 
denseDF.select((col("*") +: featureCols):_*).show

答案 1 :(得分:1)

您的import语句似乎存在问题。如您所见,ml将使用mllib包向量,因此,所有向量导入也应使用此包。确保使用较旧的import org.apache.spark.mllib.linalg.Vector import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.linalg.SparseVector import org.apache.spark.mllib.linalg.DenseVector 没有任何导入。这包括:

mllib

ml包中仅存在一些方法,因此在您实际需要使用此类型的向量的情况下,您可以重命名它们(因为名称与import org.apache.spark.mllib.linalg.{Vector => mllibVector} 相同否则)。例如:

val df = Seq((1L, Seq("word1", "word2", "word3")), (2L, Seq("word2", "word4"))).toDF("id", "words")
val countVec = new CountVectorizer().setInputCol("words").setOutputCol("features")
val testDF = countVec.fit(df).transform(df)

修复所有导入后,您的代码应该运行。测试:

+---+--------------------+--------------------+
| id|               words|            features|
+---+--------------------+--------------------+
|  1|[word1, word2, wo...|(4,[0,2,3],[1.0,1...|
|  2|      [word2, word4]| (4,[0,1],[1.0,1.0])|
+---+--------------------+--------------------+

将提供如下测试数据框:

val vecToSeq = udf((v: Vector) => v.toArray)

val exprs = (0 until 4).map(i => $"features".getItem(i).alias(s"exploded_col$i"))
val df2 = testDF.withColumn("features", vecToSeq($"features")).select(exprs:_*)

现在给每个索引它自己的列:

+-------------+-------------+-------------+-------------+
|exploded_col0|exploded_col1|exploded_col2|exploded_col3|
+-------------+-------------+-------------+-------------+
|          1.0|          0.0|          1.0|          1.0|
|          1.0|          1.0|          0.0|          0.0|
+-------------+-------------+-------------+-------------+

结果dataFfame:

<div class="my-3 text-center">
              <span class="badge badge-primary">HTML</span>
              <span class="badge badge-success">CSS</span>
          <span class="badge badge-secondary">Javascript</span><span class="badge badge-info">React</span></div>