How do I flatten a column of array-of-struct type (as returned by the Spark ML API)?

Time: 2017-10-13 18:32:26

Tags: apache-spark apache-spark-sql apache-spark-ml

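Maybe it's just because I'm relatively new to the API, but I feel like Spark ML methods often return DFs that are unnecessarily hard to work with.

This time, it's the ALS model that is tripping me up. Specifically, the recommendForAllUsers method. Let's reconstruct the type of the DF it returns: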

scala> import org.apache.spark.sql.types._

scala> val arrayType = ArrayType(new StructType().add("itemId", IntegerType).add("rating", FloatType))

scala> val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
  toDF("userId", "recommendations").
  select($"userId", $"recommendations".cast(arrayType))

scala> recs.show()
+------+------------------+
|userId|   recommendations|
+------+------------------+
|     1|[[1,0.7], [2,0.5]]|
|     2|[[0,0.9], [4,0.1]]|
+------+------------------+

scala> recs.printSchema
root
 |-- userId: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- itemId: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

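Now, I only care about the itemId in the recommendations column. After all, the method is recommendForAllUsers, not recommendAndScoreForAllUsers (OK, I'll stop being sassy...)

How do I do this?

I thought I had it when I created a UDF: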

scala> val itemIds = udf((arr: Array[(Int, Float)]) => arr.map(_._1))

But that produces an error:

scala> recs.withColumn("items", itemIds($"recommendations"))
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(recommendations)' due to data type mismatch: argument 1 requires array<struct<_1:int,_2:float>> type, however, '`recommendations`' is of array<struct<itemId:int,rating:float>> type.;;
'Project [userId#87, recommendations#92, UDF(recommendations#92) AS items#238]
+- Project [userId#87, cast(recommendations#88 as array<struct<itemId:int,rating:float>>) AS recommendations#92]
   +- Project [_1#84 AS userId#87, _2#85 AS recommendations#88]
      +- LocalRelation [_1#84, _2#85]

Any ideas? Thanks!

2 answers:

Answer 0: (score: 5)

Wow, my coworker came up with an extremely elegant solution:

[code snippet missing from the source]
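One way to get only the item IDs, which may or may not be what this answer originally showed, is to select the struct field straight through the array column. Spark SQL resolves a path like recommendations.itemId on an array of structs to an array of that field's values. The sketch below assumes Spark 2.x or later running in local mode; the object name NestedFieldSelect and the helper itemIdsPerUser are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object NestedFieldSelect {
  // Hypothetical helper: keeps userId and extracts itemId from every struct
  // in the recommendations array, producing a column of type array<int>.
  def itemIdsPerUser(recs: DataFrame): DataFrame =
    recs.select(col("userId"), col("recommendations.itemId"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-field-select")
      .getOrCreate()
    import spark.implicits._

    // Same sample data and cast as in the question.
    val arrayType = ArrayType(
      new StructType().add("itemId", IntegerType).add("rating", FloatType))
    val recs = Seq((1, Array((1, 0.7), (2, 0.5))), (2, Array((0, 0.9), (4, 0.1))))
      .toDF("userId", "recommendations")
      .select($"userId", $"recommendations".cast(arrayType))

    itemIdsPerUser(recs).show()
    spark.stop()
  }
}
```

This keeps one row per user and needs neither a UDF nor an explode.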

So maybe the Spark ML API isn't so difficult after all :)

Answer 1: (score: 3)

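With an array as the type of a column, e.g. recommendations, you'd be quite productive using the explode function (or the more advanced flatMap operator).

explode(e: Column): Column  Creates a new row for each element in the given array or map column.

That gives you bare structs to work with.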

import org.apache.spark.sql.types._
val structType = new StructType().
  add($"itemId".int).
  add($"rating".float)
val arrayType = ArrayType(structType)
val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
  toDF("userId", "recommendations").
  select($"userId", $"recommendations" cast arrayType)

val exploded = recs.withColumn("recs", explode($"recommendations"))
scala> exploded.show
+------+------------------+-------+
|userId|   recommendations|   recs|
+------+------------------+-------+
|     1|[[1,0.7], [2,0.5]]|[1,0.7]|
|     1|[[1,0.7], [2,0.5]]|[2,0.5]|
|     2|[[0,0.9], [4,0.1]]|[0,0.9]|
|     2|[[0,0.9], [4,0.1]]|[4,0.1]|
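+------+------------------+-------+

Structs work nicely with the * (star) operator of select, which flattens them into one column per struct field.

You can use select($"element.*"), where element stands for the name of the struct column (recs in this example):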

scala> exploded.select("userId", "recs.*").show
+------+------+------+
|userId|itemId|rating|
+------+------+------+
|     1|     1|   0.7|
|     1|     2|   0.5|
|     2|     0|   0.9|
|     2|     4|   0.1|
+------+------+------+

I think this does what you're after.
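The flatMap route mentioned at the start of this answer can be sketched as follows, assuming Spark 2.x or later in local mode. The case classes Rec, UserRecs and UserItem and the object FlatMapRecs are hypothetical names chosen so that their fields match the schema in the question.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes; field names mirror the DataFrame schema.
case class Rec(itemId: Int, rating: Float)
case class UserRecs(userId: Int, recommendations: Seq[Rec])
case class UserItem(userId: Int, itemId: Int)

object FlatMapRecs {
  // Emits one (userId, itemId) pair per recommendation and drops the rating.
  def userItems(recs: Dataset[UserRecs]): Dataset[UserItem] = {
    import recs.sparkSession.implicits._
    recs.flatMap(u => u.recommendations.map(r => UserItem(u.userId, r.itemId)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatmap-recs")
      .getOrCreate()
    import spark.implicits._

    val recs = Seq(
      UserRecs(1, Seq(Rec(1, 0.7f), Rec(2, 0.5f))),
      UserRecs(2, Seq(Rec(0, 0.9f), Rec(4, 0.1f)))
    ).toDS()

    userItems(recs).show()
    spark.stop()
  }
}
```

With the recs DataFrame built earlier, recs.as[UserRecs] should give the typed Dataset, since the case class fields match the column and struct field names. Typed operators such as flatMap deserialize every row into JVM objects, so explode is usually the cheaper choice when it is enough.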

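P.S. Stay away from UDFs as long as possible, since they "trigger" row conversion from the internal format (InternalRow) to JVM objects, which can lead to excessive GC.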
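For completeness, the UDF in the question fails because a Scala tuple parameter makes Spark expect struct fields named _1 and _2, while the column's structs use itemId and rating. If a UDF really is needed, structs inside an array reach a Scala UDF as Row objects, so the parameter can be declared as Seq[Row]. This is a sketch assuming Spark 2.x or later in local mode; the object name RowUdf is made up for illustration, and the field-selection and explode approaches above remain preferable.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

object RowUdf {
  // Each array element arrives as a Row; position 0 is itemId in this schema.
  val itemIds = udf((arr: Seq[Row]) => arr.map(_.getInt(0)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-udf")
      .getOrCreate()
    import spark.implicits._

    // Same sample data and cast as in the question.
    val arrayType = ArrayType(
      new StructType().add("itemId", IntegerType).add("rating", FloatType))
    val recs = Seq((1, Array((1, 0.7), (2, 0.5))), (2, Array((0, 0.9), (4, 0.1))))
      .toDF("userId", "recommendations")
      .select($"userId", $"recommendations".cast(arrayType))

    recs.withColumn("items", itemIds(col("recommendations"))).show()
    spark.stop()
  }
}
```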