How to get a WrappedArray out of a Map in a DataFrame

Time: 2016-11-28 10:43:24

Tags: scala apache-spark spark-dataframe udf

I have a DataFrame like this:

    +------+------------------------------------------------------------------------------+
    |myKeys|myMaps                                                                        |
    +------+------------------------------------------------------------------------------+
    |b     |Map(b -> WrappedArray([1,o], [4,xxx]), a -> WrappedArray([1,o], [1,n], [1,n]))|
    |a     |Map(b -> WrappedArray([1,o], [4,n]), a -> WrappedArray([4,c], [1,n], [1,n]))  |
    |a     |Map(b -> WrappedArray([4,o], [3,n]), a -> WrappedArray([4,o], [1,n], [1,n]))  |
    |b     |Map(b -> WrappedArray([4,a], [3,n]), a -> WrappedArray([1,o], [1,n], [1,n]))  |
    +------+------------------------------------------------------------------------------+

with this schema:

    root
     |-- myKeys: string (nullable = false)
     |-- myMaps: map (nullable = true)
     |    |-- key: string
     |    |-- value: array (valueContainsNull = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- _1: string (nullable = true)
     |    |    |    |-- _2: string (nullable = true)

Here is the code that creates it:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{lit, udf}
import scala.collection.mutable

// Note: testSchema is a helper that is not shown in the question.
val x = sc.parallelize(Seq(
      Array(("a", "1", "o"), ("a", "1", "n"), ("b", "1", "o"), ("a", "1", "n"), ("b", "4", "xxx")),
      Array(("a", "1", "o"), ("a", "1", "n"), ("b", "1", "o"), ("a", "1", "n"), ("b", "4", "n")),
      Array(("a", "1", "o"), ("a", "1", "n"), ("b", "4", "o"), ("a", "1", "n"), ("b", "3", "n")),
      Array(("a", "1", "o"), ("a", "1", "n"), ("b", "4", "o"), ("a", "1", "n"), ("b", "3", "n"))
    )).map(x => testSchema(x)).toDF("myArrays")


val y = x.withColumn("myKeys", lit("b"))

val getMap = udf((mouvements: mutable.WrappedArray[Row]) => {
  val test = mouvements.toArray
    .map(line => (line(0).toString, line(1).toString, line(2).toString))
    .groupBy(_._1)
    .map{case (k,values) => k -> values.map(x => (x._2, x._3))}
  test})


val df_with_map = y.select($"myKeys", getMap($"myArrays") as "myMaps")
df_with_map.show(false)
df_with_map.printSchema()
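
For reference, the grouping that getMap performs on each row can be reproduced on plain Scala collections. This is a minimal sketch, not the asker's code: it skips Spark and the unshown testSchema helper, and the names row and grouped are made up for illustration.

```scala
// One input row: an array of (key, first, second) triples, as in the question.
val row = Array(
  ("a", "1", "o"), ("a", "1", "n"), ("b", "1", "o"), ("a", "1", "n"), ("b", "4", "xxx")
)

// Group the triples by their key, then keep only the (first, second) pairs.
// groupBy preserves the original order of the elements inside each group.
val grouped: Map[String, List[(String, String)]] =
  row.groupBy(_._1).map { case (k, values) =>
    k -> values.map(x => (x._2, x._3)).toList
  }

grouped.foreach { case (k, v) => println(s"$k -> $v") }
```

The result has the same shape as the first row of the table above: b maps to [1,o], [4,xxx] and a maps to [1,o], [1,n], [1,n].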

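Now I want to get the second element of the array entries whose first element equals 4, looking in the map under the key given by myKeys (for example b). I should get a result like this: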

+---+
|val|
+---+
|xxx|
|c  |
|o  |
|a  |
+---+

I tried to do this with a UDF:

val getMyValue = udf{(myKey: String, myMaps:  Map[String, WrappedArray[Row]]) =>

  val first_val= "4"
  val myArrays = myMaps.get(myKey)
  val res = myArrays.get.toArray.filter{x => x.getString(0) == first_val}
  res
}

val df_value = df_with_map.select(getMyValue($"myKeys",$"myMaps") as "myValue")
df_value.show(false)
df_value.printSchema()

But it fails with this error:

java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported

at the line:

 val getMyValue = udf{(myKey: String, myMaps:  Map[String, WrappedArray[Row]]) =>

Do you have any ideas?

1 answer:

Answer 0: (score: 4)

Use:

val first_val = "4"
val df = Seq(
  ("b", Map("b" -> Seq(("1", "o"), ("4", "xxx"))))
).toDF("myKeys", "myMaps")

root
 |-- myKeys: string (nullable = true)
 |-- myMaps: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _1: string (nullable = true)
 |    |    |    |-- _2: string (nullable = true)

df.select($"myMaps".getItem("b"))
  .as[Seq[(String, String)]]
  .flatMap(xs => xs.filter(_._1 == first_val).map(_._2))
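
The filter-and-map logic can be checked on plain Scala collections, without Spark. The sketch below is only an illustration: rows is a hand-written copy of the four rows in the question's table, and fixedKey and perRowKey are hypothetical names. It contrasts a lookup under the fixed key "b" with a lookup under each row's own myKeys value.

```scala
val first_val = "4"

// The four rows of the question's table, as (myKeys, myMaps) pairs.
val rows: Seq[(String, Map[String, Seq[(String, String)]])] = Seq(
  ("b", Map("b" -> Seq(("1", "o"), ("4", "xxx")), "a" -> Seq(("1", "o"), ("1", "n"), ("1", "n")))),
  ("a", Map("b" -> Seq(("1", "o"), ("4", "n")),   "a" -> Seq(("4", "c"), ("1", "n"), ("1", "n")))),
  ("a", Map("b" -> Seq(("4", "o"), ("3", "n")),   "a" -> Seq(("4", "o"), ("1", "n"), ("1", "n")))),
  ("b", Map("b" -> Seq(("4", "a"), ("3", "n")),   "a" -> Seq(("1", "o"), ("1", "n"), ("1", "n"))))
)

// Lookup under the fixed key "b", mirroring getItem("b").
val fixedKey: Seq[String] = rows.flatMap { case (_, m) =>
  m.getOrElse("b", Nil).filter(_._1 == first_val).map(_._2)
}

// Lookup under the key stored in each row's myKeys value.
val perRowKey: Seq[String] = rows.flatMap { case (k, m) =>
  m.getOrElse(k, Nil).filter(_._1 == first_val).map(_._2)
}

println(fixedKey)
println(perRowKey)
```

With the fixed key, the second row contributes n (from b -> [4,n]) where the question expects c; looking up each row's own key gives xxx, c, o, a, which matches the expected output.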

Edit:

df.as[(String, Map[String,Seq[(String, String)]])].flatMap {
  case (key, map) => 
    map.getOrElse(key, Seq[(String, String)]()).filter(_._1 == first_val).map(_._2)
}
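
If a UDF is still preferred, the error in the question can be avoided by changing what the UDF returns. The UDF there returns Array[Row], and Spark cannot derive a schema for Row as a return type. The sketch below is one possible variant, not part of the original answer: it assumes Spark 2.0 or later with a local SparkSession, and getMyValues and the two sample rows are made up for illustration. Struct elements are still received as Row, but the result is a List[String], which Spark can map to array<string>.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{explode, udf}

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("udf-sketch").getOrCreate()
import spark.implicits._

val first_val = "4"

// Two sample rows shaped like the question's data (illustrative only).
val df = Seq(
  ("b", Map("b" -> Seq(("1", "o"), ("4", "xxx")), "a" -> Seq(("1", "o"), ("1", "n")))),
  ("a", Map("b" -> Seq(("1", "o"), ("4", "n")), "a" -> Seq(("4", "c"), ("1", "n"))))
).toDF("myKeys", "myMaps")

// Row is accepted on the input side; the return type is List[String],
// so no schema for Row has to be derived.
val getMyValues = udf { (myKey: String, myMaps: Map[String, scala.collection.Seq[Row]]) =>
  myMaps.getOrElse(myKey, Seq.empty[Row])
    .filter(_.getString(0) == first_val)
    .map(_.getString(1))
    .toList
}

// explode turns each array element into its own row.
val result = df.select(explode(getMyValues($"myKeys", $"myMaps")) as "val")
result.show(false)
```

The typed Dataset version in the answer stays simpler; this variant only shows that the schema error comes from the UDF's return type rather than from its Row inputs.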