Transforming flat data into nested objects with Spark Scala

Asked: 2018-09-16 06:24:07

Tags: scala apache-spark apache-spark-sql

I need help converting a flat dataset into a nested format using Apache Spark / Scala.

Is it possible to automatically create a nested structure derived from the input column namespaces, i.e. [level 1].[level 2]? In my example, the nesting level is determined by the period character '.' in the column headers.

I believe this can be achieved with a map function. I am open to alternative solutions, particularly if there is a more elegant way of achieving the same outcome.

package org.acme.au

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object testNestedObject extends App {

  // Configure spark
  val spark = SparkSession.builder()
    .appName("Spark batch demo")
    .master("local[*]")
    .config("spark.driver.host", "localhost")
    .getOrCreate()

  // Start spark
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")

  // Define schema for input data
  val flatSchema = new StructType()
    .add(StructField("id", StringType, false))
    .add(StructField("name", StringType, false))
    .add(StructField("custom_fields.fav_colour", StringType, true))
    .add(StructField("custom_fields.star_sign", StringType, true))

  // Create rows with dummy data
  val row1 = Row("123456", "John Citizen", "Blue", "Scorpio")
  val row2 = Row("990087", "Jane Simth", "Green", "Taurus")
  val flatData = Seq(row1, row2)

  // Convert into dataframe
  val dfIn = spark.createDataFrame(spark.sparkContext.parallelize(flatData), flatSchema)

  // Print to console
  dfIn.printSchema()
  dfIn.show()

  // Convert flat data into nested structure as either Parquet or JSON format
  val dfOut = dfIn.rdd
    .map(
      row => ( /* TODO: Need help mapping the flat data to a nested structure
           * derived from the input column namespaces.
           *
           * For example:
           *
           * <id>123456</id>
           * <name>John Citizen</name>
           * <custom_fields>
           *   <fav_colour>Blue</fav_colour>
           *   <star_sign>Scorpio</star_sign>
           * </custom_fields>
           */ ))

  // Stop spark
  sc.stop()

}
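
For reference, here is one way the TODO above could be completed using the RDD map approach suggested in the question: define a matching nested schema by hand and map each flat Row into a Row containing a nested Row. This is a minimal sketch (not part of the original question) reusing dfIn, spark, and the imports from the snippet above, with field positions hard-coded for this specific schema:

  // Target schema: custom_fields becomes a struct of its two sub-fields
  val nestedSchema = new StructType()
    .add(StructField("id", StringType, false))
    .add(StructField("name", StringType, false))
    .add(StructField("custom_fields", new StructType()
      .add(StructField("fav_colour", StringType, true))
      .add(StructField("star_sign", StringType, true)), true))

  // Wrap the two custom_fields values into a nested Row
  val nestedRdd = dfIn.rdd.map { row =>
    Row(row.getString(0), row.getString(1), Row(row.getString(2), row.getString(3)))
  }

  val dfNested = spark.createDataFrame(nestedRdd, nestedSchema)
  dfNested.printSchema()
  dfNested.show()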

3 Answers:

Answer 0 (score: 1)

This can be solved with a dedicated case class and a UDF that converts the input data into case class instances. For example:

Define the case class NestedFields:

case class NestedFields(fav_colour: String, star_sign: String)

Define the UDF that takes the original column values as input and returns an instance of NestedFields:

private val asNestedFields = udf((fc: String, ss: String) => NestedFields(fc, ss))

Transform the original DataFrame and drop the flattened columns:

val res = dfIn.withColumn("custom_fields", asNestedFields($"`custom_fields.fav_colour`", $"`custom_fields.star_sign`"))
              .drop($"`custom_fields.fav_colour`")
              .drop($"`custom_fields.star_sign`")

It produces:

root
|-- id: string (nullable = false)
|-- name: string (nullable = false)
|-- custom_fields: struct (nullable = true)
|    |-- fav_colour: string (nullable = true)
|    |-- star_sign: string (nullable = true)

+------+------------+---------------+
|    id|        name|  custom_fields|
+------+------------+---------------+
|123456|John Citizen|[Blue, Scorpio]|
|990087|  Jane Simth|[Green, Taurus]|
+------+------------+---------------+
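
Since the question asks for Parquet or JSON output, the result can then be persisted with the standard DataFrameWriter. A short sketch (not from the original answer), assuming import org.apache.spark.sql.functions.udf and spark.implicits._ are in scope; the output paths are illustrative:

// Inspect the nested rows as JSON strings
res.toJSON.show(false)

// Persist in either of the formats mentioned in the question
res.write.mode("overwrite").parquet("/tmp/nested_parquet")
res.write.mode("overwrite").json("/tmp/nested_json")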

Answer 1 (score: 1)

Here's a generalized solution that first assembles a Map of the column names that contain ".", traverses the Map to add converted struct columns to the DataFrame, and finally drops the flat columns using foldLeft. A slightly more generalized dfIn is used as the sample data:

import org.apache.spark.sql.functions._

val dfIn = Seq(
  (123456, "John Citizen", "Blue", "Scorpio", "a", 1),
  (990087, "Jane Simth", "Green", "Taurus", "b", 2)
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign", "s.c1", "s.c2")

val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
//   Array(custom_fields.fav_colour, custom_fields.star_sign, s.c1, s.c2)

val structColsMap = structCols.map(_.split("\\.")).
  groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
//   Map(s -> Array(c1, c2), custom_fields -> Array(fav_colour, star_sign))

val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "." + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}

val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)

dfResult.show
// +------+------------+-----+--------------+
// |id    |name        |s    |custom_fields |
// +------+------------+-----+--------------+
// |123456|John Citizen|[a,1]|[Blue,Scorpio]|
// |990087|Jane Simth  |[b,2]|[Green,Taurus]|
// +------+------------+-----+--------------+

dfResult.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- s: struct (nullable = false)
//  |    |-- c1: string (nullable = true)
//  |    |-- c2: integer (nullable = false)
//  |-- custom_fields: struct (nullable = false)
//  |    |-- fav_colour: string (nullable = true)
//  |    |-- star_sign: string (nullable = true)

Note that this solution handles only up to one level of nesting.

To convert each row into JSON format, consider using toJSON, as shown below:

dfResult.toJSON.show(false)
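
As an aside not in the original answer: if paths deeper than one level (e.g. a.b.c) ever need to be supported, the dotted name segments could be grouped recursively into nested struct columns. A rough sketch, assuming no column name is both a full name and a prefix of another (the helper nest is hypothetical):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Recursively fold dotted name segments into nested struct() columns.
// Note: groupBy is unordered, so the output column order may differ
// from the input order.
def nest(cols: Seq[(Seq[String], Column)]): Seq[Column] =
  cols.groupBy(_._1.head).toSeq.map { case (name, group) =>
    if (group.size == 1 && group.head._1.size == 1) group.head._2.as(name)
    else struct(nest(group.map { case (path, c) => (path.tail, c) }): _*).as(name)
  }

val dfNestedDeep = dfIn.select(
  nest(dfIn.columns.toSeq.map(c => (c.split("\\.").toSeq, col(s"`$c`")))): _*
)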

Answer 2 (score: 1)

This solution works for the revised requirement that the JSON output should consist of an array of {K: valueK, V: valueV} rather than {valueK1: valueV1, valueK2: valueV2, ...}. For example:

// FROM:
"custom_fields":{"fav_colour":"Blue", "star_sign":"Scorpio"}

// TO:
"custom_fields":[{"key":"fav_colour", "value":"Blue"}, {"key":"star_sign", "value":"Scorpio"}]

Sample code below:

import org.apache.spark.sql.functions._

val dfIn = Seq(
  (123456, "John Citizen", "Blue", "Scorpio"),
  (990087, "Jane Simth", "Green", "Taurus")
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign")

val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
//   Array(custom_fields.fav_colour, custom_fields.star_sign)

val structColsMap = structCols.map(_.split("\\.")).
  groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
//   Map(custom_fields -> Array(fav_colour, star_sign))

val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map( v =>
    struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").as("value"))
  )
  accDF.withColumn(kv._1, array(cols: _*))
}

val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)

dfResult.show(false)
// +------+------------+----------------------------------------+
// |id    |name        |custom_fields                           |
// +------+------------+----------------------------------------+
// |123456|John Citizen|[[fav_colour,Blue], [star_sign,Scorpio]]|
// |990087|Jane Simth  |[[fav_colour,Green], [star_sign,Taurus]]|
// +------+------------+----------------------------------------+

dfResult.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- custom_fields: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- key: string (nullable = false)
//  |    |    |-- value: string (nullable = true)

dfResult.toJSON.show(false)
// +-------------------------------------------------------------------------------------------------------------------------------+
// |value                                                                                                                          |
// +-------------------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","custom_fields":[{"key":"fav_colour","value":"Blue"},{"key":"star_sign","value":"Scorpio"}]}|
// |{"id":990087,"name":"Jane Simth","custom_fields":[{"key":"fav_colour","value":"Green"},{"key":"star_sign","value":"Taurus"}]}  |
// +-------------------------------------------------------------------------------------------------------------------------------+

Note that since the Spark DataFrame API does not support the Any type, we cannot make value of type Any to accommodate a mix of different value types. As a result, the value in the array must be of a single given type (e.g., String). Like the previous solution, this one also handles only up to one level of nesting.
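
If the flat columns under a single prefix do mix value types, one workaround consistent with that constraint (not from the original answer) is to cast each value to a common type before wrapping it. A minimal sketch reusing structColsMap from the code above:

import org.apache.spark.sql.functions._

// Same fold as above, but each value is cast to String so that
// mixed-type prefixes can share one array element type.
val dfExpandedStr = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map( v =>
    struct(
      lit(v).as("key"),
      col("`" + kv._1 + "." + v + "`").cast("string").as("value")
    )
  )
  accDF.withColumn(kv._1, array(cols: _*))
}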