Convert a DataFrame into an array of structs of column names and values

Asked: 2019-05-10 13:34:53

Tags: scala apache-spark apache-spark-sql

Suppose I have a DataFrame like this:

val customer = Seq(
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C2", "Harry Smith", 30, "Beavercreek", "M"),
    ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
    ("C4", "John Chan", 26, "Dayton","M")
  ).toDF("cid","name","age","city","sex")

How can I get the cid value in one column and all the remaining values in a second column of type array<struct<column_name, column_value>>?

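3 Answers:

Answer 0 (score: 4)

The only difficulty is that an array must contain elements of the same type. You therefore need to cast every column to string before putting it into the array (age is an int). Here is how: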

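import org.apache.spark.sql.functions.{array, col, lit, struct}

// Every column except the first one (cid) becomes a (name, value) struct.
val cols = customer.columns.tail
val result = customer.select('cid,
    array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")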

result.show(false)

+---+-----------------------------------------------------------+
|cid|array                                                      |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]]     |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]]       |
+---+-----------------------------------------------------------+

result.printSchema()

root
 |-- cid: string (nullable = true)
 |-- array: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- value: string (nullable = true)
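The same approach can be wrapped in a small reusable function. The following is a minimal sketch, not part of the original answer: the helper name `toNameValueArray`, the output column name `attributes`, and the local `SparkSession` settings are all assumptions made for illustration. The struct fields use the `column_name` / `column_value` names from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct}

object NameValueArray {
  // Hypothetical helper: keeps keyCol and packs every other column into
  // array<struct<column_name: string, column_value: string>>.
  def toNameValueArray(df: DataFrame, keyCol: String): DataFrame = {
    val pairs = df.columns.filter(_ != keyCol).map { c =>
      // Cast to string so that all structs in the array share one type.
      struct(lit(c).as("column_name"), col(c).cast("string").as("column_value"))
    }
    df.select(col(keyCol), array(pairs: _*).as("attributes"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("name-value-array").getOrCreate()
    import spark.implicits._

    val customer = Seq(
      ("C1", "Jackie Chan", 50, "Dayton", "M"),
      ("C2", "Harry Smith", 30, "Beavercreek", "M")
    ).toDF("cid", "name", "age", "city", "sex")

    toNameValueArray(customer, "cid").printSchema()
    spark.stop()
  }
}
```

Filtering by column name instead of using `columns.tail` means the key column does not have to be the first one.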

Answer 1 (score: 2)

You can use the array and struct functions:

customer.select($"cid", array(
    struct(lit("name") as "column_name", $"name" as "column_value"),
    struct(lit("age") as "column_name", $"age" as "column_value")))

This yields:

 |-- cid: string (nullable = true)
 |-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- column_name: string (nullable = false)
 |    |    |-- column_value: string (nullable = true)
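To read the values back out of such a column, the array can be exploded into one row per struct. The sketch below is an illustration under assumptions, not part of the original answer: it aliases the array as `attributes` (to avoid the long auto-generated column name shown above), casts age to string explicitly, and uses a hypothetical local `SparkSession`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object ExplodeNameValue {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("explode-name-value").getOrCreate()
    import spark.implicits._

    val customer = Seq(("C1", "Jackie Chan", 50), ("C2", "Harry Smith", 30))
      .toDF("cid", "name", "age")

    // Pack two columns into an array of (column_name, column_value) structs.
    val packed = customer.select(
      col("cid"),
      array(
        struct(lit("name").as("column_name"), col("name").as("column_value")),
        struct(lit("age").as("column_name"), col("age").cast("string").as("column_value"))
      ).as("attributes")
    )

    // One output row per array element, with the struct fields as plain columns.
    val unpacked = packed
      .select(col("cid"), explode(col("attributes")).as("kv"))
      .select(col("cid"), col("kv.column_name"), col("kv.column_value"))

    unpacked.show(false)
    spark.stop()
  }
}
```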

Answer 2 (score: 1)

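A map column may be a better way to solve the overall problem. A map still has a single value type, so mixed values such as name and age end up as strings, but Spark will normally coerce them to a common type for you, so you do not have to write the string casts yourself.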

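from pyspark.sql.functions import col, create_map, lit

# PySpark syntax; df is the customer DataFrame from the question.
df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"), col("sex")
               ).alias('map_col')
  )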

Or wrap the map column in an array if you need one.
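Since the question is tagged scala, here is a rough Scala sketch of the same idea using the `map` function from `org.apache.spark.sql.functions`. It is an illustration under assumptions rather than the answerer's code: the column names `map_col` and `maps`, the explicit string cast, and the local `SparkSession` are choices made for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, lit, map}

object MapColumnSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder().master("local[1]").appName("map-column").getOrCreate()
    import spark.implicits._

    val customer = Seq(
      ("C1", "Jackie Chan", 50, "Dayton", "M"),
      ("C2", "Harry Smith", 30, "Beavercreek", "M")
    ).toDF("cid", "name", "age", "city", "sex")

    // map() takes alternating key and value columns: name, <name>, age, <age>, ...
    // The explicit cast makes the single value type (string) visible.
    val keyValues = customer.columns.tail.flatMap(c => Seq(lit(c), col(c).cast("string")))
    val withMap = customer.select(col("cid"), map(keyValues: _*).as("map_col"))

    // Wrap the map in a one-element array if an array column is required.
    val wrapped = withMap.select(col("cid"), array(col("map_col")).as("maps"))
    wrapped.printSchema()

    // Look up a single key and cast it back to a number when needed.
    withMap.select(col("cid"), col("map_col").getItem("age").cast("int").as("age")).show()

    spark.stop()
  }
}
```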

With a map you can still apply numeric or string transformations to the relevant keys or values. For example:

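from pyspark.sql.functions import col, create_map, lit, map_concat, when

# PySpark syntax; map_concat requires Spark 2.4 or later.
with_map = df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"), col("sex")
               ).alias('map_col')
  )
# Merge a derived entry into the existing map. Both maps need the same
# value type, so the boolean flag is cast to string here.
with_map.select('*',
      map_concat(col('map_col'),
                 create_map(lit('u_age'), when(col('map_col')['age'] < 18, True).cast('string')))
)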

Hope this helps. I typed this straight into the answer box, so forgive me if a bracket is missing somewhere.