Spark: split a JSON string into separate dataframe columns

Asked: 2019-12-02 17:12:47

Tags: json scala apache-spark apache-spark-sql

I am loading the JSON string below into a dataframe column.

{
    "title": {
        "titleid": "222",
        "titlename": "ABCD"
    },
    "customer": {
        "customerDetail": {
            "customerid": 878378743,
            "customerstatus": "ACTIVE",
            "customersystems": {
                "customersystem1": "SYS01",
                "customersystem2": null
            },
            "sysid": null
        },
        "persons": [{
            "personid": "123",
            "personname": "IIISKDJKJSD"
        },
        {
            "personid": "456",
            "personname": "IUDFIDIKJK"
        }]
    }
}

import org.apache.spark.sql.functions.from_json

// Infer the schema from a sample file, then parse the string column with it.
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
// df holds the raw JSON string in its "value" column.
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)

// Schema:

StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))

// Output:

+------------------------------+---------------------------------------+
|customerDetail                |persons                                |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+
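For completeness, the df referenced in the snippet above is not defined in the post; a minimal sketch of how it could be built, assuming the raw JSON sits in a text file (spark.read.text names its single column value, and the wholetext option keeps a multi-line document in one row):

// Sketch only: one-row DataFrame whose "value" column holds the whole JSON string.
val df = spark.read
  .option("wholetext", "true")
  .text("./src/main/resources/json/customer.txt")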

My question: is there a way to split the key values into separate dataframe columns as shown below, keeping the array columns as they are? I only need one record per JSON string.

Sample output for the customer column:

customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }

2 Answers:

Answer 0 (score: 2)

Edited answer:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

val df = spark.read.json("your/path/data.json")

// Recursively walk the schema, collecting the dotted path of every leaf field;
// arrays and other non-struct types are treated as leaves and kept intact.
def collectFields(field: String, dt: DataType): Seq[String] = {
  dt match {
    case st: StructType => st.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _ => Seq(field)
  }
}

// Drop the leading "." left over from the empty initial prefix.
val fields = collectFields("", df.schema).map(_.tail)

df.select(fields.map(col): _*).show(false)

Output:

+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons                              |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE        |SYS01          |null           |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222    |ABCD     |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
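Note that this flattening names each output column after the leaf field only, so two leaves sharing a name under different parents would collide. A sketch of a variant closer to the output asked for in the question, keeping the full dotted path as the column name and rendering the persons array as a JSON string (the aliasing and the to_json call are additions to the answer above, assuming Spark 2.2+ where to_json accepts an array of structs):

import org.apache.spark.sql.functions.{col, to_json}

// Alias each leaf by its full dotted path to avoid collisions, and
// serialize the persons array back to a JSON string for single-row output.
val flattened = df.select(fields.map { f =>
  if (f == "customer.persons") to_json(col(f)).as(f)
  else col(f).as(f)
}: _*)
flattened.show(false)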

Answer 1 (score: 0)

You could also try this with the help of RDDs: define the column names up front, read the JSON and convert it to a DataFrame with .toDF(), then iterate the result back through an RDD.
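The answer gives no code; a rough sketch of one way to read it, selecting the desired leaves and rebuilding a DataFrame through the RDD API (the column list and names below are illustrative, not taken from the answer):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative target schema; the answer does not spell one out.
val cols = Seq("customerid", "customerstatus")
val schema = StructType(cols.map(c => StructField(c, StringType, nullable = true)))

val parsed = spark.read.json("your/path/data.json")
// Pull the desired leaves out of each parsed row, then rebuild via an RDD of Rows.
val rdd = parsed.select(
  col("customer.customerDetail.customerid").cast(StringType),
  col("customer.customerDetail.customerstatus")
).rdd.map(r => Row(r.getString(0), r.getString(1)))

spark.createDataFrame(rdd, schema).show(false)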
