MapType with mixed value types

Time: 2017-09-02 05:36:28

Tags: pyspark pyspark-sql

I have JSON input like this:


    {
      "1": {
        "id": 1,
        "value": 5
      },
      "2": {
        "id": 2,
        "list": {
          "10": {
            "id": 10
          },
          "11": {
            "id": 11
          },
          "20": {
            "id": 20
          }
        }
      },
      "3": {
        "id": 3,
        "key": "a"
      }
    }

I need to merge the 3 columns and extract the needed value from each; this is the output I need:


    {
      "out": {
        "1": 5,
        "2": [10, 11, 20],
        "3": "a"
      }
    }

I tried to create a UDF to convert these 3 columns into 1, but I can't figure out how to define a MapType() with mixed value types: IntegerType(), ArrayType(IntegerType()), and StringType() respectively.

Thanks in advance!

2 answers:

Answer 0 (score: 2)

You need to define the result type of your UDF with StructType rather than MapType, like this:

from pyspark.sql.types import *

udf_result = StructType([
    StructField('1', IntegerType()),
    StructField('2', ArrayType(IntegerType())),
    StructField('3', StringType())
])
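
With that result type in place, the UDF itself might look like the sketch below. This is an illustration rather than part of the original answer: merge_udf is a hypothetical name, and it assumes df holds the JSON input from the question. Inside a UDF each struct column arrives as a Row, and iterating a Row yields its field values.

import pyspark.sql.functions as psf

# Sketch only: merge_udf is a hypothetical name. Each struct column is
# passed to the UDF as a Row; iterating a Row yields its field values.
merge_udf = psf.udf(
    lambda c1, c2, c3: (
        c1["value"],                    # "1" -> 5
        [r["id"] for r in c2["list"]],  # "2" -> [10, 11, 20]
        c3["key"]                       # "3" -> "a"
    ),
    udf_result
)
df.select(merge_udf("1", "2", "3").alias("out"))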

Answer 1 (score: 1)

MapType() is for (key, value) pair definitions, where every key maps to a value of one single type; it is not meant for nested dataframes. What you are looking for is StructType().
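
To make the contrast concrete, a minimal sketch (my addition, not from the original answer):

from pyspark.sql.types import MapType, StringType, IntegerType

# A MapType fixes a single value type for every key, e.g.:
homogeneous = MapType(StringType(), IntegerType())
# so one map cannot hold 5, [10, 11, 20] and "a" under different keys,
# whereas a StructType declares a type per named field.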

You could load it directly with createDataFrame, but you would have to pass the schema, so this way is easier:

import json

data_json = {
    "1": {
        "id": 1,
        "value": 5
    },
    "2": {
        "id": 2,
        "list": {
            "10": {
                "id": 10
            },
            "11": {
                "id": 11
            },
            "20": {
                "id": 20
            }
        }
    },
    "3": {
        "id": 3,
        "key": "a"
    }
}
a = [json.dumps(data_json)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
df.printSchema()

    root
     |-- 1: struct (nullable = true)
     |    |-- id: long (nullable = true)
     |    |-- value: long (nullable = true)
     |-- 2: struct (nullable = true)
     |    |-- id: long (nullable = true)
     |    |-- list: struct (nullable = true)
     |    |    |-- 10: struct (nullable = true)
     |    |    |    |-- id: long (nullable = true)
     |    |    |-- 11: struct (nullable = true)
     |    |    |    |-- id: long (nullable = true)
     |    |    |-- 20: struct (nullable = true)
     |    |    |    |-- id: long (nullable = true)
     |-- 3: struct (nullable = true)
     |    |-- id: long (nullable = true)
     |    |-- key: string (nullable = true)
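
For comparison, the createDataFrame route mentioned above might look roughly like the sketch below (my addition; partial_schema is a hypothetical name, abridged to column "1" since the full schema would repeat the pattern for "2" and "3"):

from pyspark.sql.types import StructType, StructField, LongType

# Sketch only: partial_schema is a hypothetical name, abridged to column
# "1"; the full schema would spell out "2" and "3" the same way.
partial_schema = StructType([
    StructField("1", StructType([
        StructField("id", LongType()),
        StructField("value", LongType())
    ]))
])
df1 = spark.createDataFrame([{"1": data_json["1"]}], schema=partial_schema)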

Now to access the nested dataframes. Note that column "2" is nested one level deeper than the others:

nested_cols = ["2"]
cols = ["1", "3"]
import pyspark.sql.functions as psf
df = df.select(
    cols + [psf.array(psf.col(c + ".list.*")).alias(c) for c in nested_cols]
)
df = df.select(
    [df[c].id.alias(c) for c in df.columns]
)
df.printSchema()

    root
     |-- 1: long (nullable = true)
     |-- 3: long (nullable = true)
     |-- 2: array (nullable = false)
     |    |-- element: long (containsNull = true)

This is not quite your final output yet, since you want it nested inside an "out" column:

import pyspark.sql.functions as psf
df.select(psf.struct("*").alias("out")).printSchema()

    root
     |-- out: struct (nullable = false)
     |    |-- 1: long (nullable = true)
     |    |-- 3: long (nullable = true)
     |    |-- 2: array (nullable = false)
     |    |    |-- element: long (containsNull = true)

And finally, back to JSON:

df.toJSON().first()

    '{"1":1,"3":3,"2":[10,11,20]}'