Question

I have created a PySpark application that reads a JSON file into a DataFrame using a defined schema. Sample code below:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])
df = sqlContext.read.json(file, schema)

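I need a way to define this schema in some kind of config or ini file and read it in the main PySpark application.

That would let me update the schema when the JSON changes in the future, without touching the main PySpark code.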

Answer 1

StructType provides the json and jsonValue methods, which return its JSON string and dict representations respectively, and fromJson, which converts a Python dictionary into a StructType.

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),                            
])

StructType.fromJson(schema.jsonValue())

Apart from that, the only thing you need is the built-in json module, to parse the input into a dict that StructType can consume.
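The round trip through a file could look roughly like the sketch below. The helper names (save_schema, load_schema_dict, to_struct_type) and the temporary file path are made up for illustration, and the PySpark import is deferred into to_struct_type so the file handling can be tried without a Spark installation. The dict literal is what schema.jsonValue() returns for the domain/timestamp schema above.

```python
import json
import os
import tempfile

# The dict that schema.jsonValue() returns for the domain/timestamp schema.
schema_dict = {
    "type": "struct",
    "fields": [
        {"name": "domain", "type": "string", "nullable": True, "metadata": {}},
        {"name": "timestamp", "type": "long", "nullable": True, "metadata": {}},
    ],
}


def save_schema(schema_dict, path):
    """Write the schema dict to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema_dict, f, indent=2)


def load_schema_dict(path):
    """Read the JSON file back into a plain dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_struct_type(schema_dict):
    """Convert the dict into a StructType; PySpark is imported lazily."""
    from pyspark.sql.types import StructType
    return StructType.fromJson(schema_dict)


# Round trip through a temporary file; the location is an assumption.
schema_path = os.path.join(tempfile.mkdtemp(), "schema.json")
save_schema(schema_dict, schema_path)
loaded = load_schema_dict(schema_path)
```

In the application itself, the loaded dict would then be passed on as sqlContext.read.json(file, to_struct_type(loaded)).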

For the Scala version, see How to create a schema from CSV file and persist/save that schema to a file?
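Since the question also mentions ini files, here is one possible sketch that builds the same dict from an ini section using the standard configparser module. The [schema] section layout (field name = Spark type name), the all-nullable default, and the function name schema_dict_from_ini are assumptions made for this example, not something Spark defines. The resulting dict has the shape that StructType.fromJson accepts.

```python
import configparser

# Hypothetical ini layout: one "field name = type name" pair per line.
INI_TEXT = """
[schema]
domain = string
timestamp = long
"""


def schema_dict_from_ini(text, section="schema"):
    """Build a StructType-style dict from an ini section (illustrative)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep field names case-sensitive
    parser.read_string(text)
    fields = [
        # Every field is treated as nullable with empty metadata (assumption).
        {"name": name, "type": type_name.strip(), "nullable": True, "metadata": {}}
        for name, type_name in parser[section].items()
    ]
    return {"type": "struct", "fields": fields}


ini_schema = schema_dict_from_ini(INI_TEXT)
```

A flat layout like this only covers simple column types; nested structs or arrays are easier to express in the JSON form.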

Answer 2

You can create a JSON file named schema.json in the following format:

{
  "fields": [
    {
      "metadata": {},
      "name": "first_fields",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "double_field",
      "nullable": true,
      "type": "double"
    }
  ],
  "type": "struct"
}
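Because a hand-edited schema file is easy to get wrong, a small sanity check before handing it to Spark may help. The sketch below is only an illustration: check_schema_dict is a hypothetical helper, and it checks just the top-level "type" plus each field's "name" and "type" keys. Which keys fromJson actually requires may differ between Spark versions.

```python
import json

# Same content as the schema.json example above.
SCHEMA_TEXT = """
{
  "fields": [
    {"metadata": {}, "name": "first_fields", "nullable": true, "type": "string"},
    {"metadata": {}, "name": "double_field", "nullable": true, "type": "double"}
  ],
  "type": "struct"
}
"""


def check_schema_dict(schema_dict):
    """Return a list of problems found in the schema dict (empty if none)."""
    problems = []
    if schema_dict.get("type") != "struct":
        problems.append('top-level "type" should be "struct"')
    fields = schema_dict.get("fields")
    if not isinstance(fields, list):
        problems.append('"fields" should be a list')
        return problems
    for i, field in enumerate(fields):
        for key in ("name", "type"):
            if key not in field:
                problems.append(f'field {i} is missing "{key}"')
    return problems


problems = check_schema_dict(json.loads(SCHEMA_TEXT))
```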

Create the struct schema by reading this file:

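import json
from pyspark.sql.types import StructType

# wholeTextFiles yields (path, content) pairs; take the content of the first file
rdd = spark.sparkContext.wholeTextFiles("s3://<bucket>/schema.json")
text = rdd.collect()[0][1]
schema_dict = json.loads(str(text))  # renamed to avoid shadowing the built-in dict
custom_schema = StructType.fromJson(schema_dict)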

After that, you can use the struct as the schema when reading the JSON file:

df = spark.read.json("path", custom_schema)

Config file to define JSON Schema Structure in PySpark
