Question

I have a bunch of files in which each line looks like this:

some random non json stuff here {"timestmap":21212121, "name":"John"}

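Because of the random content in front of the JSON data, I can't read these files as JSON.

What is the best way to strip out the random data so that I can load the JSON into a DataFrame with the appropriate columns?

The end goal is a final DataFrame that contains only the rows whose timestamp falls between specific dates.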

Answer 1

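This solution uses

instr to find the positions of the JSON braces { and }
substr to extract the substring between the braces (the JSON text)

It then applies from_json together with a schema that defines the expected JSON structure.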
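One detail worth checking before relying on this approach: Spark's instr returns a 1-based position (0 when the substring is absent), and Column.substr takes a start position and a length, not a start and an end position. The following is a minimal plain-Python sketch of that index arithmetic. The helper functions only mimic the Spark semantics for illustration, and the word "trailing" appended to the sample line is a hypothetical addition that makes the difference visible.

```python
def instr(text: str, needle: str) -> int:
    """Mimic Spark's instr: 1-based position of the first match, 0 if absent."""
    return text.find(needle) + 1


def substr(text: str, start: int, length: int) -> str:
    """Mimic Column.substr(startPos, length) for a 1-based start >= 1."""
    return text[start - 1:start - 1 + length]


# Sample line from the question, plus a hypothetical " trailing" suffix
line = 'some random non json stuff here {"timestmap":21212121, "name":"John"} trailing'

start = instr(line, "{")
end = instr(line, "}")

# Passing the end position as the length overshoots and keeps the trailing text
too_long = substr(line, start, end)

# The JSON object spans end - start + 1 characters
exact = substr(line, start, end - start + 1)

print(too_long)
print(exact)
```

When nothing follows the closing brace, both variants return the same text, which is why using the end position as the length can appear to work on the sample file.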

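from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, instr
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Assumption: each line of random.txt is read into a single string column named "value"
spark = SparkSession.builder.getOrCreate()
df = spark.read.text("random.txt")

# Expected JSON schema
schema = StructType([StructField("timestmap", TimestampType()),
                     StructField("name", StringType())])

# Filtering and parsing.
# substr takes (start, length), so the length is computed from the two brace positions
start = instr(df.value, '{')
length = instr(df.value, '}') - start + 1
parsed = df.select(from_json(df.value.substr(start, length), schema).alias("json"))

# Flatten the struct into top-level columns
# (select("json.*") would expand every field of the struct in one go)
parsed = parsed.select(col("json.timestmap").alias("timestmap"),
                       col("json.name").alias("name"))

parsed.printSchema()
parsed.show()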

The result is

root
 |-- timestmap: timestamp (nullable = true)
 |-- name: string (nullable = true)

+-------------------+----+
|          timestmap|name|
+-------------------+----+
|1970-09-03 12:15:21|John|
|1970-09-03 12:15:22| Doe|
+-------------------+----+

The sample text file random.txt is

some random non json stuff here {"timestmap":21212121, "name":"John"}
some other random non json stuff here {"timestmap":21212122, "name":"Doe"}
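
The question also asks for only the rows whose timestamp falls between specific dates, which the answer does not show. The following is a minimal plain-Python sketch of that step on the two sample lines, assuming the timestmap values are Unix epoch seconds in UTC (consistent with the output shown above). The bounds lower and upper are hypothetical values chosen for illustration.

```python
import json
from datetime import datetime, timezone

lines = [
    'some random non json stuff here {"timestmap":21212121, "name":"John"}',
    'some other random non json stuff here {"timestmap":21212122, "name":"Doe"}',
]


def parse(line: str) -> dict:
    """Cut out the text between the first '{' and the last '}' and parse it."""
    start = line.index("{")
    end = line.rindex("}")
    record = json.loads(line[start:end + 1])
    # Assumption: the value is epoch seconds, interpreted as UTC
    record["timestmap"] = datetime.fromtimestamp(record["timestmap"], tz=timezone.utc)
    return record


records = [parse(line) for line in lines]

# Hypothetical inclusive bounds for the filter
lower = datetime(1970, 9, 3, 12, 15, 22, tzinfo=timezone.utc)
upper = datetime(1970, 9, 4, tzinfo=timezone.utc)

selected = [r for r in records if lower <= r["timestmap"] <= upper]

for r in selected:
    print(r["timestmap"].isoformat(), r["name"])
```

On a Spark DataFrame, the same inclusive range check could be expressed with Column.between on the timestmap column. How timestamp literals are interpreted there depends on the session time zone, so the bounds may need adjusting.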
