PySpark dataframe with JSON, iterating to create a new dataframe

Asked: 2021-04-28 19:13:26

Tags: json pyspark

I have data in the following format:

customer_id | model
----------- | -----
1 | [{color: 'red', group: 'A'},{color: 'green', group: 'B'}]
2 | [{color: 'red', group: 'A'}]

I need to process it to create a new dataframe with the following output:

customer_id | color | group
----------- | ----- | -----
1 | red | A
1 | green | B
2 | red | A

Right now I can do this easily in plain Python:

import pandas as pd
import json

frames = []

for index, row in df.iterrows():
    # Each 'model' value is a JSON array of {color, group} objects,
    # stored as a string.
    x = json.loads(row['model'])

    colors_list = []
    users_list = []
    groups_list = []

    for item in x:
        colors_list.append(item['color'])
        users_list.append(row['user_id'])
        groups_list.append(item['group'])

    frames.append(pd.DataFrame({'customer_id': users_list,
                                'group': groups_list,
                                'color': colors_list}))

# DataFrame.append is deprecated; concatenate the per-row frames instead.
newdf = pd.concat(frames, ignore_index=True)

How can I achieve the same result with PySpark?

Here are the first few rows and the schema of the original dataframe:

+-----------+--------------------+
|customer_id|              model |
+-----------+--------------------+
|       3541|[{"score":0.04767...|
|     171811|[{"score":0.04473...|
|      12008|[{"score":0.08043...|
|      78964|[{"score":0.06669...|
|     119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- user_id: integer (nullable = true)
 |-- groups: string (nullable = true)

1 Answer:

Answer 0 (score: 3):

from_json can parse a string column that contains JSON data:

from pyspark.sql import functions as F
from pyspark.sql import types as T

data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
        [2, "[{color: 'red', group: 'A'}]"]]

df = (
    spark.createDataFrame(data, schema=["customer_id", "model"])
    # Parse the JSON string into an array of string-to-string maps.
    # allowUnquotedFieldNames is required because the keys are unquoted;
    # the single-quoted values are accepted by the parser by default.
    .withColumn("model", F.from_json(
        "model",
        T.ArrayType(T.MapType(T.StringType(), T.StringType())),
        {"allowUnquotedFieldNames": True}))
    # One output row per array element.
    .withColumn("model", F.explode("model"))
    .withColumn("color", F.col("model")["color"])
    .withColumn("group", F.col("model")["group"])
    .drop("model")
)

Result:

+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
|          1|  red|    A|
|          1|green|    B|
|          2|  red|    A|
+-----------+-----+-----+
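
As a follow-up, a minimal sketch of an alternative (assuming df still holds the raw JSON string in its model column; the variable names here are illustrative, not from the original post): instead of a map, from_json can be given an explicit ArrayType(StructType(...)) schema, so that color and group come back as typed struct fields:

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Assumed element schema: each array entry is an object with the
# string fields "color" and "group".
model_schema = T.ArrayType(T.StructType([
    T.StructField("color", T.StringType()),
    T.StructField("group", T.StringType()),
]))

result = (
    df.withColumn("model", F.from_json("model", model_schema,
                                       {"allowUnquotedFieldNames": True}))
      # One row per array element, then pull the struct fields out.
      .select("customer_id", F.explode("model").alias("m"))
      .select("customer_id",
              F.col("m.color").alias("color"),
              F.col("m.group").alias("group"))
)

With an explicit struct schema, a field missing from the JSON shows up as null instead of silently disappearing from the map, which tends to surface data problems earlier.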