I have data in the following format:
customer_id | model |
---|---|
1 | [{color: 'red', group: 'A'},{color: 'green', group: 'B'}] |
2 | [{color: 'red', group: 'A'}] |
I need to process it to create a new DataFrame with the following output:
customer_id | color | group |
---|---|---|
1 | red | A |
1 | green | B |
2 | red | A |
I can do this easily in Python with pandas:
import json
import pandas as pd

# Collect one record per array element, then build the DataFrame once.
# (DataFrame.append was removed in pandas 2.0, and appending inside a loop is slow.)
records = []
for index, row in df.iterrows():
    # Parse the JSON string into a list of dicts
    items = json.loads(row['model'])
    for item in items:
        records.append({'customer_id': row['customer_id'],
                        'group': item['group'],
                        'color': item['color']})
newdf = pd.DataFrame(records)
How can I achieve the same result with PySpark?
Here are the first rows and the schema of the original DataFrame:
+-----------+--------------------+
|customer_id| model |
+-----------+--------------------+
| 3541|[{"score":0.04767...|
| 171811|[{"score":0.04473...|
| 12008|[{"score":0.08043...|
| 78964|[{"score":0.06669...|
| 119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows
root
|-- user_id: integer (nullable = true)
|-- groups: string (nullable = true)
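Answer 0 (score: 3)
from_json can parse a string column that contains JSON data. Parse the column into an array of maps, explode the array into one row per element, then pull out the fields you need: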
from pyspark.sql import functions as F
from pyspark.sql import types as T
data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
        [2, "[{color: 'red', group: 'A'}]"]]

df = spark.createDataFrame(data, schema=["customer_id", "model"]) \
    .withColumn("model", F.from_json("model", T.ArrayType(T.MapType(T.StringType(), T.StringType())), {"allowUnquotedFieldNames": True})) \
    .withColumn("model", F.explode("model")) \
    .withColumn("color", F.col("model")["color"]) \
    .withColumn("group", F.col("model")["group"]) \
    .drop("model")
Result:
+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
| 1| red| A|
| 1|green| B|
| 2| red| A|
+-----------+-----+-----+
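
To make the parse-then-explode logic concrete, here is a minimal plain-Python sketch of the same transformation with no Spark involved. It assumes the strings are strictly valid JSON with quoted keys (as in the real data shown in the question); the sample rows and the helper name flatten are made up for illustration.

```python
import json

# Hypothetical sample rows: (customer_id, JSON string with quoted keys)
rows = [
    (1, '[{"color": "red", "group": "A"}, {"color": "green", "group": "B"}]'),
    (2, '[{"color": "red", "group": "A"}]'),
]


def flatten(rows):
    """Yield one (customer_id, color, group) tuple per array element."""
    for customer_id, model in rows:
        # Parse step: roughly what from_json does for a single row
        for item in json.loads(model):
            # Explode step plus field access: one output row per element
            yield (customer_id, item["color"], item["group"])


flat = list(flatten(rows))
for record in flat:
    print(record)
```

Each input row produces as many output rows as its array has elements, which is the same row multiplication that F.explode performs on the parsed column.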