Create new columns from a JSON column

Time: 2019-03-06 13:06:02

Tags: python pandas

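I have a DataFrame with a column, event_json, that holds JSON objects as strings (the objects do not all have the same keys). I want to split this column into new columns, one per JSON key.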

Create the df:

import json
import pandas as pd

d = [{'event_datetime': '2019-01-08 00:09:30',
  'event_json': '{"lvl":"450","tok":"1212","snum":"257","udid":"122112"}',
  'event_name': 'AdsClick'},
 {'event_datetime': '2019-01-08 00:43:21',
  'event_json': '{"lvl":"902","udid":"3123","tok":"4214","snum":"1387"}',
  'event_name': 'AdsClick'},
 {'event_datetime': '2019-02-08 00:05:01',
  'event_json': '{"lvl":"1415","udid":"214124","tok":"213123","snum":"2416","col12":"2416","col13":"2416"}'}]

# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
df12 = pd.json_normalize(d)

Sample:

        event_datetime                                         event_json  event_name
0  2019-01-08 00:09:30  {"lvl":"450","tok":"1212","snum":"257","udid":...    AdsClick
1  2019-01-08 00:43:21  {"lvl":"902","udid":"3123","tok":"4214","snum"...    AdsClick
2  2019-02-08 00:05:01  {"lvl":"1415","udid":"214124","tok":"213123","...         NaN

Currently I use this code:

df12 = df12.merge(df12['event_json'].apply(lambda x: pd.Series(json.loads(x))), left_index=True, right_index=True)

Result:

        event_datetime                                         event_json  event_name   lvl  snum     tok    udid  col12  col13
0  2019-01-08 00:09:30  {"lvl":"450","tok":"1212","snum":"257","udid":...    AdsClick   450   257    1212  122112    NaN    NaN
1  2019-01-08 00:43:21  {"lvl":"902","udid":"3123","tok":"4214","snum"...    AdsClick   902  1387    4214    3123    NaN    NaN
2  2019-02-08 00:05:01  {"lvl":"1415","udid":"214124","tok":"213123","...         NaN  1415  2416  213123  214124   2416   2416

But it is very slow. Do you have any ideas for faster code?

1 Answer:

Answer 0: (score: 1)

Use a list comprehension with the DataFrame constructor, then add the result to the original DataFrame with DataFrame.join:

df = df12.join(pd.DataFrame([json.loads(x) for x in df12['event_json']]))
print (df)
        event_datetime                                         event_json  \
0  2019-01-08 00:09:30  {"lvl":"450","tok":"1212","snum":"257","udid":...   
1  2019-01-08 00:43:21  {"lvl":"902","udid":"3123","tok":"4214","snum"...   
2  2019-02-08 00:05:01  {"lvl":"1415","udid":"214124","tok":"213123","...   

  event_name col12 col13   lvl  snum     tok    udid  
0   AdsClick   NaN   NaN   450   257    1212  122112  
1   AdsClick   NaN   NaN   902  1387    4214    3123  
2        NaN  2416  2416  1415  2416  213123  214124  
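
One detail worth keeping in mind: DataFrame.join aligns rows on the index, and the DataFrame built from the list gets a fresh 0..n-1 index. That matches df12 here, but it would not after filtering or sorting. A minimal sketch of a guard, assuming a hypothetical df_demo with a non-default index (the data and variable names are made up for illustration):

```python
import json
import pandas as pd

# Hypothetical data with a non-default index (e.g. left over after filtering).
df_demo = pd.DataFrame(
    {
        "event_name": ["AdsClick", "AdsClick"],
        "event_json": ['{"lvl":"450","tok":"1212"}', '{"lvl":"902","col12":"7"}'],
    },
    index=[10, 20],
)

# Parse each JSON string once, then build all new columns in one constructor call.
parsed = pd.DataFrame(
    [json.loads(x) for x in df_demo["event_json"]],
    index=df_demo.index,  # reuse the original labels so join lines the rows up
)

expanded = df_demo.join(parsed)
print(expanded)
```

Keys missing from a given JSON object simply end up as NaN in that row, the same as in the output above.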

If you also need to remove the source column, use DataFrame.pop:

df = df12.join(pd.DataFrame([json.loads(x) for x in df12.pop('event_json')]))
print (df)
        event_datetime event_name col12 col13   lvl  snum     tok    udid
0  2019-01-08 00:09:30   AdsClick   NaN   NaN   450   257    1212  122112
1  2019-01-08 00:43:21   AdsClick   NaN   NaN   902  1387    4214    3123
2  2019-02-08 00:05:01        NaN  2416  2416  1415  2416  213123  214124
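
To check the speed difference on your own machine, here is a rough timing sketch. The synthetic rows, the row count, and the helper names via_apply and via_constructor are assumptions for illustration; actual timings depend on the data and the pandas version, so no numbers are quoted.

```python
import json
import timeit
import pandas as pd

# Synthetic data (an assumption for illustration, not the asker's real dataset).
rows = ['{"lvl":"%d","tok":"1212","snum":"257"}' % i for i in range(1000)]
df_bench = pd.DataFrame({"event_json": rows})

def via_apply(frame):
    # Approach from the question: one pd.Series per row, then merge on the index.
    cols = frame["event_json"].apply(lambda x: pd.Series(json.loads(x)))
    return frame.merge(cols, left_index=True, right_index=True)

def via_constructor(frame):
    # Approach from this answer: a list of dicts passed to the constructor once.
    return frame.join(pd.DataFrame([json.loads(x) for x in frame["event_json"]]))

t_apply = timeit.timeit(lambda: via_apply(df_bench), number=3)
t_ctor = timeit.timeit(lambda: via_constructor(df_bench), number=3)
print(f"apply + pd.Series: {t_apply:.3f}s, list comprehension: {t_ctor:.3f}s")
```

The likely reason for the gap is that the apply version constructs a separate pd.Series object for every row, while the list comprehension only builds plain dicts and leaves the column assembly to a single DataFrame call.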