How to split a pandas column containing a list of dicts into separate columns for each key

Time: 2021-01-07 23:42:20

Tags: python pandas json-normalize

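I'm analyzing political ads from Facebook, using a dataset released by ProPublica.

Here's what I mean. I have an entire column of targets that I want to analyze, but it is formatted in a way that is very hard to work with for someone at my skill level.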

This is from just 1 cell: [{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]

And another: [{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]

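What I need to do is separate every "target" item so that it becomes a column label, with each corresponding "segment" becoming a possible value in that column.

Or is the solution to write a function that looks up each dictionary key within each row in order to count frequencies?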

1 Answer:

Answer 0: (score: 2)

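  • The column contains lists of dicts.
    • Each dict in a list can be moved to a separate row by using pandas.DataFrame.explode().
    • Convert the column of dicts to a dataframe, where the keys are column headers and the values are observations, by using pandas.json_normalize(), then .join() the result back to df.
  • Drop the unneeded column with .drop().
  • If the column contains lists of dicts stored as strings (e.g. "[{key: value}]"), refer to the solution in Splitting dictionary/list inside a Pandas Column into Separate Columns, and use:
    • df.col2 = df.col2.apply(literal_eval), with from ast import literal_eval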
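For the string case in the last bullet, a minimal sketch of the conversion might look like the following. The one-row frame is a hypothetical stand-in for a column that was read from a file as text; it assumes each cell is a valid Python literal, which is what literal_eval parses.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample: col2 holds the list of dicts as a string
df = pd.DataFrame({
    "col1": ["x"],
    "col2": ['[{"target": "NAge", "segment": "21 and older"}, '
             '{"target": "MinAge", "segment": "21"}]'],
})

# Parse each string into an actual list of dicts
df["col2"] = df["col2"].apply(literal_eval)

# From here the same explode / json_normalize steps apply
df = df.explode("col2").reset_index(drop=True)
df = df.join(pd.json_normalize(df["col2"])).drop(columns=["col2"])
```

Without the literal_eval step, explode would leave each string untouched, since a string is not list-like.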

import pandas as pd

# create sample dataframe
df = pd.DataFrame({
    'col1': ['x', 'y'],
    'col2': [
        [{"target": "NAge", "segment": "21 and older"},
         {"target": "MinAge", "segment": "21"},
         {"target": "Retargeting", "segment": "people who may be similar to their customers"},
         {"target": "Region", "segment": "the United States"}],
        [{"target": "NAge", "segment": "18 and older"},
         {"target": "Location Type", "segment": "HOME"},
         {"target": "Interest", "segment": "Hispanic culture"},
         {"target": "Interest", "segment": "Republican Party (United States)"},
         {"target": "Location Granularity", "segment": "country"},
         {"target": "Country", "segment": "the United States"},
         {"target": "MinAge", "segment": 18}],
    ],
})

# display(df)
  col1  col2
0    x  [{'target': 'NAge', 'segment': '21 and older'}, {'target': 'MinAge', 'segment': '21'}, {'target': 'Retargeting', 'segment': 'people who may be similar to their customers'}, {'target': 'Region', 'segment': 'the United States'}]
1    y  [{'target': 'NAge', 'segment': '18 and older'}, {'target': 'Location Type', 'segment': 'HOME'}, {'target': 'Interest', 'segment': 'Hispanic culture'}, {'target': 'Interest', 'segment': 'Republican Party (United States)'}, {'target': 'Location Granularity', 'segment': 'country'}, {'target': 'Country', 'segment': 'the United States'}, {'target': 'MinAge', 'segment': 18}]

# use explode to give each dict in a list a separate row
df = df.explode('col2').reset_index(drop=True)

# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
df = df.join(pd.json_normalize(df.col2)).drop(columns=['col2'])

display(df)

    col1  target                segment
0   x     NAge                  21 and older
1   x     MinAge                21
2   x     Retargeting           people who may be similar to their customers
3   x     Region                the United States
4   y     NAge                  18 and older
5   y     Location Type         HOME
6   y     Interest              Hispanic culture
7   y     Interest              Republican Party (United States)
8   y     Location Granularity  country
9   y     Country               the United States
10  y     MinAge                18
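The question also asks for each "target" to become a column label. That is not part of the answer above, but one possible sketch is to group the long-format result and unstack it. The names long_df and wide_df are hypothetical, and col1 is assumed to identify the original row uniquely. Because a target can repeat within one row (two "Interest" entries, for example), the segments are collected into lists.

```python
import pandas as pd

# Hypothetical long-format frame, shaped like the exploded result
long_df = pd.DataFrame({
    "col1": ["x", "x", "y", "y", "y"],
    "target": ["NAge", "Region", "NAge", "Interest", "Interest"],
    "segment": ["21 and older", "the United States", "18 and older",
                "Hispanic culture", "Republican Party (United States)"],
})

# Collect the segments of each (col1, target) pair into a list,
# then move the target level of the index into the columns
wide_df = (
    long_df.groupby(["col1", "target"])["segment"]
    .agg(list)
    .unstack("target")
)
```

Targets that do not occur for a given row end up as NaN in wide_df.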

Get count

  • If the goal is to get a count for each 'target' and its associated 'segment':
counts = df.groupby(['target', 'segment']).count()

Update

  • This update is implemented for the full file.
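For the frequency question itself, a variation on the groupby above can be sketched with size() and value_counts(). The small frame and the names long_df, pair_counts and target_counts are hypothetical. The cast to str is an assumption, added because the sample data mixes "21" and 18 in the same column.

```python
import pandas as pd

# Hypothetical long-format frame, shaped like the exploded result
long_df = pd.DataFrame({
    "col1": ["x", "x", "y", "y", "y"],
    "target": ["NAge", "MinAge", "NAge", "Interest", "Interest"],
    "segment": ["21 and older", "21", "18 and older",
                "Hispanic culture", "Republican Party (United States)"],
})

# Treat all segments as text so that "18" and 18 are counted together
long_df["segment"] = long_df["segment"].astype(str)

# Number of rows for each (target, segment) pair, as a flat table
pair_counts = (
    long_df.groupby(["target", "segment"])
    .size()
    .reset_index(name="count")
)

# How often each target occurs, regardless of segment
target_counts = long_df["target"].value_counts()
```

size() counts rows per group, whereas count() reports non-null values for each remaining column.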