Pandas: convert an unbalanced dict of nested dicts to a DataFrame

Time: 2018-06-27 16:03:12

Tags: python pandas dictionary

I parsed XML data into a dict. The dictionary has the following form:

{'id': 'Q1',
'subject': 'Massage oil',
'question': 'Where I can buy good oil for massage?',
'comments': {},
'related': {'Q1_R1': {'rid': 'Q1_R1',
'rel_subject': 'massage oil',
'rel_question': 'is there any place i can find scented massage oils in qatar?',
'rel_givenRelevance': 'PerfectMatch',
'rel_givenRank': '1',
'rel_comments': {'Q1_R1_C1': {'cid': 'Q1_R1_C1',
 'com_date': '2010-08-27 01:40:05',
 'com_username': 'anonymous',
 'comment': 'Yes. It is right behind Kahrama in the National area.',
 'com_isTraining': True},
'Q1_R1_C2': {'cid': 'Q1_R1_C2',
 'com_date': '2010-08-27 01:42:59',
 'com_username': 'sognabodl',
 'comment': 'whats the name of the shop?',
 'com_isTraining': True},
'Q1_R1_C3': {'cid': 'Q1_R1_C3',
 'com_date': '2010-08-27 01:44:09',
 'com_username': 'anonymous',
 'comment': "It's called Naseem Al-Nadir. Right next to the Smartlink shop. You'll find the chinese salesgirls at affordable prices there.",
 'com_isTraining': True},
'Q1_R1_C4': {'cid': 'Q1_R1_C4',
 'com_date': '2010-08-27 01:58:39',
 'com_username': 'sognabodl',
 'comment': 'dont want girls;want oil',
 'com_isTraining': True},
'Q1_R1_C5': {'cid': 'Q1_R1_C5',
 'com_date': '2010-08-27 01:59:55',
 'com_username': 'anonymous',
 'comment': "Try Both ;) I'am just trying to be helpful. On a serious note - Please go there. you'll find what you are looking for.",
 'com_isTraining': True},
'Q1_R1_C6': {'cid': 'Q1_R1_C6',
 'com_date': '2010-08-27 02:02:53',
 'com_username': 'lawa',
 'comment': 'you mean oil and filter both',
 'com_isTraining': True},
'Q1_R1_C7': {'cid': 'Q1_R1_C7',
 'com_date': '2010-08-27 02:04:29',
 'com_username': 'anonymous',
 'comment': "Yes Lawa...you couldn't be more right LOL",
 'com_isTraining': True}},
'rel_featureVector': [],
'rel_isTraining': True}},
'featureVector': [],
'isTraining': True}

In general, the structure looks like this:

 {ID     : Q1,
  ...
  related:{
          Q1_R1 :{
                rid:Q1_R1,
                ....
                rel_comments:{
                        Q1_R1_C1: {
                                cid: Q1_R1_C1,
                                ....
                                  }
                        ....
                        Q1_R1_C10
                              }
         ...
        Q1_R10
         }
 ...
 ID : 100   
 }

I want to turn it into:

  ID  ...  question rid    ...  rel_question   cid        .... comment
  Q1  ...  1234     Q1_R1  ...  5678         Q1_R1_c1     .... 90
  Q1  ...  1234     Q1_R1  ...  5678         Q1_R1_c2     .... 92
  Q1  ...  1234     Q1_R1  ...  5678         Q1_R1_c3     .... 93
      ..........................................
  Q100 ... 1234   Q100_R10  ... 5678         Q100_R10_c13  ....465

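I tried to flatten this dictionary, but I end up with the rid values (Q1_R1 ... Q100_R10) and the cid values (Q1_R1_c1 ... Q100_R10_c13) as columns instead of as row values. Is there any way to get the layout above?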
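To illustrate the symptom, here is a minimal sketch on a cut-down copy of the dict. It assumes a generic flattener, `pd.json_normalize`; the question does not say which flattening approach was actually tried. Every nested key becomes part of a column name, so the rid and cid keys land in the header and the result is a single wide row.

```python
import pandas as pd

# Cut-down version of the dict from the question (two comments only).
mydict = {
    "id": "Q1",
    "question": "Where I can buy good oil for massage?",
    "related": {
        "Q1_R1": {
            "rid": "Q1_R1",
            "rel_comments": {
                "Q1_R1_C1": {
                    "cid": "Q1_R1_C1",
                    "comment": "Yes. It is right behind Kahrama in the National area.",
                },
                "Q1_R1_C2": {
                    "cid": "Q1_R1_C2",
                    "comment": "whats the name of the shop?",
                },
            },
        }
    },
}

# Generic flattening: nested keys are joined with "." into column names.
flat = pd.json_normalize(mydict)

# The rid and cid keys show up inside the column names, not as row values.
print(flat.shape)
print(flat.columns.tolist())
```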

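This is the SemEval 2016 subtask 1 data. I think that using DataFrame functions such as apply could improve performance, for example when computing how similar the Q1 question is to the Q1_R1_C1 comment.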
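As a rough sketch of that intended use, assuming the data is already in the long format shown above: the `jaccard` function below is a hypothetical token-overlap measure chosen only for illustration, not something prescribed by the task. Note that `apply` with `axis=1` still calls the Python function once per row, so it is mostly a convenience rather than a guaranteed speedup.

```python
import pandas as pd


def jaccard(a, b):
    """Hypothetical similarity: token overlap between two strings, in [0, 1]."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Two rows in the desired long format (one row per comment).
df = pd.DataFrame({
    "id": ["Q1", "Q1"],
    "question": ["Where I can buy good oil for massage?"] * 2,
    "cid": ["Q1_R1_C1", "Q1_R1_C4"],
    "comment": [
        "Yes. It is right behind Kahrama in the National area.",
        "dont want girls;want oil",
    ],
})

# Row-wise comparison of each question with its comment.
df["similarity"] = df.apply(
    lambda row: jaccard(row["question"], row["comment"]), axis=1
)
print(df[["cid", "similarity"]])
```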

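1 Answer:

Answer 0: (score: 0)

You have to walk the structure of the dictionary and build another dictionary with the right shape, so that pandas can make the desired DataFrame from it. The code here handles only some of the columns, but you should get the idea: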

import pandas as pd

# mydict is the parsed dictionary shown in the question.
# Each key is a column name; each list collects that column's values.
df_dict = {
    'id': [],
    'subject': [],
    'question': [],
    'rid': [],
    'rel_question': [],
    'cid': [],
    'comment': []
}
# Emit one row per (related question, comment) pair.
# Related questions without comments produce no rows.
for rid, rel in mydict['related'].items():
    for cid, com in rel['rel_comments'].items():
        df_dict['id'].append(mydict['id'])
        df_dict['subject'].append(mydict['subject'])
        df_dict['question'].append(mydict['question'])
        df_dict['rid'].append(rid)
        df_dict['rel_question'].append(rel['rel_question'])
        df_dict['cid'].append(cid)
        df_dict['comment'].append(com['comment'])

df = pd.DataFrame(df_dict)
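
The same idea can be extended to all columns and to many questions. The following is only a sketch under stated assumptions: that the full data is a list of dicts shaped like the one in the question (the list `questions` and the helpers `is_scalar` and `question_to_rows` are hypothetical names), and that the key names at the three levels do not collide, which holds for the `rel_` and `com_` prefixes shown above. It copies every non-nested field at each level into the row.

```python
import pandas as pd


def is_scalar(value):
    """True for values that fit in one cell (anything but a dict or list)."""
    return not isinstance(value, (dict, list))


def question_to_rows(question):
    """Yield one flat row per (question, related question, comment) triple."""
    q_fields = {k: v for k, v in question.items() if is_scalar(v)}
    for rel in question.get("related", {}).values():
        rel_fields = {k: v for k, v in rel.items() if is_scalar(v)}
        for com in rel.get("rel_comments", {}).values():
            com_fields = {k: v for k, v in com.items() if is_scalar(v)}
            # Later levels would overwrite earlier ones on a name clash.
            yield {**q_fields, **rel_fields, **com_fields}


# Assumed container: a list with one dict per question (cut down here).
questions = [
    {
        "id": "Q1",
        "subject": "Massage oil",
        "question": "Where I can buy good oil for massage?",
        "comments": {},
        "related": {
            "Q1_R1": {
                "rid": "Q1_R1",
                "rel_question": "is there any place i can find scented massage oils in qatar?",
                "rel_comments": {
                    "Q1_R1_C1": {
                        "cid": "Q1_R1_C1",
                        "comment": "Yes. It is right behind Kahrama in the National area.",
                    },
                    "Q1_R1_C2": {
                        "cid": "Q1_R1_C2",
                        "comment": "whats the name of the shop?",
                    },
                },
                "rel_isTraining": True,
            }
        },
        "isTraining": True,
    }
]

rows = [row for q in questions for row in question_to_rows(q)]
df = pd.DataFrame(rows)
print(df[["id", "rid", "cid", "comment"]])
```

Building a list of row dicts and calling `pd.DataFrame` once avoids having to list every column by hand; nested and list-valued fields such as `featureVector` are skipped in this sketch.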