匹配和合并两个不一致的数据框

时间:2018-11-10 00:51:58

标签: python pandas

我要从两个具有共同特征但并不总是相同特征的独立数据库中查询两个数据帧,我需要找到一种将两者可靠地结合在一起的方法。

例如:

import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print (df)

   Age   Location Mothers Name       Name Occupation
0   12  Frankfurt         Rosy       Jose    Student
1   23       Maui          Amy  Katherine     Lawyer
2   22     Dallas       Monica      Larry      Nurse



inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)

  Favorite Hobby Mothers Name       Name Occupation
0     Basketball       Monica                 Nurse
1         Sewing         Rosy       Jose           
2        Reading               Katherine     Lawyer

我需要找出一种方法来可靠地连接这两个数据帧,而数据却始终不保持一致。为了使问题更加复杂,两个数据库的长度并不总是相同。有任何想法吗?

1 个答案:

答案 0 :(得分:0)

您可以在可能的列组合上执行合并,并合并这些df,然后在第一个(完整)df上合并新的df:

# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']), 
                    df.merge(df2, on=['Name', 'Occupation']),
                    df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)

# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()

    Age Location    Mothers Name    Name      Occupation    Favorite Hobby
0   12  Frankfurt      Rosy         Jose       Student        Sewing
2   23  Maui           Amy          Katherine  Lawyer         Reading
4   22  Dallas         Monica       Larry      Nurse          Basketball

这假设每一行的年龄和位置都是唯一的