Remove duplicates on one field based on a priority list for another field in pandas

Time: 2016-11-29 14:19:07

Tags: pandas

I have a large amount of data and am trying to remove duplicates based on 2 fields. Sample set:

WOE_ID    ISO  Locationname  Language  Placetype  Parent_ID  ID  Username
2347578   US   Maine         ENG       State      23424977   1   sampleuser
2444322   US   Maine         ENG       Town       12588275   1   sampleuser
2444324   US   Maine         ENG       Town       12588852   1   sampleuser
2444326   US   Maine         ENG       POI        12589403   1   sampleuser
2444327   US   Maine         ENG       Town       12587582   1   sampleuser
2444325   US   Maine         ENG       Country    12589315   1   sampleuser
28744443  US   Maine         ENG       Town       12590578   1   sampleuser
2444323   US   Maine         ENG       Town       2374968    1   sampleuser

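Since these rows are all duplicates for the same ID (1), I want to keep only the entry with the highest-ranked Placetype (here Country, with Country > State > Town > POI). Is there a simple way to do this that I'm overlooking, or do I have to write a loop that compares all entries? I'd rather not, because the full database has more than 3 million entries and I may have to run this several times.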

Thanks in advance!

1 answer:

Answer 0: (score: 4)

I think you can use an ordered Categorical, sort the DataFrame by the Placetype column with sort_values, and then aggregate with groupby and first:

    print (df)
         WOE_ID ISO Locationname Language Placetype  Parent_ID  ID    Username
    0   2347578  US        Maine      ENG     State   23424977   1  sampleuser
    1   2444322  US        Maine      ENG      Town   12588275   1  sampleuser
    2   2444324  US        Maine      ENG      Town   12588852   1  sampleuser
    3   2444326  US        Maine      ENG       POI   12589403   2  sampleuser
    4   2444327  US        Maine      ENG      Town   12587582   2  sampleuser
    5   2444325  US        Maine      ENG   Country   12589315   3  sampleuser
    6  28744443  US        Maine      ENG      Town   12590578   3  sampleuser
    7   2444323  US        Maine      ENG      Town    2374968   3  sampleuser

    import pandas as pd

    # Ordered categorical: categories are listed from highest to lowest priority,
    # so an ascending sort puts the preferred Placetype first.
    placetype = pd.CategoricalDtype(categories=['Country','State','Town','POI'],
                                    ordered=True)
    df.Placetype = df.Placetype.astype(placetype)

    # Sort by priority, then take the first row of each ID group.
    df = df.sort_values('Placetype').groupby('ID', as_index=False).first()
    print (df)
       ID   WOE_ID ISO Locationname Language Placetype  Parent_ID    Username
    0   1  2347578  US        Maine      ENG     State   23424977  sampleuser
    1   2  2444327  US        Maine      ENG      Town   12587582  sampleuser
    2   3  2444325  US        Maine      ENG   Country   12589315  sampleuser
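The same idea can be sketched as a small self-contained script. This is only an illustration under a few assumptions: the helper name `keep_highest_priority` and its parameters are made up for this example, the frame is cut down to three columns from the sample above, and `drop_duplicates` is swapped in for `groupby(...).first()` so that each surviving row is kept whole (`first()` picks the first non-null value per column, which can mix rows when the data contains NaN).

```python
import pandas as pd


def keep_highest_priority(df, key, column, priority):
    """Keep one row per `key`, preferring earlier entries of `priority`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Ordered dtype: position in `priority` defines the sort order.
    dtype = pd.CategoricalDtype(categories=priority, ordered=True)
    ranked = df.assign(**{column: df[column].astype(dtype)})
    # A stable sort keeps the original order among rows with equal priority.
    ranked = ranked.sort_values(column, kind="mergesort")
    # The first row per key is now the highest-priority one.
    deduped = ranked.drop_duplicates(subset=key, keep="first")
    return deduped.sort_values(key).reset_index(drop=True)


# Reduced version of the sample data from the answer above.
df = pd.DataFrame({
    "WOE_ID": [2347578, 2444322, 2444324, 2444326,
               2444327, 2444325, 28744443, 2444323],
    "Placetype": ["State", "Town", "Town", "POI",
                  "Town", "Country", "Town", "Town"],
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
})

result = keep_highest_priority(
    df, key="ID", column="Placetype",
    priority=["Country", "State", "Town", "POI"],
)
print(result)
```

Values of `Placetype` that are missing from the priority list become NaN after the cast and sort last, so they are kept only when an ID has no ranked row at all.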
