I have a large dataset and I'm trying to remove duplicates based on 2 fields. Sample set:
WOE_ID ISO Locationname Language Placetype Parent_ID ID Username
2347578 US Maine ENG State 23424977 1 sampleuser
2444322 US Maine ENG Town 12588275 1 sampleuser
2444324 US Maine ENG Town 12588852 1 sampleuser
2444326 US Maine ENG POI 12589403 1 sampleuser
2444327 US Maine ENG Town 12587582 1 sampleuser
2444325 US Maine ENG Country 12589315 1 sampleuser
28744443 US Maine ENG Town 12590578 1 sampleuser
2444323 US Maine ENG Town 2374968 1 sampleuser
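Since these are all duplicates for the same ID (1), I want to keep only the entry with the largest Placetype (here Country, where Country > State > Town > POI). Is there a simple way to do this that I'm overlooking, or do I have to write a loop that compares all the entries? I'd rather not, since the full database has more than 3 million entries and I may have to run this several times.
Thanks in advance!
Answer 0 (score: 4)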
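I think you can convert Placetype to an ordered Categorical, sort the DataFrame by the Placetype column with sort_values, and then use groupby with the first aggregation to keep the top-ranked row for each ID:
print (df)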
WOE_ID ISO Locationname Language Placetype Parent_ID ID Username
0 2347578 US Maine ENG State 23424977 1 sampleuser
1 2444322 US Maine ENG Town 12588275 1 sampleuser
2 2444324 US Maine ENG Town 12588852 1 sampleuser
3 2444326 US Maine ENG POI 12589403 2 sampleuser
4 2444327 US Maine ENG Town 12587582 2 sampleuser
5 2444325 US Maine ENG Country 12589315 3 sampleuser
6 28744443 US Maine ENG Town 12590578 3 sampleuser
7 2444323 US Maine ENG Town 2374968 3 sampleuser
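import pandas as pd
# astype('category', categories=...) was deprecated in pandas 0.21; use an explicit ordered Categorical
df.Placetype = pd.Categorical(df.Placetype,
                              categories=['Country','State','Town','POI'],
                              ordered=True)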
df = df.sort_values('Placetype').groupby('ID', as_index=False).first()
print (df)
ID WOE_ID ISO Locationname Language Placetype Parent_ID Username
0 1 2347578 US Maine ENG State 23424977 sampleuser
1 2 2444327 US Maine ENG Town 12587582 sampleuser
2 3 2444325 US Maine ENG Country 12589315 sampleuser
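For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds a trimmed copy of the example frame (only the columns needed), and swaps groupby/first for sort_values plus drop_duplicates, which keeps whole rows instead of the first non-null value per column. Deduplicating on both ID and Username is my assumption about which "2 fields" the question means; adjust the subset as needed.

```python
import pandas as pd

# Sample rows copied from the example frame above (trimmed to the relevant columns).
df = pd.DataFrame({
    "WOE_ID": [2347578, 2444322, 2444324, 2444326, 2444327, 2444325, 28744443, 2444323],
    "Placetype": ["State", "Town", "Town", "POI", "Town", "Country", "Town", "Town"],
    "ID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Username": ["sampleuser"] * 8,
})

# Ordered categorical: the first category is the "largest" place type.
order = ["Country", "State", "Town", "POI"]
df["Placetype"] = pd.Categorical(df["Placetype"], categories=order, ordered=True)

# Stable sort (mergesort) so ties keep their original order,
# then keep the first row for each key.
# Assumption: the two dedup fields are ID and Username.
result = (
    df.sort_values("Placetype", kind="mergesort")
      .drop_duplicates(subset=["ID", "Username"], keep="first")
      .sort_values("ID")
      .reset_index(drop=True)
)
print(result)
```

Both steps are vectorized, so no explicit Python loop over the rows is needed, which matters when the table has millions of entries.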