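I have a dataframe like the one below, and I want to remove duplicates based on two criteria. 1) If Startdate is later than Month, the row is removed. 2) If Startdate is earlier than Month, keep only the most recent record.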
> COMP Month Startdate bundle result
> 0 TD3M 2018-03-01 2015-08-28 01_Essential keep
> 1 TD3M 2018-03-01 2018-07-17 04_Complete remove
> 2 TD3M 2018-04-01 2015-08-28 01_Essential keep
> 3 TD3M 2018-04-01 2018-07-17 04_Complete remove
> 4 TD3M 2018-05-01 2015-08-28 01_Essential keep
> 5 TD3M 2018-05-01 2018-07-17 04_Complete remove
> 6 TD3M 2018-06-01 2015-08-28 01_Essential keep
> 7 TD3M 2018-06-01 2018-07-17 04_Complete remove
> 8 TD3M 2018-08-01 2015-08-28 01_Essential remove
> 9 TD3M 2018-08-01 2018-07-17 04_Complete keep
> 10 TD3M 2018-09-01 2015-08-28 01_Essential remove
> 11 TD3M 2018-09-01 2018-07-17 04_Complete keep
The expected output is:
> COMP Month Startdate bundle
> 0 TD3M 2018-03-01 2015-08-28 01_Essential
> 2 TD3M 2018-04-01 2015-08-28 01_Essential
> 4 TD3M 2018-05-01 2015-08-28 01_Essential
> 6 TD3M 2018-06-01 2015-08-28 01_Essential
> 9 TD3M 2018-08-01 2018-07-17 04_Complete
> 11 TD3M 2018-09-01 2018-07-17 04_Complete
Answer 0 (score: 1)
First, I drop your "result" column:
df = df.drop(columns='result')
Next, make sure your "Month" and "Startdate" columns are in datetime format:
df.Month = pd.to_datetime(df.Month)
df.Startdate = pd.to_datetime(df.Startdate)
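Then filter and group by, aggregating with max. Note that max() is computed per column, so this gives the right rows here only because "bundle" happens to sort in the same order as "Startdate":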
df = df[df.Startdate <= df.Month]
df.groupby(['COMP', 'Month'], as_index=False).max()
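If you would rather keep the whole row with the latest Startdate instead of relying on a column-wise max, one option is idxmax. This is only a minimal sketch: the sample frame is rebuilt by hand from the table in the question, and the names `valid`, `latest_idx` and `kept` are mine, not from the original post.

```python
import pandas as pd

# Sample data rebuilt from the question (the 'result' column is left out).
months = ["2018-03-01", "2018-04-01", "2018-05-01",
          "2018-06-01", "2018-08-01", "2018-09-01"]
rows = []
for m in months:
    rows.append(("TD3M", m, "2015-08-28", "01_Essential"))
    rows.append(("TD3M", m, "2018-07-17", "04_Complete"))
df = pd.DataFrame(rows, columns=["COMP", "Month", "Startdate", "bundle"])
df["Month"] = pd.to_datetime(df["Month"])
df["Startdate"] = pd.to_datetime(df["Startdate"])

# Rule 1: drop rows whose Startdate is later than Month.
valid = df[df["Startdate"] <= df["Month"]]

# Rule 2: for each (COMP, Month) group, take the index label of the row
# with the latest Startdate, then select those complete rows.
latest_idx = valid.groupby(["COMP", "Month"])["Startdate"].idxmax()
kept = valid.loc[latest_idx]
print(kept)
```

Because the rows are selected by index label, "bundle" always comes from the same row as the chosen Startdate, and the original index is preserved.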
Answer 1 (score: 0)
Here is an approach using sort_values and drop_duplicates:
df.query('Startdate<=Month').sort_values('Startdate').drop_duplicates('Month',keep='last')
Out[892]:
COMP Month Startdate bundle result
0 TD3M 2018-03-01 2015-08-28 01_Essential keep
2 TD3M 2018-04-01 2015-08-28 01_Essential keep
4 TD3M 2018-05-01 2015-08-28 01_Essential keep
6 TD3M 2018-06-01 2015-08-28 01_Essential keep
9 TD3M 2018-08-01 2018-07-17 04_Complete keep
11 TD3M 2018-09-01 2018-07-17 04_Complete keep
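One thing to watch: drop_duplicates('Month') only looks at Month, which is fine for the sample because there is a single COMP value. If the real data has several companies, deduplicating on both columns is probably what you want. A rough sketch, assuming a second, made-up company "XYZ" that is not in the question:

```python
import pandas as pd

# Small frame; the "XYZ" rows are hypothetical and only there to
# show the effect of deduplicating on Month alone.
df = pd.DataFrame({
    "COMP": ["TD3M", "TD3M", "TD3M", "TD3M", "XYZ", "XYZ"],
    "Month": ["2018-06-01", "2018-06-01", "2018-08-01",
              "2018-08-01", "2018-08-01", "2018-08-01"],
    "Startdate": ["2015-08-28", "2018-07-17", "2015-08-28",
                  "2018-07-17", "2016-01-01", "2018-09-15"],
    "bundle": ["01_Essential", "04_Complete", "01_Essential",
               "04_Complete", "01_Essential", "04_Complete"],
})
df["Month"] = pd.to_datetime(df["Month"])
df["Startdate"] = pd.to_datetime(df["Startdate"])

valid = df.query("Startdate <= Month")

# Deduplicate per (COMP, Month): the latest Startdate sorts last in each group.
per_company = (
    valid.sort_values(["COMP", "Month", "Startdate"])
         .drop_duplicates(subset=["COMP", "Month"], keep="last")
)

# Deduplicating on Month alone drops the XYZ row for 2018-08-01.
month_only = valid.sort_values("Startdate").drop_duplicates("Month", keep="last")

print(per_company)
print(month_only)
```

With the single-company sample from the question both variants return the same rows, so this only matters once COMP takes more than one value.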