Remove observations with fewer than X consecutive dates

Time: 2019-03-07 23:49:16

Tags: python pandas

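The data frame below contains data for several companies (column ID) on different dates (column date). I want to drop the observations that are not part of a run of at least 3 consecutive days.

The starting dataset is: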

import pandas as pd

df = pd.DataFrame({"ID":{"0":1,"1":1,"2":1,"3":1,"4":4,"5":4,"6":4,"7":2,"8":2,"9":3,"10":3},
    "date":{"0":1421020800000,"1":1421193600000,"2":1422489600000,"3":1423353600000,"4":1421020800000,"5":1421107200000,"6":1421193600000,"7":1421020800000,"8":1421107200000,"9":1421452800000,"10":1421539200000},
    "variable":{"0":28,"1":62,"2":60,"3":72,"4":28,"5":61,"6":62,"7":23,"8":70,"9":32,"10":55}})
df.date = pd.to_datetime(df.date, unit='ms')
df.sort_values(by=["ID", "date"],inplace=True)

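In the data frame above, only the company with ID = 4 meets the requirement; I want to drop the other companies.

I wrote the following code, but it has an obvious problem that I don't know how to solve: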

df['delete'] = 0
for name, group in df.groupby(by = "ID"):
    if group.shape[0] < 3:
        df.loc[df['ID']==name,'delete'] = 1
df = df.loc[df['delete'] == 0,:]

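The code above keeps two companies, ID = 1 and ID = 4. ID = 1 should be dropped: it has 4 data points, but at most two of them fall on consecutive days, and I want to require at least 3.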
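The row count per ID says nothing about how the dates are spaced. As a minimal sketch of how to inspect that spacing, the snippet below rebuilds the same observations (with the millisecond timestamps written out as calendar dates) and computes the gap to the previous row of the same ID. The column name `gap_days` is an assumption introduced only for this illustration.

```python
import pandas as pd

# Same observations as the starting dataset, with dates written out as calendar days.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 4, 4, 4, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2015-01-12", "2015-01-14", "2015-01-29", "2015-02-08",
        "2015-01-12", "2015-01-13", "2015-01-14",
        "2015-01-12", "2015-01-13",
        "2015-01-17", "2015-01-18",
    ]),
    "variable": [28, 62, 60, 72, 28, 61, 62, 23, 70, 32, 55],
}).sort_values(["ID", "date"])

# gap_days (hypothetical column name): days since the previous row of the same ID.
# The first row of each ID has no predecessor, so its gap is NaN.
df["gap_days"] = df.groupby("ID")["date"].diff().dt.days
print(df)
```

For ID = 1 the gaps are 2, 15 and 10 days, so its four rows never form a run of three consecutive days, while ID = 4 has two one-day gaps in a row.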

Any help would be much appreciated. Thanks.

3 answers:

Answer 0: (score: 0)

IIUC, use diff + cumsum on the date column to build a group key New, then use groupby + filter to drop the groups you don't want:

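# group_keys=False keeps the original row index, so the result can be assigned back as a column
df['New']=df.groupby('ID', group_keys=False).date.apply(lambda x : x.diff().dt.days.ne(1).cumsum())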
yourdf=df.groupby(['ID','New']).filter(lambda x : len(x)>=3)
yourdf
Out[809]: 
   ID       date  variable  New
4   4 2015-01-12        28    1
5   4 2015-01-13        61    1
6   4 2015-01-14        62    1
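The diff + cumsum idea can also be wrapped in a small reusable function. The following is only a sketch, not the answerer's code: the function name `keep_long_runs` and its parameters are hypothetical, and it assumes at most one row per ID per calendar day (a repeated date would break a run, because its gap is 0 days rather than 1).

```python
import pandas as pd

def keep_long_runs(df, min_days=3, id_col="ID", date_col="date"):
    """Keep only rows that belong to a run of at least `min_days` consecutive days."""
    out = df.sort_values([id_col, date_col])
    # Gap to the previous row of the same ID (NaT for the first row of each ID).
    gap = out.groupby(id_col)[date_col].diff()
    # A new run starts wherever the gap is not exactly one day; NaT also counts
    # as "not one day", so every ID starts its own run and run labels are unique.
    run = gap.ne(pd.Timedelta(days=1)).cumsum().rename("run")
    # Number of rows in the run each row belongs to.
    run_len = out.groupby(run)[date_col].transform("size")
    return out[run_len >= min_days]

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 4, 4, 4, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2015-01-12", "2015-01-14", "2015-01-29", "2015-02-08",
        "2015-01-12", "2015-01-13", "2015-01-14",
        "2015-01-12", "2015-01-13",
        "2015-01-17", "2015-01-18",
    ]),
    "variable": [28, 62, 60, 72, 28, 61, 62, 23, 70, 32, 55],
})

kept = keep_long_runs(df, min_days=3)
print(kept)
```

This keeps the qualifying runs themselves. If the goal is instead to keep every row of any company that has at least one such run, filtering the original frame with `df[df["ID"].isin(kept["ID"].unique())]` should do it.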

Answer 1: (score: 0)

I think you can replace "group.shape[0]" with a 3-day rolling window and count the items in it.

df = pd.DataFrame({"ID":{"0":1,"1":1,"2":1,"3":1,"4":4,"5":4,"6":4,"7":2,"8":2,"9":3,"10":3},
    "date":{"0":1421020800000,"1":1421193600000,"2":1422489600000,"3":1423353600000,"4":1421020800000,"5":1421107200000,"6":1421193600000,"7":1421020800000,"8":1421107200000,"9":1421452800000,"10":1421539200000},
    "variable":{"0":28,"1":62,"2":60,"3":72,"4":28,"5":61,"6":62,"7":23,"8":70,"9":32,"10":55}})
df.date = pd.to_datetime(df.date, unit='ms')
df.sort_values(by=["ID", "date"],inplace=True)

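df['delete'] = 0
for name, group in df.groupby(by = "ID"):
    # Index by date so a time-based window can be used; avoid mutating the group in place
    group = group.set_index('date')

    # Largest number of rows seen inside any 3-day window for this ID
    if group.rolling(window='3D',min_periods=0).count()['delete'].max() < 3:
        df.loc[df['ID']==name,'delete'] = 1
df = df.loc[df['delete'] == 0,:]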
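The same rolling-window check can be written without the helper column and the explicit loop. This is a sketch under stated assumptions, not the answerer's code: the helper name `max_obs_in_window` is hypothetical, and counting 3 rows inside a 3-day window equals "3 consecutive days" only if each ID has at most one row per day and all timestamps sit at the same time of day, as they do in this dataset.

```python
import pandas as pd

def max_obs_in_window(dates, window="3D"):
    """Largest number of observations that fall inside any trailing time window."""
    ones = pd.Series(1.0, index=pd.DatetimeIndex(dates.sort_values()))
    # A time-based rolling sum of ones counts the rows inside each window (t - 3 days, t].
    return ones.rolling(window).sum().max()

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 4, 4, 4, 2, 2, 3, 3],
    "date": pd.to_datetime([
        "2015-01-12", "2015-01-14", "2015-01-29", "2015-02-08",
        "2015-01-12", "2015-01-13", "2015-01-14",
        "2015-01-12", "2015-01-13",
        "2015-01-17", "2015-01-18",
    ]),
    "variable": [28, 62, 60, 72, 28, 61, 62, 23, 70, 32, 55],
}).sort_values(["ID", "date"])

# One number per ID: the densest 3-day window found for that company.
best = df.groupby("ID")["date"].apply(max_obs_in_window)
keep_ids = best[best >= 3].index
result = df[df["ID"].isin(keep_ids)]
print(result)
```

Unlike the diff + cumsum approach, this keeps or drops whole companies: every row of an ID survives as soon as one of its 3-day windows holds three observations.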

Answer 2: (score: 0)

df['delete'] = 0
for name, group in df.groupby(by = "ID"):
    if group.shape[0] != 3:
        df.loc[df['ID']==name,'delete'] = 1
df = df.loc[df['delete'] == 0,:]

The mistake is probably in the condition; try if group.shape[0] != 3 instead.