Remove outliers based on two columns in a data frame

Time: 2019-02-13 13:29:21

Tags: python pandas dataframe

I have a data frame that looks like this:

Year Month Equipment   Weight
2017 1     TennisBall  5
2017 1     Football    4
2017 1     TennisBall  6
2017 1     TennisBall  7
2017 1     TennisBall  300
2017 2     TennisBall  300
2018 2     TennisBall  250
2018 2     Football    5
2018 2     TennisBall  6
2018 2     TennisBall  275
...

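In the example above, we normally ship around 300 units of tennis balls only in February, which makes the 6-unit order the outlier there, while in January the normal quantity is ~5, so the larger order is the outlier for that month. I would like to remove outliers based on the weight for each month. Is there a simple way to do this? I know I can do something along these lines: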

df1[np.abs(df1.Weight-df1.Weight.mean()) <= (5*df1.Weight.std())]

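to get anything whose weight is within 5 standard deviations of the mean, but this does not take the per-month part into account, and I can see that the weights change dramatically depending on the month. Thanks!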
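To make the limitation concrete, here is a minimal sketch that rebuilds the ten sample rows shown above (the rows hidden behind the `...` are not reproduced, so this is an assumption about the data, not the full frame) and applies the global filter. On this sample the overall standard deviation is so large that nothing is removed at all:

```python
import numpy as np
import pandas as pd

# The ten visible sample rows from the question; elided rows are not included.
df1 = pd.DataFrame({
    "Year": [2017] * 6 + [2018] * 4,
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "TennisBall", "Football",
                  "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
})

# Global filter: one mean and one standard deviation for the whole column.
global_mask = np.abs(df1.Weight - df1.Weight.mean()) <= 5 * df1.Weight.std()
filtered = df1[global_mask]

# Compare row counts before and after filtering.
print(len(df1), len(filtered))
```

Because the mean and standard deviation are computed over every month at once, the 300 in January and the 6 in February both sit comfortably inside the allowed band.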

Edit: For example, the desired output would look like this:

Year Month Equipment   Weight
2017 1     TennisBall  5
2017 1     Football    4
2017 1     TennisBall  6
2017 1     TennisBall  7

2017 2     TennisBall  300
2018 2     TennisBall  250
2018 2     Football    5

2018 2     TennisBall  275
...

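The outlier of 300 in January is removed (it is higher than normal for January), and the outlier of 6 in February is removed (it would be normal in January, but it is not normal for February).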

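1 answer:

Answer 0: (score: 1)

This is a groupby problem. You can solve it by creating two new columns, one holding each row's deviation from its group mean and one holding the group standard deviation, and then filtering on those columns: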

import numpy as np

# Difference between each Weight and the mean of its (Year, Month, Equipment) group
df['Weight diff'] = df['Weight'].sub(df.groupby(['Year','Month','Equipment'])['Weight'].transform('mean'))
# Standard deviation of each group (NaN for single-row groups)
df['std'] = df.groupby(['Year','Month','Equipment'])['Weight'].transform('std')

# Keep rows within one standard deviation of their group mean;
# the isnull() branch keeps single-row groups, whose std is NaN
df = df.loc[(np.abs(df['Weight diff']) <= df['std']) | (df['std'].isnull())]

# Drop the helper columns
df = df.drop(['Weight diff', 'std'], axis=1)

>>> print(df)

0   Year Month   Equipment  Weight
1   2017     1  TennisBall       5
2   2017     1    Football       4
3   2017     1  TennisBall       6
4   2017     1  TennisBall       7
6   2017     2  TennisBall     300
7   2018     2  TennisBall     250
8   2018     2    Football       5
10  2018     2  TennisBall     275
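
The same idea can be wrapped in a small helper that leaves the input frame untouched. This is only a sketch: the function name `drop_group_outliers` and its `keys`, `col`, and `n_std` parameters are hypothetical names introduced here, and the sample frame contains only the ten rows visible in the question, indexed from 1 to match the output above.

```python
import pandas as pd

def drop_group_outliers(df, keys, col="Weight", n_std=1.0):
    """Keep rows whose `col` is within n_std group standard deviations of the group mean."""
    grouped = df.groupby(keys)[col]
    # Absolute deviation of each row from its own group's mean
    diff = (df[col] - grouped.transform("mean")).abs()
    # Sample standard deviation per group; NaN when a group has a single row
    std = grouped.transform("std")
    # Single-row groups cannot be judged, so they are kept
    keep = (diff <= n_std * std) | std.isna()
    return df.loc[keep]

# The ten visible sample rows, indexed 1..10 (an assumption matching the printed output).
df = pd.DataFrame({
    "Year": [2017] * 6 + [2018] * 4,
    "Month": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Equipment": ["TennisBall", "Football", "TennisBall", "TennisBall",
                  "TennisBall", "TennisBall", "TennisBall", "Football",
                  "TennisBall", "TennisBall"],
    "Weight": [5, 4, 6, 7, 300, 300, 250, 5, 6, 275],
}, index=range(1, 11))

result = drop_group_outliers(df, ["Year", "Month", "Equipment"])
print(result.index.tolist())
```

Note that the threshold here is one standard deviation, as in the answer, not the five used in the question; with groups this small a 5-sigma band would keep every row. Grouping by `Equipment` as well as `Year` and `Month` also matters: grouping by `Month` alone would flag the February football order of 5, which the desired output keeps.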