groupby - select only certain groups

Time: 2018-02-05 21:19:58

Tags: python pandas pandas-groupby

I have the following DataFrame, and I want to select the services that have fewer than 2 "Healthy" instances. In this case, I want the series (EmailService, UserService, NotificationService).

              CPU              Service  Memory   Status
IP                                                     
10.22.11.150   13       StorageService      55  Healthy
10.22.11.90    23       StorageService      19  Healthy
10.22.11.91    10         EmailService      44  Healthy
10.22.11.92    69          UserService       1  Healthy
10.22.11.93    63  NotificationService      81  Healthy
10.22.11.93    87  NotificationService      98  Unhealthy
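To reproduce the frame above, something like the following should work. It is a minimal sketch that assumes IP is the index (as the printout suggests) and reuses the name servers_df from the grouping attempt in this question.

```python
import pandas as pd

# Rebuild the example data; the IP addresses become the index.
servers_df = pd.DataFrame(
    {
        "CPU": [13, 23, 10, 69, 63, 87],
        "Service": [
            "StorageService", "StorageService", "EmailService",
            "UserService", "NotificationService", "NotificationService",
        ],
        "Memory": [55, 19, 44, 1, 81, 98],
        "Status": [
            "Healthy", "Healthy", "Healthy",
            "Healthy", "Healthy", "Unhealthy",
        ],
    },
    index=pd.Index(
        [
            "10.22.11.150", "10.22.11.90", "10.22.11.91",
            "10.22.11.92", "10.22.11.93", "10.22.11.93",
        ],
        name="IP",
    ),
)
print(servers_df)
```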

I think I need this grouping,

grouped = servers_df.groupby('Service')

but I am not sure how to count the "Status" column and then get results based on that count.

3 answers:

Answer 0: (score: 3)

Use transform with a lambda function to count the Healthy values in each group and compare the count against 2, then filter by boolean indexing:

df = df[df.groupby('Service')['Status'].transform(lambda x: (x=='Healthy').sum() < 2)]
print(df)
             CPU              Service  Memory     Status
IP                                                      
10.22.11.91   10         EmailService      44    Healthy
10.22.11.92   69          UserService       1    Healthy
10.22.11.93   63  NotificationService      81    Healthy
10.22.11.93   87  NotificationService      98  Unhealthy
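To see why this works, it can help to look at the intermediate result: transform returns one value per original row, so the per-group count lines up with the frame and can be used directly as a mask. The sketch below is an illustration, not the code above: it uses a reduced frame with only the Service and Status columns, and the names healthy_count, mask and service_names are made up for the example. It also pulls out just the service names, which is what the question asks for.

```python
import pandas as pd

df = pd.DataFrame({
    "Service": [
        "StorageService", "StorageService", "EmailService",
        "UserService", "NotificationService", "NotificationService",
    ],
    "Status": [
        "Healthy", "Healthy", "Healthy",
        "Healthy", "Healthy", "Unhealthy",
    ],
})

# 1 for Healthy rows, 0 otherwise; summing per service and broadcasting
# the sum back gives every row its service's Healthy count.
is_healthy = df["Status"].eq("Healthy").astype(int)
healthy_count = is_healthy.groupby(df["Service"]).transform("sum")

# True for every row whose service has fewer than 2 Healthy instances.
mask = healthy_count < 2

# Only the names of the matching services, in order of first appearance.
service_names = df.loc[mask, "Service"].unique()
print(service_names)
```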

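If you only need to check for a single Healthy value per group, use duplicated with keep=False to flag all duplicates, chain it with a condition that compares against Healthy (so that multiple Unhealthy rows are not flagged), then invert the condition with ~ and filter again by boolean indexing.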
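The description above might be sketched as follows. This is an assumption about what a duplicated-based variant could look like, not the answerer's original code: the column subset passed to duplicated and the names multi_healthy and crowded are illustrative, and the frame is reduced to the two columns that matter.

```python
import pandas as pd

df = pd.DataFrame({
    "Service": [
        "StorageService", "StorageService", "EmailService",
        "UserService", "NotificationService", "NotificationService",
    ],
    "Status": [
        "Healthy", "Healthy", "Healthy",
        "Healthy", "Healthy", "Unhealthy",
    ],
})

# keep=False marks every member of a duplicated (Service, Status) pair.
# Combined with Status == "Healthy", this flags rows of services that have
# two or more Healthy instances; repeated Unhealthy rows are not flagged.
multi_healthy = (
    df.duplicated(["Service", "Status"], keep=False)
    & df["Status"].eq("Healthy")
)

# Services owning at least one flagged row have 2+ Healthy instances.
crowded = df.loc[multi_healthy, "Service"].unique()

# Invert with ~ to keep the services with fewer than 2 Healthy instances.
result = df[~df["Service"].isin(crowded)]
print(result)
```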

Answer 1: (score: 1)

You can also use filter:

df.groupby("Service").filter(lambda x: len(x[x.Status == "Healthy"]) < 2)
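For readers unfamiliar with filter: the function is called once per group with that group's sub-DataFrame and must return a single boolean; groups that return True are kept whole. A minimal sketch with the lambda spelled out as a named function (fewer_than_two_healthy is a made-up name, and the frame is reduced to two columns):

```python
import pandas as pd

df = pd.DataFrame({
    "Service": [
        "StorageService", "StorageService", "EmailService",
        "UserService", "NotificationService", "NotificationService",
    ],
    "Status": [
        "Healthy", "Healthy", "Healthy",
        "Healthy", "Healthy", "Unhealthy",
    ],
})


def fewer_than_two_healthy(group: pd.DataFrame) -> bool:
    """Return True when the group has fewer than 2 Healthy rows."""
    return int((group["Status"] == "Healthy").sum()) < 2


# All rows of every group for which the predicate returned True.
kept = df.groupby("Service").filter(fewer_than_two_healthy)
print(kept)
```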

According to jezrael's experiment in this answer, this may be somewhat slower.

Another way: use apply (adapted from jezrael's transform solution)

df.groupby('Service').apply(
                   lambda x: x if (x.Status == 'Healthy').sum() < 2 else None)


                        IP         CPU  Service              Memory Status
Service                     
EmailService        2   10.22.11.91 10  EmailService         44 Healthy
NotificationService 4   10.22.11.93 63  NotificationService  81 Healthy
                    5   10.22.11.93 87  NotificationService  98 Unhealthy
UserService         3   10.22.11.92 69  UserService          1  Healthy

Answer 2: (score: 1)

IIUC

s=df[df.Status=='Healthy'].groupby('Service').Service.count().lt(2)
df.loc[df.Service.isin(s[s].index)]

    IP          CPU Service             Memory  Status
2   10.22.11.91 10  EmailService        44      Healthy
3   10.22.11.92 69  UserService         1       Healthy
4   10.22.11.93 63  NotificationService 81      Healthy
5   10.22.11.93 87  NotificationService 98      Unhealthy
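One thing to be aware of: because the Unhealthy rows are dropped before grouping, a service with no Healthy instance at all never appears in s and is therefore not selected. That does not matter for the data in the question, but a sketch with a hypothetical CacheService that has only an Unhealthy row shows the difference, along with a variant that counts on the full frame instead (healthy_per_service and low are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    # CacheService is a made-up service with zero Healthy instances.
    "Service": ["StorageService", "StorageService", "EmailService", "CacheService"],
    "Status": ["Healthy", "Healthy", "Healthy", "Unhealthy"],
})

# Approach from this answer: filter to Healthy rows first, then count.
s = df[df.Status == "Healthy"].groupby("Service").Service.count().lt(2)
pre_filtered = df.loc[df.Service.isin(s[s].index)]

# Variant: count Healthy rows per service on the full frame, so a service
# with zero Healthy rows gets a count of 0 instead of disappearing.
healthy_per_service = (
    df["Status"].eq("Healthy").astype(int).groupby(df["Service"]).sum()
)
low = healthy_per_service[healthy_per_service < 2].index
full = df.loc[df["Service"].isin(low)]

print(pre_filtered)
print(full)
```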