How do I select rows conditionally using groupby?

Time: 2019-06-03 14:38:58

Tags: python pandas group-by pandas-groupby

I want to select rows based on a per-group condition using groupby.

import pandas as pd
import numpy as np

dftest = pd.DataFrame({'A':['Feb',np.nan,'Air','Flow','Feb',
                            'Beta','Cat','Feb','Beta','Air'],
                       'B':['s','s','t','s','t','s','t','t','t','t'],
                       'C':[5,4,3,2,1,7,6,5,4,3],
                       'D':[4,np.nan,3,np.nan,2,
                            np.nan,2,3,np.nan,7]})
def filcols3(df,dd):
    if df.iloc[0]['D']==dd:
        return df
dd=4    
grp=dftest.groupby('B').apply(filcols3,dd)

The result of grp is:

         A  B  C    D
B                   
s 0   Feb  s  5  4.0
  1   NaN  s  4  NaN
  3  Flow  s  2  NaN
  5  Beta  s  7  NaN

This is what I want.

If I use the following code instead (part 2):

def filcols3(df,dd):
    if df.iloc[0]['D']<=dd:
        return df
dd=3
grp=dftest.groupby('B').apply(filcols3,dd)

The result is:

       A    B    C    D
0   NaN  NaN  NaN  NaN
1   NaN  NaN  NaN  NaN
2   Air    t  3.0  3.0
3   NaN  NaN  NaN  NaN
4   Feb    t  1.0  2.0
5   NaN  NaN  NaN  NaN
6   Cat    t  6.0  2.0
7   Feb    t  5.0  3.0
8  Beta    t  4.0  NaN
9   Air    t  3.0  7.0

This result surprised me. What I intended to get was:

      A  B  C    D
2   Air  t  3  3.0
4   Feb  t  1  2.0
6   Cat  t  6  2.0
7   Feb  t  5  3.0
8  Beta  t  4  NaN
9   Air  t  3  7.0

What is wrong with the code in part 2? How can I get the final result I want?

2 Answers:

Answer 0: (score: 3)

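The behavior of apply is a bit unintuitive here. If you want to keep or drop entire groups based on a condition evaluated once per group, you can use GroupBy.transform to build a boolean mask and filter the DataFrame with it: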

dftest[dftest.groupby('B')['D'].transform('first') <= 3]

      A  B  C    D
2   Air  t  3  3.0
4   Feb  t  1  2.0
6   Cat  t  6  2.0
7   Feb  t  5  3.0
8  Beta  t  4  NaN
9   Air  t  3  7.0

Alternatively, to fix your code:

dftest[dftest.groupby('B')['D'].transform(lambda x: x.values[0] <= 3)]

      A  B  C    D
2   Air  t  3  3.0
4   Feb  t  1  2.0
6   Cat  t  6  2.0
7   Feb  t  5  3.0
8  Beta  t  4  NaN
9   Air  t  3  7.0
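
To make the mask explicit, here is a minimal sketch that rebuilds dftest from the question and splits the transform approach into separate steps. The names first_d, literal_first, and mask are illustrative and do not come from the answer. One caveat: 'first' returns each group's first non-null value, whereas x.values[0] (or .iloc[0]) takes the literal first row, so the two variants can disagree when a group's first row holds NaN in D. On this data they agree.

```python
import numpy as np
import pandas as pd

dftest = pd.DataFrame({'A': ['Feb', np.nan, 'Air', 'Flow', 'Feb',
                             'Beta', 'Cat', 'Feb', 'Beta', 'Air'],
                       'B': ['s', 's', 't', 's', 't', 's', 't', 't', 't', 't'],
                       'C': [5, 4, 3, 2, 1, 7, 6, 5, 4, 3],
                       'D': [4, np.nan, 3, np.nan, 2,
                             np.nan, 2, 3, np.nan, 7]})

# First non-null D per group, broadcast back onto every row of that group.
first_d = dftest.groupby('B')['D'].transform('first')

# Literal first row of each group (NaN is kept if the group starts with NaN).
literal_first = dftest.groupby('B')['D'].transform(lambda s: s.iloc[0])

# Row-aligned boolean mask: True for every row whose group passes the test.
mask = first_d <= 3
result = dftest[mask]
print(result)
```

Because the mask has the same index as dftest, the filtered rows keep their original order and index labels.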

Answer 1: (score: 3)

You can do the check with filter:

dftest.groupby('B').filter(lambda x : any(x['D'].head(1)<=3))
Out[538]: 
      A  B  C    D
2   Air  t  3  3.0
4   Feb  t  1  2.0
6   Cat  t  6  2.0
7   Feb  t  5  3.0
8  Beta  t  4  NaN
9   Air  t  3  7.0
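
If the threshold or the comparison changes, as it does between the two parts of the question, the filter call can be wrapped in a small helper. This is only a sketch: keep_groups and its parameters are hypothetical names, not part of pandas, and it assumes the condition should be tested on the literal first row of each group.

```python
import numpy as np
import pandas as pd

dftest = pd.DataFrame({'A': ['Feb', np.nan, 'Air', 'Flow', 'Feb',
                             'Beta', 'Cat', 'Feb', 'Beta', 'Air'],
                       'B': ['s', 's', 't', 's', 't', 's', 't', 't', 't', 't'],
                       'C': [5, 4, 3, 2, 1, 7, 6, 5, 4, 3],
                       'D': [4, np.nan, 3, np.nan, 2,
                             np.nan, 2, 3, np.nan, 7]})


def keep_groups(df, key, col, predicate):
    """Keep every row of the groups whose first `col` value satisfies `predicate`.

    Hypothetical helper: GroupBy.filter calls the lambda once per group and
    keeps the whole group when it returns True.
    """
    return df.groupby(key).filter(lambda g: bool(predicate(g[col].iloc[0])))


part1 = keep_groups(dftest, 'B', 'D', lambda v: v == 4)   # condition from part 1
part2 = keep_groups(dftest, 'B', 'D', lambda v: v <= 3)   # condition from part 2
print(part1)
print(part2)
```

Unlike the apply version, filter returns the matching rows with their original index rather than adding a group level, so part1 differs from the first grp output in its index only.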

Or, as an alternative to groupby, with drop_duplicates:

s=dftest.drop_duplicates('B').D<=3
dftest[dftest.B.isin(dftest.loc[s.index,'B'][s])]
Out[550]: 
      A  B  C    D
2   Air  t  3  3.0
4   Feb  t  1  2.0
6   Cat  t  6  2.0
7   Feb  t  5  3.0
8  Beta  t  4  NaN
9   Air  t  3  7.0
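
The drop_duplicates approach can be easier to follow when split into named steps. The sketch below does the same thing with the illustrative names first_rows and keep_keys, which are not from the answer. It relies on drop_duplicates keeping the first occurrence of each value of B by default.

```python
import numpy as np
import pandas as pd

dftest = pd.DataFrame({'A': ['Feb', np.nan, 'Air', 'Flow', 'Feb',
                             'Beta', 'Cat', 'Feb', 'Beta', 'Air'],
                       'B': ['s', 's', 't', 's', 't', 's', 't', 't', 't', 't'],
                       'C': [5, 4, 3, 2, 1, 7, 6, 5, 4, 3],
                       'D': [4, np.nan, 3, np.nan, 2,
                             np.nan, 2, 3, np.nan, 7]})

# One row per distinct B value: the first row in which that value appears.
first_rows = dftest.drop_duplicates('B')

# Group keys whose first row satisfies the condition on D.
keep_keys = first_rows.loc[first_rows['D'] <= 3, 'B']

# Keep every row that belongs to one of those groups.
result = dftest[dftest['B'].isin(keep_keys)]
print(result)
```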