Pandas - efficient groupby on each column separately

Time: 2019-12-21 13:05:57

Tags: python pandas numpy dataframe dask

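I need to group by each column separately in order to compute some metrics. Suppose I have a set of feature columns and a binary target column. Each feature value is a bin (a string). The target is an integer column; here, for simplicity, it contains only 1s and 0s.

Example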

import pandas as pd


var1 = ['var1_bin1', 'var1_bin2', 'var1_bin2', 'var1_bin3', 'var1_bin4', 'var1_bin4', 'var1_bin4', 'var1_bin5', 'var1_bin5', 'var1_bin5']
var2 = ['var2_bin1', 'var2_bin1', 'var2_bin2', 'var2_bin3', 'var2_bin3', 'var2_bin4', 'var2_bin4', 'var2_bin5', 'var2_bin5', 'var2_bin5']
var3 = ['var3_bin2', 'var3_bin2', 'var3_bin2', 'var3_bin3', 'var3_bin3', 'var3_bin3', 'var3_bin3', 'var3_bin4', 'var3_bin5', 'var3_bin5']
var4 = ['var4_bin1', 'var4_bin1', 'var4_bin2', 'var4_bin2', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4']
target = [1, 0, 0, 1, 1, 1, 0, 0, 0, 0]

df = pd.DataFrame({
    'var1': var1,
    'var2': var2,
    'var3': var3,
    'var4': var4,
    'target': target
})

print(df)
cols = ['var1', 'var2', 'var3', 'var4', 'target']

# Need a groupby for each column separately:
# for each column, group by the categorical values in that column,
# sum the target variable, and also count how many zeros the target has per group.

for col in cols:
    x = df.groupby(col)[['target']].sum()  # expecting aggregated metrics
    print(x)

What I expect as a result is a DataFrame of DataFrames (or some better structure), which I can sketch visually as follows:

Result representation
        var1                     | var2 ...
    ---------------------------- |
    bin    | sum | total_zeros   |
      -----------------          |
var1_bin1  | 1   | 0             |
var1_bin2  | 0   | 2             |
var1_bin3  | 1   | 0             |
var1_bin4  | 2   | 1             |
var1_bin5  | 0   | 3             |

2 answers:

Answer 0: (score: 3)

A pandas answer

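We can achieve this by first iterating over all columns through DataFrame.columns, with for col in df.columns.

Then we GroupBy on each of those columns and use GroupBy.agg. In this aggregation we take, for target, the sum and the total number of zeros.

Finally, we use pd.concat with axis=1 to place the per-column results next to each other.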

dfg = pd.concat([
    (df.groupby(col)['target']
       .agg([(f'sum_{col}', 'sum'),(f'total_zeros_{col}', lambda x: x.eq(0).sum())])
       .reset_index()
    ) for col in df.columns
], axis=1)
        var1  sum_var1  total_zeros_var1       var2  sum_var2  total_zeros_var2       var3  sum_var3  total_zeros_var3       var4  sum_var4  total_zeros_var4  target  sum_target  total_zeros_target
0  var1_bin1         1                 0  var2_bin1         1                 1  var3_bin2      1.00              2.00  var4_bin1      1.00              1.00    0.00        0.00                6.00
1  var1_bin2         0                 2  var2_bin2         0                 1  var3_bin3      3.00              1.00  var4_bin2      1.00              1.00    1.00        4.00                0.00
2  var1_bin3         1                 0  var2_bin3         2                 0  var3_bin4      0.00              1.00  var4_bin4      2.00              4.00     nan         nan                 nan
3  var1_bin4         2                 1  var2_bin4         1                 1  var3_bin5      0.00              2.00        NaN       nan               nan     nan         nan                 nan
4  var1_bin5         0                 3  var2_bin5         0                 3        NaN       nan               nan        NaN       nan               nan     nan         nan                 nan
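
The layout sketched in the question, with one block of columns per feature, can be approximated with hierarchical column labels. The following is a minimal sketch and not part of the original answer: it assumes the example data from the question with var4 included, passes a dict to pd.concat so that the feature names become the top column level, and uses illustrative names (per_col, result, bin, sum, total_zeros) that do not come from the answer.

```python
import pandas as pd

# Example data from the question (var4 included; assumed to be intended).
df = pd.DataFrame({
    'var1': ['var1_bin1', 'var1_bin2', 'var1_bin2', 'var1_bin3', 'var1_bin4',
             'var1_bin4', 'var1_bin4', 'var1_bin5', 'var1_bin5', 'var1_bin5'],
    'var2': ['var2_bin1', 'var2_bin1', 'var2_bin2', 'var2_bin3', 'var2_bin3',
             'var2_bin4', 'var2_bin4', 'var2_bin5', 'var2_bin5', 'var2_bin5'],
    'var3': ['var3_bin2', 'var3_bin2', 'var3_bin2', 'var3_bin3', 'var3_bin3',
             'var3_bin3', 'var3_bin3', 'var3_bin4', 'var3_bin5', 'var3_bin5'],
    'var4': ['var4_bin1', 'var4_bin1', 'var4_bin2', 'var4_bin2', 'var4_bin4',
             'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4'],
    'target': [1, 0, 0, 1, 1, 1, 0, 0, 0, 0],
})

# Only the feature columns are grouped here; 'target' itself is skipped.
features = [c for c in df.columns if c != 'target']

# Flag zeros in the target once, so that counting them is a plain sum.
flagged = df.assign(total_zeros=df['target'].eq(0).astype(int))

# One small summary frame per feature, with columns bin / sum / total_zeros.
per_col = {
    col: (flagged.groupby(col)[['target', 'total_zeros']]
                 .sum()
                 .rename(columns={'target': 'sum'})
                 .rename_axis('bin')
                 .reset_index())
    for col in features
}

# The dict keys become the outer level of the column MultiIndex.
result = pd.concat(per_col, axis=1)
print(result)
```

With this layout, result['var1'] selects the three-column block for var1. Blocks for features with fewer bins are padded with NaN, exactly as in the flat version above.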

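Answer 1: (score: 0)

Because performance matters, do the comparison with 0 once, before the groupby, instead of inside each group. Counting zeros then becomes a plain sum, so both columns can be aggregated with sum: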

df1 = pd.concat([
    (df.assign(total_zeros = df['target'].eq(0).astype(int))
       .groupby(col)[['target','total_zeros']]
       .sum()
       .add_suffix(f'_{col}')
       .reset_index()
    ) for col in df.columns
], axis=1)

print(df1)
        var1  target_var1  total_zeros_var1       var2  target_var2  \
0  var1_bin1            1                 0  var2_bin1            1
1  var1_bin2            0                 2  var2_bin2            0
2  var1_bin3            1                 0  var2_bin3            2
3  var1_bin4            2                 1  var2_bin4            1
4  var1_bin5            0                 3  var2_bin5            0

   total_zeros_var2       var3  target_var3  total_zeros_var3       var4  \
0                 1  var3_bin2          1.0               2.0  var4_bin1
1                 1  var3_bin3          3.0               1.0  var4_bin2
2                 0  var3_bin4          0.0               1.0  var4_bin4
3                 1  var3_bin5          0.0               2.0        NaN
4                 3        NaN          NaN               NaN        NaN

   target_var4  total_zeros_var4  target  target_target  total_zeros_target
0          1.0               1.0     0.0            0.0                 6.0
1          1.0               1.0     1.0            4.0                 0.0
2          2.0               4.0     NaN            NaN                 NaN
3          NaN               NaN     NaN            NaN                 NaN
4          NaN               NaN     NaN            NaN                 NaN
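
If a long (tidy) result is acceptable instead of side-by-side columns, the per-column loop can be replaced by a single groupby on reshaped data. This is only a sketch of an alternative technique (DataFrame.melt followed by one groupby), not something either answer proposes, and whether it is actually faster depends on the data. It assumes the question's example frame with var4 included; the names long_df, summary, feature, bin and is_zero are illustrative.

```python
import pandas as pd

# Example data from the question (var4 included; assumed to be intended).
df = pd.DataFrame({
    'var1': ['var1_bin1', 'var1_bin2', 'var1_bin2', 'var1_bin3', 'var1_bin4',
             'var1_bin4', 'var1_bin4', 'var1_bin5', 'var1_bin5', 'var1_bin5'],
    'var2': ['var2_bin1', 'var2_bin1', 'var2_bin2', 'var2_bin3', 'var2_bin3',
             'var2_bin4', 'var2_bin4', 'var2_bin5', 'var2_bin5', 'var2_bin5'],
    'var3': ['var3_bin2', 'var3_bin2', 'var3_bin2', 'var3_bin3', 'var3_bin3',
             'var3_bin3', 'var3_bin3', 'var3_bin4', 'var3_bin5', 'var3_bin5'],
    'var4': ['var4_bin1', 'var4_bin1', 'var4_bin2', 'var4_bin2', 'var4_bin4',
             'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4', 'var4_bin4'],
    'target': [1, 0, 0, 1, 1, 1, 0, 0, 0, 0],
})

# Reshape to long form: one row per (original row, feature) pair.
long_df = df.melt(id_vars='target', var_name='feature', value_name='bin')

# Flag zeros once so that both metrics reduce to a sum.
long_df['is_zero'] = long_df['target'].eq(0).astype(int)

# A single groupby over (feature, bin) covers every feature column at once.
summary = (long_df.groupby(['feature', 'bin'])[['target', 'is_zero']]
                  .sum()
                  .rename(columns={'target': 'sum', 'is_zero': 'total_zeros'}))
print(summary)
```

The result is indexed by (feature, bin), so summary.loc['var1'] gives the table for var1 alone, and no NaN padding is needed because each feature keeps only its own bins.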