Question

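I have a dataset with nearly 200 columns. Almost all of them are numeric, but 3 are non-numeric, and I want to keep those columns rather than group by them.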

Example:

team_ref num_1 num_2 num_3 matchday match_id season_id
a            1     1     1        A     AeD       2018  
a            2     2     2        B     AbD       2018
b            3     1     1        A     AeD       2018
b            4     2     2        B     AbD       2018

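I want to group by team_ref and compute the mean of num_1, num_2, and num_3, while keeping the matchday, match_id, and season_id for each event.

Because I am aggregating dozens of columns, spelling out an agg call by hand is not the best idea.

Any suggestions on how to do this?

Thanks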

Answer 1

We can do this by averaging the numeric columns and joining the values of the object (string) columns:

df.groupby('team_ref').agg(lambda x: x.mean() if x.dtype != 'object' else ','.join(x))
Out[26]: 
          num_1  num_2  num_3 matchday match_id  season_id
team_ref                                                  
a           1.5    1.5    1.5      A,B  AeD,AbD       2018
b           3.5    1.5    1.5      A,B  AeD,AbD       2018
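With roughly 200 columns, the aggregation rules can also be built programmatically instead of typed out. The following is only a sketch of that idea, not the answerer's code: `keep_cols`, `join_unique`, and `agg_spec` are names made up for illustration, and it assumes season_id should be kept as an identifier rather than averaged, even though its dtype is numeric.

```python
import pandas as pd

df = pd.DataFrame({
    "team_ref": ["a", "a", "b", "b"],
    "num_1": [1, 2, 3, 4],
    "num_2": [1, 2, 1, 2],
    "num_3": [1, 2, 1, 2],
    "matchday": ["A", "B", "A", "B"],
    "match_id": ["AeD", "AbD", "AeD", "AbD"],
    "season_id": [2018, 2018, 2018, 2018],
})

key = "team_ref"
# Assumption: these three columns are identifiers to keep, not values to average.
keep_cols = ["matchday", "match_id", "season_id"]


def join_unique(values):
    """Join the distinct values of a group, in order of first appearance."""
    return ",".join(dict.fromkeys(str(v) for v in values))


# One rule per column: join the kept columns, average everything else.
agg_spec = {
    col: (join_unique if col in keep_cols else "mean")
    for col in df.columns
    if col != key
}

result = df.groupby(key).agg(agg_spec)
print(result)
```

Listing the columns to keep explicitly avoids relying on dtype alone, which would treat season_id as just another number.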

Answer 2

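One approach is to group by team_ref first and take the mean of num_1 through num_3, then aggregate the remaining columns into lists, and finally merge the two results back together: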

import pandas as pd

df = pd.DataFrame({'team_ref': {0: 'a', 1: 'a', 2: 'b', 3: 'b'}, 'num_1': {0: 1, 1: 2, 2: 3, 3: 4}, 'num_2': {0: 1, 1: 2, 2: 1, 3: 2}, 'num_3': {0: 1, 1: 2, 2: 1, 3: 2}, 'matchday': {0: 'A', 1: 'B', 2: 'A', 3: 'B'}, 'match_id': {0: 'AeD', 1: 'AbD', 2: 'AeD', 3: 'AbD'}, 'season_id': {0: 2018, 1: 2018, 2: 2018, 3: 2018}})

grouped = df.groupby('team_ref')

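# Select several columns with a list; tuple-style selection such as
# grouped["num_1", "num_2"] is no longer supported by recent pandas versions.
new_df = grouped[["num_1", "num_2", "num_3"]].mean().reset_index().merge(
         grouped[["matchday", "match_id", "season_id"]].agg(lambda x: x.tolist()).reset_index(),
         on="team_ref", how="left")

print(new_df)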

# Output:
  team_ref  num_1  num_2  num_3 matchday    match_id     season_id
0        a    1.5    1.5    1.5   [A, B]  [AeD, AbD]  [2018, 2018]
1        b    3.5    1.5    1.5   [A, B]  [AeD, AbD]  [2018, 2018]
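For a frame with close to 200 numeric columns, listing them by name is impractical. A possible variation on the approach above, offered as a sketch under assumptions rather than as the answerer's method, derives the numeric columns by excluding the key and the three columns to keep. The names `keep_cols`, `num_cols`, `means`, and `kept` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "team_ref": ["a", "a", "b", "b"],
    "num_1": [1, 2, 3, 4],
    "num_2": [1, 2, 1, 2],
    "num_3": [1, 2, 1, 2],
    "matchday": ["A", "B", "A", "B"],
    "match_id": ["AeD", "AbD", "AeD", "AbD"],
    "season_id": [2018, 2018, 2018, 2018],
})

key = "team_ref"
# Assumption: the columns to keep are known by name; everything else is numeric.
keep_cols = ["matchday", "match_id", "season_id"]
num_cols = [c for c in df.columns if c != key and c not in keep_cols]

grouped = df.groupby(key)

# Mean of every numeric column, indexed by team_ref.
means = grouped[num_cols].mean()

# Kept columns collected into one list per group.
kept = grouped[keep_cols].agg(lambda s: s.tolist())

# Both results share the team_ref index, so an index join is enough.
new_df = means.join(kept).reset_index()
print(new_df)
```

Because both intermediate frames are indexed by team_ref, `join` can stand in for the explicit `merge` on that column.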

Group by all columns and keep the non-numeric ones
