Sum one column of a DataFrame while keeping the other columns

Time: 2018-02-13 07:09:34

Tags: python pandas group-by sum

In a pandas DataFrame df I have columns like this:

    NAME    KEYWORD  AMOUNT  INFO
0   orange  fruit    13      from italy
1   potato  veggie   7       from germany
2   potato  veggie   9       from germany
3   orange  fruit    8       from italy
4   potato  veggie   6       from germany

Grouping by KEYWORD, I want to sum the AMOUNT values of each group and keep the first value of every other column, so that the result reads:

    NAME    KEYWORD  AMOUNT  INFO
0   orange  fruit    21      from italy
1   potato  veggie   22      from germany

I tried

df.groupby('KEYWORD').sum()

But this "sums" over all columns, i.e. I get

    NAME                KEYWORD  AMOUNT  INFO
0   orangeorange        fruit    21      from italyfrom italy
1   potatopotatopotato  veggie   22      from germanyfrom germanyfrom germany

Then I tried using different functions for different columns:

df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ....})

def first(f_arg, *args):
    return f_arg

Unfortunately, this gives me a "ValueError: function does not reduce" error.

So I'm a bit at a loss. How can I apply sum only to the AMOUNT column while keeping the other columns?

2 Answers:

Answer 0: (score: 2)

Use groupby + agg with a custom aggfunc dict.

f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = 'sum'

df = df.groupby('KEYWORD', as_index=False).agg(f)
df

  KEYWORD    NAME  AMOUNT          INFO
0   fruit  orange      21    from italy
1  veggie  potato      22  from germany

dict.fromkeys gives me a nice way to generalise this to N columns. If the column order matters, add a reindex at the end:

df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df

     NAME KEYWORD  AMOUNT          INFO
0  orange   fruit      21    from italy
1  potato  veggie      22  from germany
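
For reference, here is a self-contained sketch of the approach above, rebuilding the question's sample frame by hand. The names `agg_map`, `result` and `custom` are illustrative only and not part of the original answer. It also hints at why the question's own `first` helper failed: an aggregation callable has to reduce each group's Series to a single value, which `s.iloc[0]` does and returning the Series itself does not.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "NAME": ["orange", "potato", "potato", "orange", "potato"],
    "KEYWORD": ["fruit", "veggie", "veggie", "fruit", "veggie"],
    "AMOUNT": [13, 7, 9, 8, 6],
    "INFO": ["from italy", "from germany", "from germany",
             "from italy", "from germany"],
})

# 'first' for every non-key column, then override AMOUNT with 'sum'.
agg_map = dict.fromkeys(df.columns.difference(["KEYWORD"]), "first")
agg_map["AMOUNT"] = "sum"

result = (
    df.groupby("KEYWORD", as_index=False)
      .agg(agg_map)
      .reindex(columns=df.columns)  # restore the original column order
)

# A custom callable works too, as long as it returns one scalar per group.
custom = df.groupby("KEYWORD", as_index=False).agg(
    {"NAME": lambda s: s.iloc[0], "AMOUNT": "sum", "INFO": lambda s: s.iloc[0]}
)

print(result)
```

Using the string names 'first' and 'sum' lets pandas dispatch to its built-in group reductions, which is usually preferable to a hand-written callable.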

Answer 1: (score: 1)

Use drop_duplicates on the KEYWORD column, then use assign to add the aggregated values:

df=df.drop_duplicates('KEYWORD').assign(AMOUNT=df.groupby('KEYWORD')['AMOUNT'].sum().values)
print (df)
     NAME KEYWORD  AMOUNT          INFO
0  orange   fruit      21    from italy
1  potato  veggie      22  from germany
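
One caveat worth noting: the snippet above attaches the sums by position, so it relies on the groups first appearing in the frame in the same order as groupby's sorted keys, which happens to hold for the sample data. A hedged alternative, not taken from the original answer, is to broadcast each group's total with transform before dropping duplicates, so the alignment no longer depends on row order. The sample rows are deliberately reordered here, and the name `out` is illustrative.

```python
import pandas as pd

# Same rows as the question, but with a veggie row first, so appearance
# order (veggie, fruit) differs from sorted key order (fruit, veggie).
df = pd.DataFrame({
    "NAME": ["potato", "orange", "potato", "orange", "potato"],
    "KEYWORD": ["veggie", "fruit", "veggie", "fruit", "veggie"],
    "AMOUNT": [7, 13, 9, 8, 6],
    "INFO": ["from germany", "from italy", "from germany",
             "from italy", "from germany"],
})

out = (
    # transform('sum') returns one total per original row, aligned by index.
    df.assign(AMOUNT=df.groupby("KEYWORD")["AMOUNT"].transform("sum"))
      .drop_duplicates("KEYWORD")   # keep the first row of each group
      .reset_index(drop=True)
)

print(out)
```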