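Pandas: compute the sum of each MultiIndex sublevel

Time: 2015-04-02 12:53:34

Tags: python pandas

I want to compute the sum for each sublevel of a MultiIndex and store the result in the DataFrame.

My current DataFrame looks like this: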

                    values
    first second
    bar   one     0.106521
          two     1.964873
    baz   one     1.289683
          two    -0.696361
    foo   one    -0.309505
          two     2.890406
    qux   one    -0.758369
          two     1.302628

The desired result is:

                    values
    first second
    bar   one     0.106521
          two     1.964873
          total   2.071394
    baz   one     1.289683
          two    -0.696361
          total   0.593322
    foo   one    -0.309505
          two     2.890406
          total   2.580901
    qux   one    -0.758369
          two     1.302628
          total   0.544259
    total one     0.328331
          two     5.461546
          total   5.789877

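So far I have found that the implementation below works, but I would like to know whether there is a better option. I need the fastest solution possible, because the computation seems to take a very long time once my DataFrame gets large.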

In [1]: from numpy.random import randn
   ...: from pandas import DataFrame, MultiIndex, Series
   ...: 
   ...: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: index = MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [4]: s = Series(randn(8), index=index)

In [5]: d = {'values': s}

In [6]: df = DataFrame(d)

In [7]: for col in df.index.names:
   .....:     df = df.unstack(col)
   .....:     df[('values', 'total')] = df.sum(axis=1)
   .....:     df = df.stack()
   .....:

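3 answers:

Answer 0 (score: 1)

Not sure whether you are still looking for an answer. Assuming your current DataFrame is assigned to df, you could try something like this: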

# pivot needs 'first' and 'second' as columns, so move the index levels out first
temp = df.reset_index().pivot(index='first', columns='second', values='values')
temp['total'] = temp['one'] + temp['two']
temp.stack()
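
The pivot idea above adds 'one' and 'two' by name and does not produce the final 'total' block from the question. A rough sketch of one way to generalize it, not benchmarked: widen the frame, sum along both axes, and stack back. The sample values and the names `wide` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame shaped like the one in the question (values are arbitrary).
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"values": np.arange(8, dtype=float)}, index=index)

# Rows are 'first', columns are 'second'.
wide = df["values"].unstack("second")

# Row totals: works for any number of 'second' labels.
wide["total"] = wide.sum(axis=1)

# Column totals; the new 'total' column contributes the grand total.
wide.loc["total"] = wide.sum(axis=0)

# Back to the long layout with a two-level index.
result = wide.stack().to_frame("values")
result.index.names = ["first", "second"]
```

Both sums are vectorized, so this avoids a Python-level loop over the index levels.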

Answer 1 (score: 0)

Rather ugly code:

In [162]:

print(df)
                values
first second          
bar   one     0.370291
      two     0.750565
baz   one     0.148405
      two     0.919973
foo   one     0.121964
      two     0.394017
qux   one     0.883136
      two     0.871792
In [163]:

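# Subtotal rows have no 'second' value (NaN), so they sort last within each
# 'first' group and are then labelled 'total'.
print(pd.concat((df.reset_index(),
                 df.reset_index().groupby('first')[['values']].sum().reset_index()))
        .sort_values(['first', 'second'])
        .fillna('total')
        .set_index(['first', 'second']))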
                values
first second          
bar   one     0.370291
      two     0.750565
      total   1.120856
baz   one     0.148405
      two     0.919973
      total   1.068378
foo   one     0.121964
      two     0.394017
      total   0.515981
qux   one     0.883136
      two     0.871792
      total   1.754927

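Basically, since the additional 'total' rows have to be calculated and inserted into the original DataFrame, the relationship between the original and the result is neither one-to-one nor many-to-one. So I think you have to build the 'total' DataFrame separately and concat it with the original DataFrame.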
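
As a rough sketch of that "build the totals separately, then concat" idea, here is a variant that keeps the MultiIndex throughout and also produces the final 'total' block asked for in the question. The sample values and the names `sub`, `cross`, `grand` and `result` are made up for illustration, and it assumes no existing label sorts after 'total' in the first level.

```python
import numpy as np
import pandas as pd

# Sample frame shaped like the one in the question (values are arbitrary).
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"values": np.arange(8, dtype=float)}, index=index)
names = list(df.index.names)

# Subtotal per 'first', relabelled as (first, 'total').
sub = df.groupby(level="first").sum()
sub.index = pd.MultiIndex.from_product([sub.index, ["total"]], names=names)

# Totals per 'second' across all groups, relabelled as ('total', second).
cross = df.groupby(level="second").sum()
cross.index = pd.MultiIndex.from_product([["total"], cross.index], names=names)

# Grand total as a single ('total', 'total') row.
grand = df.sum().to_frame().T
grand.index = pd.MultiIndex.from_tuples([("total", "total")], names=names)

# A stable sort on 'first' alone regroups the rows while keeping the
# concatenation order (original rows, then totals) inside each group.
result = pd.concat([df, sub, cross, grand]).sort_values("first", kind="mergesort")
```

A full `sort_index()` would place 'total' between 'one' and 'two' alphabetically, which is why the sketch sorts on the first level only.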

Answer 2 (score: 0)

I know this is an old topic, but I could not find any satisfactory way to do a rollup in pandas, and I can actually see some value in having one.

import numpy as np
import pandas as pd

# to retain the original index:
index_cols = list(df.index.names)

# start with the grand total, then iterate over each sub index except the
# last one to get the sub-sums (pd.concat and groupby replace the removed
# DataFrame.append and sum(level=...))
parts = [df.sum().to_frame().T]
for i in range(len(index_cols) - 1):
    parts.append(df.groupby(level=list(range(i + 1))).sum().reset_index())
df2 = pd.concat(parts, ignore_index=True)

# to mask the last index, for which the sub-sum was not calculated:
df2[index_cols[-1]] = np.nan

# might be done better: you can keep it as NaN (comment out the line below), which forces it to the last position in the index after sorting, or put some special character in front
df2[index_cols] = df2[index_cols].astype(object).fillna("_total")

df = (pd.concat([df.reset_index(), df2], sort=True)
        .set_index(index_cols)
        .sort_values(index_cols, ascending=False))

For my sample data:

               values
first  second
qux    two       -4.0
       one        2.0
       _total    -2.0
foo    two       -3.0
       one        4.0
       _total     1.0
baz    two        5.0
       one       -1.0
       _total     4.0
bar    two       -1.0
       one        2.0
       _total     1.0
_total _total     4.0