How do I get the row with the maximum value in a `pandas.DataFrame` after a groupby and sum?

Time: 2015-07-23 08:05:34

Tags: python pandas dataframe

Here is the df (updated with my real data):

>TIMESTAMP          OLTPSOURCE      RNR                         RQDRECORD
>20150425232836     0PU_IS_PS_44    REQU_51NHAJUV06IMMP16BVE572JM2  17020
>20150128165726     ZFI_DS41        REQU_50P1AABLYXE86KYE3O6EY390M  6925
>20150701144253     ZZZJB_TEXT      REQU_52DV5FB812JCDXDVIV9P35DGM  2
>20150107201358     0EQUIPMENT_ATTR     REQU_50EVHXSDOITYUQLP4L8UXOBT6  14205
>20150623215202     0CO_OM_CCA_1     REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150715144139     0HRPOSITION_TEXT    REQU_52I9KQ1LN4ZWTNIP0N1R68NDY  25381
>20150625175157     0HR_PA_0    REQU_528ZS1RFN0N3Y3AEB48UDCUKQ  100020
>20150309153828     0HR_PA_0    REQU_51385K5F3AGGFVCGHU997QF9M  0
>20150626185531     0FI_AA_001  REQU_52BO3RJCOG4JGHEIIZMJP9V4A  0
>20150307222336     0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ  13889
>20150630163419     0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2  0
>20150424162226     6DB_V_DGP_EXPORTDATA    REQU_51N1F5ZC8G3LW68E4TFXRGH9I  0
>20150617143720     ZRZMS_TEXT  REQU_5268R1YE6G1U7HUK971LX1FPM  6
>20150405162213     0HR_PA_0    REQU_51FFR7T4YQ2F766PFY0W9WUDM  0
>20150202165933     ZFI_DS41    REQU_50QPTCF0VPGLBYM9MGFXMWHGM  6925
>20150102162140     0HR_PA_0    REQU_50CNUT7I9OXH2WSNLC4WTUZ7U  0
>20150417184916     0FI_AA_004  REQU_51KFWWT6PPTI5X44D3MWD7CYU  0
>20150416220451     0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU  13889
>20150205150633     ZHR_DS09    REQU_50RFRYRADMA9QXB1PW4PRF5XM  6667
>20150419230724     0PU_IS_PS_44    REQU_51LC5XX6VWEERAVHEFJ9K5A6I  22528

>and the relationships between the columns are:
>OLTPSOURCE--RNR:1>n
>RNR--RQDRECORD:1>N
  

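My requirements are:

  1. Sum RQDRECORD by RNR;
  2. Get the maximum sum for each OLTPSOURCE;
  3. Finally, plot a chart showing all results for the OLTPSOURCE values that take the longest time
  4. Thanks, everyone. Let me explain my question further:

    1. If OLTPSOURCE:RNR:RQDRECORD = 1:1:1
        

      just sum RQDRECORD and return OLTPSOURCE and the SUM RESULT

    2. If OLTPSOURCE:RNR:RQDRECORD = 1:1:N
        

      just sum RQDRECORD and return OLTPSOURCE and the SUM RESULT

    3. If OLTPSOURCE:RNR:RQDRECORD = 1:N:(N OR 1)
        

      first sum RQDRECORD grouped by RNR, then find the maximum result within each OLTPSOURCE, and return every OLTPSOURCE together with its maximum RQDRECORD.

    4. So, for the sample data above, the final result I want is: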

      >TIMESTAMP  OLTPSOURCE  RNR RQDRECORD
      >20150623215202     0CO_OM_CCA_1    REQU_528XSXYWTK6FSJXDQY2ROQQ4Q  0
      >20150107201358     0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6  14205
      >20150626185531     0FI_AA_001  REQU_52BO3RJCOG4JGHEIIZMJP9V4A  0
      >20150417184916     0FI_AA_004  REQU_51KFWWT6PPTI5X44D3MWD7CYU  0
      >20150416220451     0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU  13889
      >20150625175157     0HR_PA_0    REQU_528ZS1RFN0N3Y3AEB48UDCUKQ  100020
      >20150715144139     0HRPOSITION_TEXT    REQU_52I9KQ1LN4ZWTNIP0N1R68NDY  25381
      >20150419230724     0PU_IS_PS_44    REQU_51LC5XX6VWEERAVHEFJ9K5A6I  22528
      >20150630163419     0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2  0
      >20150424162226     6DB_V_DGP_EXPORTDATA    REQU_51N1F5ZC8G3LW68E4TFXRGH9I  0
      >20150202165933     ZFI_DS41    REQU_50QPTCF0VPGLBYM9MGFXMWHGM  6925
      >20150205150633     ZHR_DS09    REQU_50RFRYRADMA9QXB1PW4PRF5XM  6667
      >20150617143720     ZRZMS_TEXT  REQU_5268R1YE6G1U7HUK971LX1FPM  6
      >20150701144253     ZZZJB_TEXT  REQU_52DV5FB812JCDXDVIV9P35DGM  2
      

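      Following EdChum's approach, I made some adjustments and got the result below. Because the dataset is very large, I applied the filter `RQDRECORD > 100000`. What I really wanted was to sort and keep the top 100, but I could not get that to work.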

        

      [1]: http://i.imgur.com/FgfZaDY.jpg "Result"

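2 answers:

Answer 0: (score: 0)

You can take the groupby result and compute the maximum over index level 0 (`level=0`, or `level='clsa'` if you prefer); this returns the maximum count for that level. However, this loses the 'clsb' column, so you can merge the result back onto the grouped result after calling `reset_index` on the grouped object. You can then reorder the columns of the resulting df with fancy indexing: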

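In [149]:
# Sum `count` for each (clsa, clsb) pair
gp = df.groupby(['clsa','clsb']).sum()
# groupby(level=0).max() replaces gp.max(level=0), which was removed in pandas 2.0
result = gp.groupby(level=0).max().reset_index().merge(gp.reset_index())
result = result[['clsa','clsb','count']]  # .ix was removed in pandas 1.0
result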

Out[149]:
  clsa clsb  count
0    a   a1      9
1    b   b2      8
2    c   c2     10
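
The frame used in this answer is not shown, so here is a minimal self-contained sketch of the same steps. The input values are hypothetical, chosen only so that the result matches the output above; only the column names `clsa`, `clsb` and `count` come from the answer.

```python
import pandas as pd

# Hypothetical sample data; the original frame for this answer is not shown.
df = pd.DataFrame({
    'clsa': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'clsb': ['a1', 'a1', 'a2', 'b1', 'b2', 'c1', 'c2'],
    'count': [4, 5, 3, 2, 8, 1, 10],
})

# Step 1: total count per (clsa, clsb) pair.
gp = df.groupby(['clsa', 'clsb']).sum()

# Step 2: largest total within each clsa (index level 0).
max_per_clsa = gp.groupby(level=0).max().reset_index()

# Step 3: merge back on the shared columns (clsa, count) to recover clsb,
# then put the columns in the desired order.
result = max_per_clsa.merge(gp.reset_index())[['clsa', 'clsb', 'count']]
print(result)
```

Because the merge matches on both `clsa` and `count`, a group in which two `clsb` values tie for the maximum would return one row per tied value.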

Answer 1: (score: 0)

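import pandas as pd
import matplotlib.pyplot as plt

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%M%S')
# Sum RQDRECORD for each (OLTPSOURCE, RNR) pair
df_gb = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False)['RQDRECORD'].sum()
# Note: first() keeps the first RNR per OLTPSOURCE in sorted key order, not the RNR with the largest sum
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']], df_gb.groupby(['OLTPSOURCE'], as_index=False).first(), on=['OLTPSOURCE', 'RNR'], how='right').sort_values('OLTPSOURCE')
final.plot(kind='bar')
plt.show()


print(final)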

             TIMESTAMP            OLTPSOURCE                             RNR  \
3  2015-06-23 21:52:02          0CO_OM_CCA_1  REQU_528XSXYWTK6FSJXDQY2ROQQ4Q   
2  2015-01-07 20:13:58       0EQUIPMENT_ATTR  REQU_50EVHXSDOITYUQLP4L8UXOBT6   
5  2015-06-26 18:55:31            0FI_AA_001  REQU_52BO3RJCOG4JGHEIIZMJP9V4A   
11 2015-04-17 18:49:16            0FI_AA_004  REQU_51KFWWT6PPTI5X44D3MWD7CYU   
6  2015-03-07 22:23:36       0FUNCT_LOC_ATTR  REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ   
4  2015-07-15 14:41:39      0HRPOSITION_TEXT  REQU_52I9KQ1LN4ZWTNIP0N1R68NDY   
10 2015-01-02 16:21:40              0HR_PA_0  REQU_50CNUT7I9OXH2WSNLC4WTUZ7U   
13 2015-04-19 23:07:24          0PU_IS_PS_44  REQU_51LC5XX6VWEERAVHEFJ9K5A6I   
7  2015-06-30 16:34:19       0WBS_ELEMT_ATTR  REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2   
8  2015-04-24 16:22:26  6DB_V_DGP_EXPORTDATA  REQU_51N1F5ZC8G3LW68E4TFXRGH9I   
0  2015-01-28 16:57:26              ZFI_DS41  REQU_50P1AABLYXE86KYE3O6EY390M   
12 2015-02-05 15:06:33              ZHR_DS09  REQU_50RFRYRADMA9QXB1PW4PRF5XM   
9  2015-06-17 14:37:20            ZRZMS_TEXT  REQU_5268R1YE6G1U7HUK971LX1FPM   
1  2015-07-01 14:42:53            ZZZJB_TEXT  REQU_52DV5FB812JCDXDVIV9P35DGM   

    RQDRECORD  
3           0  
2       14205  
5           0  
11          0  
6       13889  
4       25381  
10          0  
13      22528  
7           0  
8           0  
0        6925  
12       6667  
9           6  
1           2
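
In the output above, 0HR_PA_0 is reported with RQDRECORD 0, whereas the expected result in the question lists 100020 for it. One possible way to select the largest sum per OLTPSOURCE instead is `idxmax`, sketched below. This is a swapped-in technique, not the method used in this answer. The rows are a small subset copied from the question's sample data (TIMESTAMP is omitted for brevity), and the cutoff `N = 2` is an illustrative stand-in for the top 100 mentioned in the question.

```python
import pandas as pd

# A few rows copied from the question's sample data (TIMESTAMP omitted).
df = pd.DataFrame({
    'OLTPSOURCE': ['0HR_PA_0', '0HR_PA_0', '0PU_IS_PS_44',
                   '0PU_IS_PS_44', 'ZHR_DS09'],
    'RNR': ['REQU_528ZS1RFN0N3Y3AEB48UDCUKQ', 'REQU_51385K5F3AGGFVCGHU997QF9M',
            'REQU_51NHAJUV06IMMP16BVE572JM2', 'REQU_51LC5XX6VWEERAVHEFJ9K5A6I',
            'REQU_50RFRYRADMA9QXB1PW4PRF5XM'],
    'RQDRECORD': [100020, 0, 17020, 22528, 6667],
})

# Sum RQDRECORD per (OLTPSOURCE, RNR) pair.
sums = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False)['RQDRECORD'].sum()

# idxmax gives, for each OLTPSOURCE, the row label of its largest sum;
# .loc then pulls those rows, keeping the matching RNR.
best = sums.loc[sums.groupby('OLTPSOURCE')['RQDRECORD'].idxmax()]

# Keep only the N largest results (N = 2 here; the question mentions 100).
N = 2
top = best.nlargest(N, 'RQDRECORD')
print(top)
```

When several RNR values tie for the maximum within one OLTPSOURCE, `idxmax` keeps only the first of them, so exactly one row per OLTPSOURCE is returned.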