pandas groupby对象索引变成了FUBAR

时间:2016-06-02 07:12:00

标签: python pandas indexing group-by

我有一只大熊猫,我的groupby操作将索引变成糊状。我需要将日期作为我的索引,在每个股票代码组中排序

举例说明。像这样设置熊猫:

import pandas as pd
from StringIO import StringIO

text = """Date   Ticker        Open        High         Low   Adj_Close   Volume
    2015-04-09  vws.co  315.000000  316.100000  312.500000  311.520000  1686800
    2015-04-10  vws.co  317.000000  319.700000  316.400000  312.700000  1396500
    2015-04-13  vws.co  317.900000  321.500000  315.200000  315.850000  1564500
    2015-04-14  vws.co  320.000000  322.400000  318.700000  314.870000  1370600
    2015-04-15  vws.co  320.000000  321.500000  319.200000  316.150000   945000
    2015-04-16  vws.co  319.000000  320.200000  310.400000  307.870000  2236100
    2015-04-17  vws.co  309.900000  310.000000  302.500000  299.100000  2711900
    2015-04-20  vws.co  303.000000  312.000000  303.000000  306.490000  1629700
    2016-03-31     mmm  166.750000  167.500000  166.500000  166.630005  1762800
    2016-04-01     mmm  165.630005  167.740005  164.789993  167.529999  1993700
    2016-04-04     mmm  167.110001  167.490005  165.919998  166.399994  2022800
    2016-04-05     mmm  165.179993  166.550003  164.649994  165.809998  1610300
    2016-04-06     mmm  165.339996  167.080002  164.839996  166.809998  2092200
    2016-04-07     mmm  165.880005  167.229996  165.250000  167.160004  2721900"""

df = pd.read_csv(StringIO(text), delim_whitespace=1, parse_dates=[0], index_col=0)

代码

import pandas as pd
from pandas.io.data import DataReader
import numpy as np
import time
import os

stocklist = ['vws.co','nflx','mmm']


print ('df.tail (Input df)\n',df.tail(6),'\n')


def Screener(group):

    def diff_calc(group):

        df['Difference'] = df['Adj_Close'].diff()
        return df['Difference']

    df['Difference'] = diff_calc(group)
    return df

if __name__ == '__main__':

    df = GetStock(stocklist, start, end)
    df['Adj_Close'] = df['Adj Close']

    for ticker in stocklist:
        ### groupby screeener (filtering to only rel ticker group)
        df = df.groupby('Ticker', as_index=False).Adj_Close.apply(Screener)

    df.reset_index().sort(['Ticker', 'Date'], ascending=[1,1]).set_index('Ticker')
    print ('(Output df)\n',df,'\n')

# Test the first 7 rows of each group for rolling_mean transgress groups...
df_test = df.groupby('Ticker').head(7).reset_index().set_index('Date')
print ('df_test (summary from df) (Output)\n',df_test,'\n')

显然我的索引现在搞砸了,我不知道这是怎么发生的。

(Output df)
                   Ticker    Open    High     Low  Adj Close  Adj_Close        Date                                                               
0 0 0 2016-05-20  vws.co  443.00  446.30  441.40     442.90     442.90   
      2016-05-23  vws.co  442.00  446.70  439.90     439.90     439.90   
      2016-05-24  vws.co  439.10  450.00  438.10     450.00     450.00   
      2016-05-25  vws.co  455.50  466.10  454.30     464.90     464.90   
      2016-05-26  vws.co  465.00  470.80  464.60     464.60     464.60   
      2016-05-27  vws.co  464.00  480.70  461.20     476.00     476.00   
      2016-05-30  vws.co  477.00  481.80  473.10     475.00     475.00   
      2016-05-31  vws.co  474.00  479.30  472.20     479.00     479.00   
      2016-06-01  vws.co  477.40  480.20  472.90     474.40     474.40   
      2016-05-20    nflx   90.08   93.28   89.98      92.49      92.49   
      2016-05-23    nflx   92.98   95.29   92.85      94.89      94.89   
      2016-05-24    nflx   95.98   99.14   95.75      97.89      97.89   
      2016-05-25    nflx   99.00  100.31   98.30     100.20     100.20   

我需要将日期作为我的索引,在每个自动收录器组中排序

有人可以帮忙吗?

3 个答案:

答案 0 :(得分:1)

好的,我终于找到了解决方案。这条线是我的秘密酱

df = df.reset_index(level=0, drop=True)

这个问题帮助我将索引恢复到我想要的状态。 How to get rows in pandas data frame, with maximal values in a column and keep the original index?

下面的代码将摆脱不必要的公关。迭代在索引中添加了col。谢谢你们!

import pandas as pd
from pandas.io.data import DataReader
import numpy as np
import time
import os
from io import StringIO

text = """Date   Ticker        Open        High         Low   Adj_Close   Volume
    2015-04-09  vws.co  315.000000  316.100000  312.500000  311.520000  1686800
    2015-04-10  vws.co  317.000000  319.700000  316.400000  312.700000  1396500
    2015-04-13  vws.co  317.900000  321.500000  315.200000  315.850000  1564500
    2015-04-14  vws.co  320.000000  322.400000  318.700000  314.870000  1370600
    2015-04-15  vws.co  320.000000  321.500000  319.200000  316.150000   945000
    2015-04-16  vws.co  319.000000  320.200000  310.400000  307.870000  2236100
    2015-04-17  vws.co  309.900000  310.000000  302.500000  299.100000  2711900
    2015-04-20  vws.co  303.000000  312.000000  303.000000  306.490000  1629700
    2016-03-31     mmm  166.750000  167.500000  166.500000  166.630005  1762800
    2016-04-01     mmm  165.630005  167.740005  164.789993  167.529999  1993700
    2016-04-04     mmm  167.110001  167.490005  165.919998  166.399994  2022800
    2016-04-05     mmm  165.179993  166.550003  164.649994  165.809998  1610300
    2016-04-06     mmm  165.339996  167.080002  164.839996  166.809998  2092200
    2016-04-07     mmm  165.880005  167.229996  165.250000  167.160004  2721900"""

df = pd.read_csv(StringIO(text), delim_whitespace=1, parse_dates=[0], index_col=0)
runstart = time.time()     # Start script timer

stocklist = ['vws.co','nflx','mmm']#,'msft','tsla']
tickers =   []

def Screener(group):

    def diff_calc(group):

        df['Difference'] = df['Adj_Close'].diff()
        return df['Difference']

    df['Difference'] = diff_calc(group)
    return df

if __name__ == '__main__':

    for ticker in stocklist:
        ### groupby screeener (filtering to only rel ticker group)
        df = df.groupby('Ticker', as_index=False).Adj_Close.apply(Screener) #.reset_index()

        df = df.reset_index(level=0, drop=True)

    print ('(Output df)\n',df,'\n')

    # Test the first 7 rows of each group for rolling_mean transgress groups...
    df_test = df.groupby('Ticker').head(7).reset_index().set_index('Date')
    print ('df_test (summary from df) (Output)\n',df_test,'\n')

答案 1 :(得分:0)

groupby函数会自动将您分组的列设置为索引。为了避免这种情况,请在您的groupby中通过选项:as_index = False

df_test = df.groupby('Ticker', as_index = False).head(7).reset_index().set_index('Date')

答案 2 :(得分:0)

diff对象的GroupBy方法将日期保留为索引:

# sort if needed (use 'mergesort' algorithm to preserve date order)
df = df.sort().sort_values('Ticker', kind='mergesort')

df['Difference'] = df.groupby('Ticker')['Adj_Close'].diff()

注意:如果输入未排序,则需要先对其进行排序。上面的代码首先按索引(即日期)排序,然后使用mergesort按报价排序以保留日期的顺序。

<强>输出

In [21]: df[['Ticker', 'Adj_Close', 'Difference']]
Out[21]: 
            Ticker   Adj_Close  Difference
Date                                      
2015-04-09  vws.co  311.520000         NaN
2015-04-10  vws.co  312.700000    1.180000
2015-04-13  vws.co  315.850000    3.150000
2015-04-14  vws.co  314.870000   -0.980000
2015-04-15  vws.co  316.150000    1.280000
2015-04-16  vws.co  307.870000   -8.280000
2015-04-17  vws.co  299.100000   -8.770000
2015-04-20  vws.co  306.490000    7.390000
2016-03-31     mmm  166.630005         NaN
2016-04-01     mmm  167.529999    0.899994
2016-04-04     mmm  166.399994   -1.130005
2016-04-05     mmm  165.809998   -0.589996
2016-04-06     mmm  166.809998    1.000000
2016-04-07     mmm  167.160004    0.350006