pandas: reindex with a custom function

Time: 2016-06-19 11:29:03

Tags: python pandas dataframe

I am looking for a way to reindex my data using a custom function. My data looks like this:

                        AAA    BBB    CCC    DDD
Time                                             
2009-01-30 09:30:00  6407.04  43.90  44.01  85.11
2009-01-30 09:39:00  6403.20  43.82  44.01  84.93
2009-01-30 09:40:00  6400.00  43.90  44.03  84.90
2009-01-30 09:45:00  6396.16  43.97  44.04  84.91
2009-01-30 09:48:00  6393.60  44.02  44.07  84.81
2009-01-30 09:55:00  6400.00  44.31  44.14  84.78
2009-01-30 09:56:00  6406.40  44.36  44.16  84.57
2009-01-30 09:59:00  6426.24  44.36  44.11  84.25
2009-01-30 10:00:00  6438.40  44.32  44.09  84.32
2009-01-30 10:06:00  6495.36  44.43  44.16  84.23  

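It is minute-level price data for a few stocks. I want to split each trading day into 5 parts and resample my data accordingly. I started by creating a custom index: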

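import pandas as pd
from datetime import datetime

# Calendar days to cover
index_date = pd.date_range('2009-01-30', '2016-03-01')
index_date = pd.Series(index_date)
# Times of day at 78-minute steps between the open and the close
index_time = pd.date_range('09:30:00', '16:00:00', freq='78min')
index_time = pd.Series(index_time.time)

# Combine every date with every time of day into one sorted Series
index = index_date.apply(
    lambda d: index_time.apply(
        lambda t: datetime.combine(d, t)
        )
    ).unstack().sort_values().reset_index(drop=True)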

Let's assume I want to apply a basic percent-change function:

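def percent_change(x):
    # x is assumed to be a pandas Series; returns None for an empty window
    if len(x):
        # .iloc makes the first/last lookup positional rather than label-based
        return (x.iloc[-1] - x.iloc[0]) / x.iloc[0]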

The desired dataset should look like this:

                      AAA    BBB  CCC  DDD

2009-01-30 09:30:00    NaN   NaN  NaN  NaN
2009-01-30 10:48:00     y     y    y    y    # y is the output of the
2009-01-30 12:06:00     x     x    x    x      percent_change function
2009-01-30 13:24:00                            from 9:30 to 10:48
2009-01-30 14:42:00                          # x is the output of the
2009-01-30 16:00:00                            percent_change function
2009-01-31 09:30:00                            from 10:49 to 12:06, etc.
2009-01-31 10:48:00

A larger data sample can be found here: https://www.dropbox.com/s/h29xlpveb1o7p2u/data.csv?dl=0
How can I do this?

1 answer:

Answer 0 (score: 3)

UPDATE:

In [182]: %paste
(df.groupby(df.index.date)
   .apply(lambda x: x.resample('78T',
                               loffset=pd.Timedelta('24minute')).mean())
   .ffill()
   .pct_change()
)
## -- End pasted text --
Out[182]:
                                    vxxc
           Time
2009-02-02 2009-02-02 09:30:00       NaN
           2009-02-02 10:48:00 -0.010745
           2009-02-02 12:06:00 -0.006372
           2009-02-02 13:24:00 -0.003701
           2009-02-02 14:42:00  0.001614
           2009-02-02 16:00:00 -0.005668
2009-02-03 2009-02-03 09:30:00 -0.009334
           2009-02-03 10:48:00 -0.007039
           2009-02-03 12:06:00 -0.002014
           2009-02-03 13:24:00 -0.002705
           2009-02-03 14:42:00 -0.017530
           2009-02-03 16:00:00 -0.004704
           2009-02-03 17:18:00 -0.001893
2009-02-04 2009-02-04 09:30:00 -0.019076
           2009-02-04 10:48:00 -0.002563
           2009-02-04 12:06:00  0.002348
           2009-02-04 13:24:00  0.010099
           2009-02-04 14:42:00  0.013081
           2009-02-04 16:00:00 -0.000264
           2009-02-04 17:18:00  0.007121
2009-02-05 2009-02-05 09:30:00  0.026527
           2009-02-05 10:48:00 -0.013580
           2009-02-05 12:06:00 -0.018056
           2009-02-05 13:24:00 -0.005020
           2009-02-05 14:42:00 -0.006316
           2009-02-05 16:00:00  0.003269
2009-02-06 2009-02-06 09:30:00 -0.030773
           2009-02-06 10:48:00  0.001088
           2009-02-06 12:06:00  0.010469
           2009-02-06 13:24:00 -0.008337
...                                  ...
2009-02-23 2009-02-23 09:30:00  0.002312
           2009-02-23 10:48:00  0.012162
           2009-02-23 12:06:00  0.009785
           2009-02-23 13:24:00  0.008687
           2009-02-23 14:42:00  0.000421
           2009-02-23 16:00:00  0.012550
2009-02-24 2009-02-24 09:30:00 -0.009290
           2009-02-24 10:48:00 -0.017526
           2009-02-24 12:06:00 -0.004194
           2009-02-24 13:24:00 -0.021528
           2009-02-24 14:42:00 -0.027898
           2009-02-24 16:00:00 -0.012646
2009-02-25 2009-02-25 09:30:00  0.021827
           2009-02-25 10:48:00  0.001863
           2009-02-25 12:06:00 -0.012693
           2009-02-25 13:24:00 -0.006884
           2009-02-25 14:42:00 -0.013019
           2009-02-25 16:00:00 -0.008020
2009-02-26 2009-02-26 09:30:00 -0.015104
           2009-02-26 10:48:00 -0.011319
           2009-02-26 12:06:00  0.019160
           2009-02-26 13:24:00  0.016271
           2009-02-26 14:42:00  0.003807
           2009-02-26 16:00:00  0.007333
2009-02-27 2009-02-27 09:30:00  0.023949
           2009-02-27 10:48:00 -0.027659
           2009-02-27 12:06:00 -0.006932
           2009-02-27 13:24:00 -0.003167
           2009-02-27 14:42:00  0.005263
           2009-02-27 16:00:00  0.010594

[118 rows x 1 columns]
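
For readers on a recent pandas: as far as I know, the loffset argument of resample() was removed in pandas 2.0, and the 'T' alias is deprecated in favour of 'min'. A minimal sketch of an equivalent is below, assuming that shifting the result's index by hand reproduces what loffset did. The helper name resample_day and the synthetic data are made up for illustration; they are not from the original post.

```python
import numpy as np
import pandas as pd

# Synthetic minute bars for two trading days (made-up values, illustration only)
idx = pd.date_range("2009-02-02 09:30", "2009-02-02 16:00", freq="min").append(
    pd.date_range("2009-02-03 09:30", "2009-02-03 16:00", freq="min"))
df = pd.DataFrame({"vxxc": np.linspace(100.0, 90.0, len(idx))}, index=idx)
df.index.name = "Time"


def resample_day(day):
    # Hypothetical helper: 78-minute bins counted from midnight, averaged per bin
    res = day.resample("78min").mean()
    # Shift only the labels by 24 minutes, which is the role loffset played
    res.index = res.index + pd.Timedelta("24min")
    return res


out = (df.groupby(df.index.date)
         .apply(resample_day)
         .ffill()
         .pct_change())
print(out.head(8))
```

Note that this only relabels the bins. The bin edges themselves still fall on multiples of 78 minutes from midnight, so the row labelled 09:30 covers 09:06 to 10:24.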

OLD answer:

You can do it this way:

In [104]: df.resample('18T').pct_change()
C:\envs\py35\Scripts\ipython:1: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
Out[104]:
                          AAA       BBB       CCC       DDD
Time
2009-01-30 09:18:00       NaN       NaN       NaN       NaN
2009-01-30 09:36:00 -0.001373  0.000626  0.000625 -0.002614
2009-01-30 09:54:00  0.005477  0.009755  0.002146 -0.005389

Or, if we want to get rid of the FutureWarning:

In [109]: df.resample('18T').mean().pct_change()
Out[109]:
                          AAA       BBB       CCC       DDD
Time
2009-01-30 09:18:00       NaN       NaN       NaN       NaN
2009-01-30 09:36:00 -0.001373  0.000626  0.000625 -0.002614
2009-01-30 09:54:00  0.005477  0.009755  0.002146 -0.005389

Note: I used 18T rather than 78T because your sample data spans less than 78 minutes, so change 18T to 78T for your real data set.
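
The snippets above compute the change between window means. If what is needed is the first-to-last change inside each window, as percent_change in the question computes, one possible sketch is below. It assumes pandas 1.1 or later (for the origin and offset arguments of resample()) and a session that opens at 09:30. The helper name resample_trading_day, its parameters, and the synthetic data are hypothetical, and percent_change is adapted here to return NaN for an empty window.

```python
import numpy as np
import pandas as pd


def percent_change(x):
    # Relative change from the first to the last value of the window
    if len(x) == 0:
        return np.nan
    return (x.iloc[-1] - x.iloc[0]) / x.iloc[0]


def resample_trading_day(day, freq="78min",
                         session_open=pd.Timedelta(hours=9, minutes=30)):
    # Hypothetical helper: bin edges sit at the open plus multiples of freq;
    # each bin is (start, end] and is labelled by its right edge
    return day.resample(freq, origin="start_day", offset=session_open,
                        closed="right", label="right").agg(percent_change)


# Synthetic minute bars for two trading days (made-up values, illustration only)
idx = pd.date_range("2009-01-30 09:30", "2009-01-30 16:00", freq="min").append(
    pd.date_range("2009-02-02 09:30", "2009-02-02 16:00", freq="min"))
df = pd.DataFrame({"AAA": np.linspace(100.0, 110.0, len(idx))}, index=idx)

# Resample each day separately so the bins never drift across midnight
out = pd.concat([resample_trading_day(day)
                 for _, day in df.groupby(df.index.date)])
print(out.head(6))
```

With these settings the row labelled 10:48 holds the change from the 09:31 bar to the 10:48 bar. The 09:30 row contains a single bar, so it comes out as 0.0 rather than NaN; it can be masked afterwards if NaN is preferred there.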