Reindex a MultiIndex pandas DataFrame

Time: 2016-03-01 00:49:22

Tags: python-2.7 pandas

Greetings.

I am trying to figure out how to do the following in pandas:

I have a CSV file with timestamps that looks like this:

[image: head of the file]

Next, I build the following pivot_table with pandas:

trips.pivot_table('bike', aggfunc='count',
                  index=['date', 'hour'],
                  columns='station_arrived').fillna(0)

which returns something like this:

[image: output of the pivot table]

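My question is this:

I want to reindex the 'hour' level so that the index runs from hour 0 to hour 23 for every day, even when there are no counts for a given hour.

Reindexing with a single index is easy, but things get complicated when I try it on a MultiIndex DataFrame.

Is there a way to make this work?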

1 answer:

Answer 0 (score: 2)

import datetime as dt
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    {'action': ['C', 'C', 'C', 'C', 'C', 'A', 'C'],
     'bike': [89, 89, 57, 29, 76, 69, 17],
     'cust_id': [6, 6, 30, 30, 30, 30, 30],
     'date': [Timestamp('2010-02-02 00:00:00'),
              Timestamp('2010-02-02 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00')],
     'date_arrived': [Timestamp('2010-02-02 14:27:00'),
                      Timestamp('2010-02-02 15:42:00'),
                      Timestamp('2010-02-05 12:06:00'),
                      Timestamp('2010-02-05 12:07:00'),
                      Timestamp('2010-02-05 13:11:00'),
                      Timestamp('2010-02-05 13:14:00'),
                      Timestamp('2010-02-05 13:45:00')],
     'date_removed': [Timestamp('2010-02-02 13:57:00'),
                      Timestamp('2010-02-02 15:12:00'),
                      Timestamp('2010-02-05 11:36:00'),
                      Timestamp('2010-02-05 11:37:00'),
                      Timestamp('2010-02-05 12:41:00'),
                      Timestamp('2010-02-05 12:44:00'),
                      Timestamp('2010-02-05 13:15:00')],
     'hour': [14, 15, 12, 12, 13, 13, 13],
     'station_arrived': [56, 56, 85, 85, 85, 85, 85],
     'station_removed': [56, 56, 85, 85, 85, 85, 85]})

First, create an hourly index spanning the date range:

idx = pd.date_range(df.date.min(), df.date.max() + dt.timedelta(days=1), freq='H')
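To make the contents of idx concrete, here is a minimal sketch using hard-coded bounds that mirror the sample data (first_day and last_day are illustrative names, not part of the original answer). It passes pd.offsets.Hour() instead of the 'H' alias; both describe an hourly frequency.

```python
import pandas as pd

# Assumed bounds, copied from the sample frame: min and max of the `date` column.
first_day = pd.Timestamp('2010-02-02')
last_day = pd.Timestamp('2010-02-05')

# pd.offsets.Hour() is the offset object that the 'H' alias stands for.
idx = pd.date_range(first_day, last_day + pd.Timedelta(days=1),
                    freq=pd.offsets.Hour())

# date_range includes both endpoints, so the index holds 4 full days
# (96 hourly stamps) plus one extra stamp at midnight on 2010-02-06.
n_stamps = len(idx)
first_stamp, last_stamp = idx[0], idx[-1]
```

If the trailing midnight stamp is not wanted, slicing with idx[:-1] leaves exactly 24 stamps per day.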

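Now you want a datetime index, so set the index to date_arrived. Then use groupby with two keys at once, a TimeGrouper for the hour and station_arrived, and count the non-null station_arrived values in each group. Unstack the result to get the data in pivot-table layout. (pd.TimeGrouper was removed in pandas 1.0; pd.Grouper(freq='H') is the equivalent in later versions.)

Finally, use reindex to align the result to the new hourly index idx, and fill the missing values with zero.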

>>> (df
     .set_index('date_arrived')
     .groupby([pd.TimeGrouper('H'), 'station_arrived'])
     .station_arrived
     .count()
     .unstack()
     .reindex(idx)
     .fillna(0)
     )
station_arrived      56  85
2010-02-02 00:00:00   0   0
2010-02-02 01:00:00   0   0
2010-02-02 02:00:00   0   0
2010-02-02 03:00:00   0   0
2010-02-02 04:00:00   0   0
2010-02-02 05:00:00   0   0
2010-02-02 06:00:00   0   0
2010-02-02 07:00:00   0   0
2010-02-02 08:00:00   0   0
2010-02-02 09:00:00   0   0
2010-02-02 10:00:00   0   0
2010-02-02 11:00:00   0   0
2010-02-02 12:00:00   0   0
2010-02-02 13:00:00   0   0
2010-02-02 14:00:00   1   0
2010-02-02 15:00:00   1   0
2010-02-02 16:00:00   0   0
...
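The result above has a flat hourly DatetimeIndex, while the question asks for a (date, hour) MultiIndex. The following is a rough sketch of one way to get there, not the original answer's code. It assumes a reduced stand-in for the sample frame, swaps in pd.Grouper for TimeGrouper and size() for count() (the two agree here because station_arrived has no nulls), and ends the hourly index at 23:00 on the last day so every day has exactly 24 rows.

```python
import pandas as pd

# Reduced stand-in for the sample frame: only the columns the pipeline needs.
df = pd.DataFrame({
    'date_arrived': pd.to_datetime([
        '2010-02-02 14:27', '2010-02-02 15:42', '2010-02-05 12:06',
        '2010-02-05 12:07', '2010-02-05 13:11', '2010-02-05 13:14',
        '2010-02-05 13:45']),
    'station_arrived': [56, 56, 85, 85, 85, 85, 85],
})

hour = pd.offsets.Hour()
first_day = df['date_arrived'].min().normalize()
last_day = df['date_arrived'].max().normalize()
# Stop at 23:00 on the last day so that each day contributes 24 stamps.
idx = pd.date_range(first_day, last_day + pd.Timedelta(hours=23), freq=hour)

counts = (
    df.set_index('date_arrived')
      .groupby([pd.Grouper(freq=hour), 'station_arrived'])
      .size()                      # rows per (hour bin, station)
      .unstack(fill_value=0)       # stations become columns
      .reindex(idx, fill_value=0)  # add the hours that had no trips
)

# Split the flat hourly index into the (date, hour) levels from the question.
counts.index = pd.MultiIndex.from_arrays(
    [counts.index.normalize(), counts.index.hour], names=['date', 'hour'])
```

With this layout, counts.loc[pd.Timestamp('2010-02-05')] selects the 24 hourly rows of a single day. Another route would be to reindex the question's pivot table directly against pd.MultiIndex.from_product built from the list of dates and range(24).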