Question

假设我已经从sql或csv（不是在python中创建）加载了时间序列数据，索引将是：

DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
               '2015-03-02 02:00:00', '2015-03-02 03:00:00',
               '2015-03-02 04:00:00', '2015-03-02 05:00:00',
               '2015-03-02 06:00:00', '2015-03-02 07:00:00',
               '2015-03-02 08:00:00', '2015-03-02 09:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)

如您所见，'freq'为None。我想知道如何检测该系列的频率并将'freq'设置为其频率。

如果可能的话，我希望这可以在数据不连续的情况下工作（系列中有很多中断）。

我试图找到两个时间戳之间所有差异的模式，但我不知道如何将其转换为系列可读的格式

Answer 1

也许尝试区分时间索引并使用模式（或最小差异）作为频率。

import pandas as pd
import numpy as np

# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df

                        col
2015-03-02 01:00:00  2.0261
2015-03-02 04:00:00  1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00  0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00  1.8453
...                     ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00  1.1962
2015-07-19 15:00:00  1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00  0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00  0.5049
2015-07-19 23:00:00 -0.5349

[2000 rows x 1 columns]

# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()

01:00:00    1181
02:00:00     499
03:00:00     180
04:00:00      93
05:00:00      24
06:00:00      10
07:00:00       9
08:00:00       3
dtype: int64

# the mode can be considered as frequency
res.index[0]  # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min()  # output: Timedelta('0 days 01:00:00')




# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng

DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
               '2015-03-02 03:00:00', '2015-03-02 04:00:00',
               '2015-03-02 05:00:00', '2015-03-02 06:00:00',
               '2015-03-02 07:00:00', '2015-03-02 08:00:00',
               '2015-03-02 09:00:00', '2015-03-02 10:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', length=3359, freq='H', tz=None)

Answer 2

找到最小时差

np.diff(data.index.values).min()

通常以ns为单位。要获得频率，请假设ns：

freq = 1e9 / np.diff(df.index.values).min().astype(int)

Answer 3

值得一提的是，如果数据是连续的，您可以使用pandas.DateTimeIndex.inferred_freq属性：

dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix._set_freq(None)
dt_ix.inferred_freq
Out[2]: 'H'

或pandas.infer_freq方法：

pd.infer_freq(dt_ix)
Out[3]: 'H'

如果不连续，pandas.infer_freq将返回None。与已经提出的方法类似，另一种方法是使用pandas.Series.diff方法：

split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')

Python Pandas：检测时间序列的频率

3 个答案: