Question

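I am looking for a pandas equivalent of the resample method for a DataFrame whose index is not a DatetimeIndex but an array of integers, or maybe even floats.

I know that in some cases (this one, for example) the resample method can easily be replaced by a reindex and an interpolation, but in some cases (I think) it cannot.

For example, if I have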

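import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))
withdates = df.set_index(pd.date_range('2012-01-01', periods=10))
withdates.resample('5D').std()  # older pandas versions accepted withdates.resample('5D', np.std)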

which gives me

                   0         1
2012-01-01  1.184582  0.492113
2012-01-06  0.533134  0.982562

But I cannot produce the same result using df and resample. So I am looking for something that would work like

 df.resample(5, np.std)

which would give me

          0         1
0  1.184582  0.492113
5  0.533134  0.982562

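Does such a method exist? The only way I have found to do this is to manually split df into smaller DataFrames, apply np.std to each one, and then concatenate everything back together, which I find slow and not smart at all.

Cheers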

Answer 1

Setup

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

You need to create the group labels yourself. I'd use:

(df.index.to_series() / 5).astype(int)

which gives you a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...], and then pass it to groupby.

You also need to specify the index of the new DataFrame. I'd use:

df.index[4::5]

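which takes the current index starting at the 5th position (hence the 4) and every 5th position after that. It looks like [4, 9, 14, 19]. I could have used df.index[::5] to get the starting positions instead, but I went with the ending positions.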
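A minimal sketch of these two building blocks, assuming the default integer index from the setup. Floor division (`//`) is shown as an equivalent way to get the same labels for non-negative integers; the names `labels`, `labels_floordiv` and `ends` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

# Group labels: 0 for rows 0-4, 1 for rows 5-9, and so on.
labels = (df.index.to_series() / 5).astype(int)

# For non-negative integers, floor division gives the same labels.
labels_floordiv = df.index.to_series() // 5

# Last index value of each block of five.
ends = df.index[4::5]

print(labels.tolist())
print(list(ends))
```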

Solution

# assign as variable because I'm going to use it more than once.
s = (df.index.to_series() / 5).astype(int)

df.groupby(s).std().set_index(s.index[4::5])

which looks like:

           A         B
4   0.198019  0.320451
9   0.329750  0.408232
14  0.293297  0.223991
19  0.095633  0.376390
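If you would rather label each group by its starting position, as in the output the question asks for, one option is to group directly on the start of each block. This is only a sketch under the assumption of a non-negative integer index; `width` and `starts` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))

width = 5
# Map every index value to the start of its block: 0,0,0,0,0,5,5,5,5,5
starts = df.index.to_series() // width * width

# Grouping on these values makes them the index of the result.
result = df.groupby(starts).std()
print(result)
```

Because the group keys are already the block starts, no separate set_index step is needed here.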

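Other considerations

This amounts to downsampling. We have not yet addressed upsampling.

To go from the index of the DataFrame we produced back to a more frequent index, we can use reindex like this: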

# assign what we've done above to df_down
df_down = df.groupby(s).std().set_index(s.index[4::5])

df_up = df_down.reindex(range(20)).bfill()

which looks like:

           A         B
0   0.198019  0.320451
1   0.198019  0.320451
2   0.198019  0.320451
3   0.198019  0.320451
4   0.198019  0.320451
5   0.329750  0.408232
6   0.329750  0.408232
7   0.329750  0.408232
8   0.329750  0.408232
9   0.329750  0.408232
10  0.293297  0.223991
11  0.293297  0.223991
12  0.293297  0.223991
13  0.293297  0.223991
14  0.293297  0.223991
15  0.095633  0.376390
16  0.095633  0.376390
17  0.095633  0.376390
18  0.095633  0.376390
19  0.095633  0.376390

We could also reindex with something else, such as range(0, 20, 2), to upsample to the even integer index.
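One thing to watch when the new index does not contain every original label: with range(0, 20, 2), the labels 9 and 19 are dropped by the reindex, so a chained .bfill() has nothing to fill the last rows from. A sketch of the difference, rebuilding `df_down` as above and using the `method` argument of reindex (the names `even`, `df_even` and `df_even_chained` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])
s = (df.index.to_series() / 5).astype(int)
df_down = df.groupby(s).std().set_index(s.index[4::5])

even = range(0, 20, 2)

# Filling during the reindex looks values up in the original index
# [4, 9, 14, 19], so label 6 is filled from 9 and label 16 from 19.
df_even = df_down.reindex(even, method='bfill')

# Chaining .bfill() afterwards only sees the labels that survived the
# reindex (4 and 14), so 16 and 18 have nothing after them and stay NaN.
df_even_chained = df_down.reindex(even).bfill()

print(df_even)
print(df_even_chained)
```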

Answer 2

As an alternative, here is one thing that can be done:

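import pandas as pd

def resample(df, rule, how='mean', **kwargs):
    # how: anything accepted by .agg(), e.g. 'mean', 'std' or a callable
    if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
        return df.resample(rule, **kwargs).agg(how)
    # Assumes a sorted integer index; bins are left-closed [edge, edge + rule)
    # and the last edge lies beyond the last index value so no row is dropped.
    edges = range(df.index[0], df.index[-1] + rule + 1, rule)
    idx = pd.cut(df.index, edges, right=False)
    aux = df.groupby(idx, observed=False).agg(how)
    aux.index = list(edges[:-1])  # label each bin by its left edge
    return aux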
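To see what the binning step does on its own, here is a small sketch with hand-picked bin edges and deterministic data, so the result can be checked by eye. The data and the names `edges`, `bins` and `out` are illustrative, not taken from the answer.

```python
import numpy as np
import pandas as pd

# A = 0, 2, ..., 22 and B = 1, 3, ..., 23 on the index 0..11
df = pd.DataFrame(np.arange(24).reshape(12, 2), columns=['A', 'B'])

# Left-closed bins [0, 5), [5, 10), [10, 15) covering index values 0..11.
edges = [0, 5, 10, 15]
bins = pd.cut(df.index, edges, right=False)

# observed=False keeps a row for every bin, even an empty one.
out = df.groupby(bins, observed=False).mean()
out.index = edges[:-1]  # label each bin by its left edge
print(out)
```

The last bin only holds two rows (index 10 and 11), which is why the upper edge has to extend past the last index value.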

Answer 3

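@piSquared's solution is very nice, but I don't like choosing the index by hand when reindexing.

The following should also work for any kind of downsampling (including float indexes), and it automatically uses the mean of the index values within each range as the new index: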

df = pd.DataFrame(index = np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
df.index.name = 'crazy_index'

s = (df.index.to_series() / 10).astype(int)

Now you can freely choose the function to compute in each subgroup:

# calculate std() in each group
df.groupby(s).std().set_index(s.groupby(s).apply(lambda x: np.mean(x.index)))

                    A         B
crazy_index
3.667539     0.276986  0.317642
14.275074    0.248700  0.372551
25.054042    0.254860  0.297586

# calculate median() in each group
df.groupby(s).median().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
                    A         B
crazy_index
3.667539     0.454654  0.521649
14.275074    0.451265  0.490125
25.054042    0.489326  0.622781
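One caveat: astype(int) truncates toward zero, so with a width of 10 both -3.2 and 3.2 get label 0, and negative index values would share a bin with small positive ones. A sketch using np.floor instead, assuming you want bins of equal width on both sides of zero (the data and the names `width`, `truncated` and `floored` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]},
                  index=[-12.5, -3.2, 3.2, 14.0])
width = 10

# Truncation toward zero merges (-10, 0) and [0, 10) into label 0.
truncated = (df.index.to_series() / width).astype(int)

# Flooring keeps every bin exactly `width` wide, also below zero.
floored = np.floor(df.index.to_series() / width).astype(int)

print(truncated.tolist())
print(floored.tolist())

# Label each group by the mean of the index values it contains.
out = df.groupby(floored).mean()
out.index = df.index.to_series().groupby(floored).mean().to_numpy()
print(out)
```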

Edit: there were some errors in the indexing; it is now correct and working.

Pandas equivalent of resample for an integer index
