Question

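I would like to override the values of one time series with those of another. The input series has values at all points. The overriding time series has the same index (i.e. dates), but I only want to override the values on some dates. The way I thought of specifying this is a time series that holds a value wherever I want to override, and NaN wherever I don't want an override applied.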

This is perhaps best illustrated with a quick example:

            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11      2
2013-06-01     3   NaN      3
2013-07-01     4     9      4
2013-08-01     2    97      5

# should become

            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11     11
2013-06-01     3   NaN      3
2013-07-01     4     9      9
2013-08-01     2    97     97

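As the example shows, I don't think the replace or where methods would work, because the replacement depends on the index position and not on the input value. Since I need to do this more than once, I have put it in a function, and I do have a working solution, shown below: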

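import pandas as pd

def overridets(ts, orts):
    # Put both series side by side, aligned on the index.
    tmp = pd.concat([ts, orts], join='outer', axis=1)
    # Row by row: keep ts where orts is NaN, otherwise take orts.
    out = tmp.apply(lambda x: x.iloc[0] if pd.isnull(x.iloc[1]) else x.iloc[1], axis=1)
    return out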

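The problem is that this runs relatively slowly: 20 to 30 ms for a 500-point series in my environment. Multiplying two 500-point series takes about 200 us, so this is roughly 100 times slower. Any suggestions on how to speed it up?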

Edit

With help from @Andy and @bmu, my final solution to the problem is as follows:

def overridets(ts, orts):

     ts.name = 'outts'
     orts.name = 'orts'
     tmp = pd.concat([ts, orts], join='outer', axis=1)

     out = tmp['outts'].where(pd.isnull(tmp['orts']), tmp['orts'])
     return out

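I didn't need inplace=True, because the call is always wrapped in a function that returns a single time series. That is nearly 50 times faster, so thanks guys!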
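As a quick sanity check, the row-wise and the vectorised versions can be compared on the example data from the question. This is only a minimal sketch: the names `override_slow` and `override_fast` are made up for illustration, and `rename` is used here instead of setting `.name` so that the caller's series are left untouched.

```python
import numpy as np
import pandas as pd


def override_slow(ts, orts):
    # Row-wise version from the question: take orts where present, else ts.
    tmp = pd.concat([ts, orts], join="outer", axis=1)
    return tmp.apply(
        lambda x: x.iloc[0] if pd.isnull(x.iloc[1]) else x.iloc[1], axis=1
    )


def override_fast(ts, orts):
    # Vectorised version: keep ts where orts is NaN, otherwise use orts.
    # rename() returns new objects, so the inputs keep their original names.
    tmp = pd.concat(
        [ts.rename("outts"), orts.rename("orts")], join="outer", axis=1
    )
    return tmp["outts"].where(tmp["orts"].isna(), tmp["orts"])


# Example data from the question (monthly dates, NaN means "no override").
idx = pd.date_range("2013-04-01", periods=5, freq="MS")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
orts = pd.Series([np.nan, 11.0, np.nan, 9.0, 97.0], index=idx)

slow = override_slow(ts, orts)
fast = override_fast(ts, orts)
print(fast.tolist())
```

Both functions should return the same values here; only the run time differs.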

Answer 1

A faster way to copy the non-NaN values of one column into another column is to use loc with a boolean mask:

In [11]: df1
Out[11]:
            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11      2
2013-06-01     3   NaN      3
2013-07-01     4     9      4
2013-08-01     2    97      5

In [12]: df1.loc[pd.notnull(df1['orts']), 'outts'] = df1['orts']

In [13]: df1
Out[13]:
            ints  orts  outts
index
2013-04-01     1   NaN      1
2013-05-01     2    11     11
2013-06-01     3   NaN      3
2013-07-01     4     9      9
2013-08-01     2    97     97
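For anyone who wants to reproduce the transcript above as a script, here is a self-contained sketch. The frame is rebuilt by hand from the printed table, and `outts` is stored as float (an assumption, not shown in the output above) so that the assignment needs no dtype conversion.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the printed table.
idx = pd.date_range("2013-04-01", periods=5, freq="MS")
df1 = pd.DataFrame(
    {
        "ints": [1, 2, 3, 4, 2],
        "orts": [np.nan, 11.0, np.nan, 9.0, 97.0],
        "outts": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    index=idx,
)

# Boolean mask: True on the dates where an override value is present.
mask = df1["orts"].notna()

# Overwrite outts only on the masked rows; other rows are left as they were.
df1.loc[mask, "outts"] = df1.loc[mask, "orts"]

print(df1)
```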

This is much faster than your function:

In [21]: df500 = pd.DataFrame(np.random.randn(500, 2), columns=['orts', 'outts'])

In [22]: %timeit overridets(df500['outts'], df500['orts'])
100 loops, best of 3: 14 ms per loop

In [23]: %timeit df500.loc[pd.notnull(df500['orts']), 'outts'] = df500['orts']
1000 loops, best of 3: 400 us per loop

In [24]: df100k = pd.DataFrame(np.random.randn(100000, 2), columns=['orts', 'outts'])

In [25]: %timeit overridets(df100k['outts'], df100k['orts'])
1 loops, best of 3: 2.67 s per loop

In [26]: %timeit df100k.loc[pd.notnull(df100k['orts']), 'outts'] = df100k['orts']
100 loops, best of 3: 9.61 ms per loop

As @bmu points out, you are in fact better off using where:

In [31]: %timeit df500['outts'].where(pd.isnull(df500['orts']), df500['orts'], inplace=True)
1000 loops, best of 3: 281 us per loop

In [32]: %timeit df100k['outts'].where(pd.isnull(df100k['orts']), df100k['orts'], inplace=True)
100 loops, best of 3: 2.9 ms per loop
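Depending on the pandas version, an in-place call on a column selected with `df['outts']` may not write back to the parent frame (this is the case under copy-on-write behaviour). A hedged alternative, assuming the same column names as above, is to assign the result of where back to the column:

```python
import numpy as np
import pandas as pd

# Small stand-in for the frames used in the timings above.
df = pd.DataFrame(
    {
        "orts": [np.nan, 11.0, np.nan, 9.0, 97.0],
        "outts": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# where() keeps outts where the condition is True (orts is NaN)
# and takes the value from orts everywhere else.
df["outts"] = df["outts"].where(df["orts"].isna(), df["orts"])

print(df)
```

The timings above were not re-measured for this variant.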

Answer 2

The combine_first function is built into pandas and handles exactly this case:

In [62]:  df

Out [62]:
                ints  orts  outts
    2013-04-01     1   NaN      1
    2013-05-01     2    11     11
    2013-06-01     3   NaN      3
    2013-07-01     4     9      9
    2013-08-01     2    97     97

In [63]:
    df['outts'] =  df.orts.combine_first(df.ints)
    df

Out [63]:
                ints  orts  outts
    2013-04-01     1   NaN      1
    2013-05-01     2    11     11
    2013-06-01     3   NaN      3
    2013-07-01     4     9      9
    2013-08-01     2    97     97

This should be as fast as any of the previous solutions...

In [64]:
    df500 = pd.DataFrame(np.random.randn(500, 2), columns=['orts', 'outts'])
    %timeit df500.orts.combine_first(df500.outts)

Out [64]:
    1000 loops, best of 3: 210 µs per loop
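The same idea also works on two standalone series, which is the shape of the original question. A minimal sketch, using the example values from the question (the variable names are chosen for illustration only):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2013-04-01", periods=5, freq="MS")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
orts = pd.Series([np.nan, 11.0, np.nan, 9.0, 97.0], index=idx)

# combine_first takes values from orts and falls back to ts
# wherever orts is NaN, aligning the two series on their index.
out = orts.combine_first(ts)

# For comparison: the where-based formulation of the same override.
out_where = ts.where(orts.isna(), orts)

print(out.tolist())
```

Note the order of the operands: the overriding series is the caller, and the input series is the fallback.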

Overriding one time series with another
