Creating a dataframe column from a transformation of another column, using vectorization

Time: 2021-02-28 02:55:36

Tags: python python-3.x dataframe vectorization

I have a dataframe like this, with two columns, date and indicator:

date                  indicator 
2019-10-26 06:48:49   -1.073525
2019-10-27 06:19:31   -0.375276
2019-10-28 06:50:44    0.643764
2019-10-29 07:21:35    0.863731
2019-10-30 07:52:36    1.022312
2019-10-31 08:23:18    1.125842
2019-11-01 08:52:35    0.863731
2019-11-02 09:16:28    0.831097
2019-11-03 09:42:20    0.529638
2019-11-04 10:09:01   -0.735926
2019-11-05 10:34:39   -1.743626
2019-11-06 11:00:39   -0.872055

The idea is to create a column signal, without looping, that works as follows:

  • If indicator < -1:
    • if signal is 0, it becomes 1 and keeps that value until indicator turns positive
    • if signal is already 1, it does not change
  • If indicator > 1:
    • if signal is 0, it becomes -1 and keeps that value until indicator turns negative
    • if signal is already -1, it does not change
  • If indicator changes sign:
    • if signal is -1 or 1, it becomes 0
    • if signal is 0, it does not change

So it would give something like this:

date                  indicator    signal 
2019-10-26 06:48:49   -1.073525      1
2019-10-27 06:19:31   -0.375276      1
2019-10-28 06:50:44    0.643764      0
2019-10-29 07:21:35    0.863731      0 
2019-10-30 07:52:36    1.022312     -1
2019-10-31 08:23:18    1.125842     -1
2019-11-01 08:52:35    0.863731     -1 
2019-11-02 09:16:28    0.831097     -1 
2019-11-03 09:42:20    0.529638     -1
2019-11-04 10:09:01   -0.735926      0  
2019-11-05 10:34:39   -1.743626      1
2019-11-06 11:00:39   -0.872055      1
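For reference, the rules above can be sketched as a plain Python loop (for clarity only; the vectorized version is what the question asks for, and the values below are copied from the sample data):

```python
# Reference (looping) implementation of the signal rules, to pin down the
# intended semantics. Each step either enters a position (|indicator| > 1),
# exits it on a sign change, or keeps the previous signal.
def signal_loop(indicator):
    signals = []
    s = 0
    for x in indicator:
        if x < -1:
            s = 1                 # enter long; stays 1 while already 1
        elif x > 1:
            s = -1                # enter short; stays -1 while already -1
        elif (s == 1 and x > 0) or (s == -1 and x < 0):
            s = 0                 # indicator changed sign: reset to 0
        signals.append(s)
    return signals

vals = [-1.073525, -0.375276, 0.643764, 0.863731, 1.022312, 1.125842,
        0.863731, 0.831097, 0.529638, -0.735926, -1.743626, -0.872055]
print(signal_loop(vals))
```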

I tried creating some columns of 1s and -1s based on the indicator value, then taking differences and cumulative sums, but I could not get exactly this column.

4 answers:

Answer 0 (score: 1):

A pure numpy solution, without np.vectorize:

indicator_np = df.indicator.to_numpy()
indicator_abs_gt1 = np.abs(indicator_np)>1
np.sign(indicator_np, out=indicator_np)
signchanges = np.ediff1d(indicator_np, to_begin=0).astype(bool)
signal = np.where(
    indicator_abs_gt1 | signchanges, 
    -indicator_np* indicator_abs_gt1, 
    np.nan
)
mask = np.isnan(signal)                   ##
idx = np.arange(mask.size) * ~mask        ##  Inspired from Divakar's answer -
np.maximum.accumulate(idx, out=idx)       ##  https://stackoverflow.com/a/41191127/5431791
df['signal'] = signal[idx].astype(int)    ##

>>> df
date                  indicator    signal 
2019-10-26 06:48:49   -1.073525      1
2019-10-27 06:19:31   -0.375276      1
2019-10-28 06:50:44    0.643764      0
2019-10-29 07:21:35    0.863731      0 
2019-10-30 07:52:36    1.022312     -1
2019-10-31 08:23:18    1.125842     -1
2019-11-01 08:52:35    0.863731     -1 
2019-11-02 09:16:28    0.831097     -1 
2019-11-03 09:42:20    0.529638     -1
2019-11-04 10:09:01   -0.735926      0  
2019-11-05 10:34:39   -1.743626      1
2019-11-06 11:00:39   -0.872055      1

Although the performance improvement over the currently accepted solution is negligible on the sample data provided in the question, the improvement is huge once the data size grows significantly.

Setup

def pure_np(series):
    indicator_np = series.to_numpy()
    indicator_abs_gt1 = np.abs(indicator_np)>1
    np.sign(indicator_np, out=indicator_np)
    signchanges = np.ediff1d(indicator_np, to_begin=0).astype(bool)
    signal = np.where(indicator_abs_gt1 | signchanges, -indicator_np* indicator_abs_gt1, np.nan)
    mask = np.isnan(signal)
    idx = np.arange(mask.size) * ~mask
    np.maximum.accumulate(idx, out=idx)
    return signal[idx].astype(int)

s = 0  # global state holding the last signal value

def conditions(x):
    global s
    if x > 1:
        s = -1
    elif x < -1:
        s = 1
    else:
        if ((s == -1) & (x < 0)) | ((s == 1) & (x > 0)):
            s = 0
    return s
TmSmth = np.vectorize(conditions)

Benchmarks

>>> df.shape    # sample df
(12, 2)

>>> %timeit TmSmth(df["indicator"])
45.7 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit pure_np(df["indicator"])
39.2 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  ~ 1.1X speed-up

>>> df = pd.concat([df]*1_000, ignore_index=True)
>>> df.shape
(12000, 2)

>>> %timeit TmSmth(df["indicator"])
5.5 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit pure_np(df["indicator"])
265 µs ± 5.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   ~ 21X speed-up


>>> df = pd.concat([df]*1_000, ignore_index=True)   # 12 million rows
>>> df.shape
(12000000, 2)

>>> %timeit TmSmth(df['indicator'])
6.43 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pure_np(df['indicator'])
448 ms ± 58.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)       ~14X speed-up

Answer 1 (score: 0):

I don't know why you insist on avoiding loops, but I have a partial solution

df['signal'] = [None] * len(df)
df.loc[df['indicator'] < -1, 'signal'] = 1
df.loc[df['indicator'] > 1, 'signal'] = -1

loc = df.rolling(window=2).indicator.aggregate(lambda x: x.iloc[0] * x.iloc[1] < 0).fillna(0)
df.loc[loc > 0, 'signal'] = 0

Result:

    indicator signal
0   -1.073525      1
1   -0.375276   None
2    0.643764      0
3    0.863731   None
4    1.022312     -1
5    1.125842     -1
6    0.863731   None
7    0.831097   None
8    0.529638   None
9   -0.735926      0
10  -1.743626      1
11  -0.872055   None

From here, you need to fill each None with the previous value, which I don't know how to do without looping.
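That last forward-fill step can in fact be done without a Python loop using pandas' ffill. A minimal sketch, assuming a signal column containing 1/-1/0 at decision points and None elsewhere, as produced by the partial solution above:

```python
import pandas as pd

# Illustrative signal column matching the partial result above:
# values at decision points, None everywhere else.
df = pd.DataFrame({'signal': [1, None, 0, None, -1, -1,
                              None, None, None, 0, 1, None]})

# Forward-fill: each None takes the most recent non-None value.
df['signal'] = df['signal'].ffill().astype(int)
print(df['signal'].tolist())
```

If the very first row could be None, an extra `.fillna(0)` before the cast would be needed.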

Answer 2 (score: 0):

I found something that works, even though I guess it could be optimized. It is based on @user4340135's answer here: Numpy "where" with multiple conditions. I added a global variable to keep the last value:

s = 0  # global state: the last signal value

def conditions(x):
    global s
    if x > 1:
        s = -1
    elif x < -1:
        s = 1
    else:
        if ((s == -1) & (x < 0)) | ((s == 1) & (x > 0)):
            s = 0
    return s


func = np.vectorize(conditions)
df['signal'] = func(df["indicator"])

Answer 3 (score: 0):

numpy.where() can be used with two or even more condition parts, like this:

numpy.where((condition1)|(condition2)) for or cases

numpy.where((condition1)&(condition2)) for and cases

The output of numpy.where() can be used like this: df.iloc[out_put_of_numpy_where]. Although I haven't fully understood the question, I guess this would solve it.
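A minimal sketch of the combined-condition usage described above, on illustrative values taken from the sample indicator column:

```python
import numpy as np

indicator = np.array([-1.07, -0.38, 0.64, 1.02, 1.13, -1.74])

# "or": rows where the indicator is below -1 OR above 1
idx = np.where((indicator < -1) | (indicator > 1))[0]
print(idx.tolist())

# "and": rows where the indicator is positive AND below 1
idx2 = np.where((indicator > 0) & (indicator < 1))[0]
print(idx2.tolist())
```

The returned index arrays could then be fed to df.iloc to select the matching rows.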
