I have a dataframe like this, with two columns, date and indicator:
date indicator
2019-10-26 06:48:49 -1.073525
2019-10-27 06:19:31 -0.375276
2019-10-28 06:50:44 0.643764
2019-10-29 07:21:35 0.863731
2019-10-30 07:52:36 1.022312
2019-10-31 08:23:18 1.125842
2019-11-01 08:52:35 0.863731
2019-11-02 09:16:28 0.831097
2019-11-03 09:42:20 0.529638
2019-11-04 10:09:01 -0.735926
2019-11-05 10:34:39 -1.743626
2019-11-06 11:00:39 -0.872055
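The idea is to create a column signal, without looping, that works as follows:
- If indicator < -1, then:
  - if signal is 0, it becomes 1 and keeps that value until indicator becomes positive
  - if signal is already 1, it does not change
- If indicator > 1, then:
  - if signal is 0, it becomes -1 and keeps that value until indicator becomes negative
  - if signal is already -1, it does not change
- If indicator changes sign:
  - if signal is -1 or 1, it becomes 0
  - if signal is 0, it does not change
So it would give something like this: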
date indicator signal
2019-10-26 06:48:49 -1.073525 1
2019-10-27 06:19:31 -0.375276 1
2019-10-28 06:50:44 0.643764 0
2019-10-29 07:21:35 0.863731 0
2019-10-30 07:52:36 1.022312 -1
2019-10-31 08:23:18 1.125842 -1
2019-11-01 08:52:35 0.863731 -1
2019-11-02 09:16:28 0.831097 -1
2019-11-03 09:42:20 0.529638 -1
2019-11-04 10:09:01 -0.735926 0
2019-11-05 10:34:39 -1.743626 1
2019-11-06 11:00:39 -0.872055 1
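I tried creating some columns containing 1 and -1 based on the indicator values, then taking differences and cumulative sums, but I did not manage to get this exact column.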
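For reference, the rules above can be pinned down with a plain loop. This is only a sketch for checking vectorized answers against, not the loop-free solution the question asks for; the name reference_signal is hypothetical, and the signal is assumed to start at 0.

```python
def reference_signal(indicator):
    """Apply the rules row by row (hypothetical helper, for checking only)."""
    state = 0  # assumption: the signal starts at 0
    out = []
    for x in indicator:
        if x < -1:
            state = 1
        elif x > 1:
            state = -1
        elif (state == 1 and x > 0) or (state == -1 and x < 0):
            # the indicator crossed zero while a signal was active
            state = 0
        out.append(state)
    return out


values = [-1.073525, -0.375276, 0.643764, 0.863731, 1.022312, 1.125842,
          0.863731, 0.831097, 0.529638, -0.735926, -1.743626, -0.872055]
expected = [1, 1, 0, 0, -1, -1, -1, -1, -1, 0, 1, 1]
print(reference_signal(values) == expected)
```

On the sample data this reproduces the signal column shown in the table above.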
Answer 0 (score: 1)
A pure numpy solution that does not use np.vectorize:
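import numpy as np

# Copy so the in-place operations below do not overwrite df['indicator'].
indicator_np = df['indicator'].to_numpy(copy=True)
indicator_abs_gt1 = np.abs(indicator_np) > 1
np.sign(indicator_np, out=indicator_np)  # keep only the sign of each value
# True wherever the sign differs from the previous row
signchanges = np.ediff1d(indicator_np, to_begin=0).astype(bool)
signal = np.where(
    indicator_abs_gt1 | signchanges,
    -indicator_np * indicator_abs_gt1,  # -sign past the threshold, 0 on a plain sign change
    np.nan                              # placeholder: carry the previous value forward
)
signal[0] = np.nan_to_num(signal[0])  # a first row inside [-1, 1] starts at 0
# Forward-fill the NaNs. Inspired by Divakar's answer:
# https://stackoverflow.com/a/41191127/5431791
mask = np.isnan(signal)
idx = np.arange(mask.size) * ~mask
np.maximum.accumulate(idx, out=idx)
df['signal'] = signal[idx].astype(int)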
>>> df
date indicator signal
2019-10-26 06:48:49 -1.073525 1
2019-10-27 06:19:31 -0.375276 1
2019-10-28 06:50:44 0.643764 0
2019-10-29 07:21:35 0.863731 0
2019-10-30 07:52:36 1.022312 -1
2019-10-31 08:23:18 1.125842 -1
2019-11-01 08:52:35 0.863731 -1
2019-11-02 09:16:28 0.831097 -1
2019-11-03 09:42:20 0.529638 -1
2019-11-04 10:09:01 -0.735926 0
2019-11-05 10:34:39 -1.743626 1
2019-11-06 11:00:39 -0.872055 1
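The forward-fill step at the end of the code above (the lines credited to Divakar's answer) may be easier to follow in isolation. Here is a minimal sketch on a made-up array; the values are illustrative only, and the trick assumes the first element is not NaN.

```python
import numpy as np

# Illustrative data: NaN means "reuse the previous value".
signal = np.array([1.0, np.nan, 0.0, np.nan, -1.0, np.nan, np.nan])

mask = np.isnan(signal)
# Keep each valid position's own index; NaN positions get 0 for now.
idx = np.where(mask, 0, np.arange(signal.size))
# A running maximum turns every 0 into the index of the last valid position.
np.maximum.accumulate(idx, out=idx)
filled = signal[idx]

print(idx)     # index of the value each position takes
print(filled)  # forward-filled array
```

Because indices only grow from left to right, the running maximum at any position is the index of the most recent non-NaN entry, which is what a forward fill needs.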
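Although the performance gain over the currently accepted solution is negligible on the sample data provided in the question, it becomes substantial as the data size grows (roughly 14x to 21x in the benchmarks below).
Setup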
def pure_np(series):
    # Copy so the in-place operations below do not overwrite the caller's data.
    indicator_np = series.to_numpy(copy=True)
    indicator_abs_gt1 = np.abs(indicator_np) > 1
    np.sign(indicator_np, out=indicator_np)
    signchanges = np.ediff1d(indicator_np, to_begin=0).astype(bool)
    signal = np.where(indicator_abs_gt1 | signchanges, -indicator_np * indicator_abs_gt1, np.nan)
    signal[0] = np.nan_to_num(signal[0])  # a first row inside [-1, 1] starts at 0
    mask = np.isnan(signal)
    idx = np.arange(mask.size) * ~mask
    np.maximum.accumulate(idx, out=idx)
    return signal[idx].astype(int)

s = 0  # last signal value, shared across calls to conditions()

def conditions(x):
    global s
    if x > 1:
        s = -1
    elif x < -1:
        s = 1
    else:
        if ((s == -1) & (x < 0)) | ((s == 1) & (x > 0)):
            s = 0
    return s

df['signal'] = [0] * len(df)
TmSmth = np.vectorize(conditions)
Benchmarks:
>>> df.shape # sample df
(12, 2)
>>> %timeit TmSmth(df["indicator"])
45.7 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit pure_np(df["indicator"])
39.2 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) ~ 1.1X speed-up
>>> df = pd.concat([df]*1_000, ignore_index=True)
>>> df.shape
(12000, 2)
>>> %timeit TmSmth(df["indicator"])
5.5 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit pure_np(df["indicator"])
265 µs ± 5.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) ~ 21X speed-up
>>> df = pd.concat([df]*1_000, ignore_index=True) # 12 million rows
>>> df.shape
(12000000, 2)
>>> %timeit TmSmth(df['indicator'])
6.43 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pure_np(df['indicator'])
448 ms ± 58.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) ~14X speed-up
Answer 1 (score: 0)
I don't know why you insist on not looping, but I have a partial solution:
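df['signal'] = [None] * len(df)
# Truthy where the indicator's sign differs from the previous row
loc = df.rolling(window=2).indicator.aggregate(lambda x: x.iloc[0] * x.iloc[1] < 0).fillna(0)
# Use .loc rather than chained indexing so the assignments reach df itself.
df.loc[loc > 0, 'signal'] = 0
# Thresholds go last, so a row that flips sign and also crosses a threshold keeps 1 or -1.
df.loc[df['indicator'] < -1, 'signal'] = 1
df.loc[df['indicator'] > 1, 'signal'] = -1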
Result:
indicator signal
0 -1.073525 1
1 -0.375276 None
2 0.643764 0
3 0.863731 None
4 1.022312 -1
5 1.125842 -1
6 0.863731 None
7 0.831097 None
8 0.529638 None
9 -0.735926 0
10 -1.743626 1
11 -0.872055 None
From here, you need to fill each None with the previous value, which I don't know how to do without looping.
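One possible way to finish this without an explicit loop is pandas' forward fill. The sketch below starts from the partial column shown above, rebuilt by hand as an object Series; the name partial is hypothetical, and filling a leading gap with 0 is an assumption about the starting state.

```python
import pandas as pd

# The partial 'signal' column from the result above (None = not yet decided).
partial = pd.Series(
    [1, None, 0, None, -1, -1, None, None, None, 0, 1, None],
    dtype=object,
)

signal = (
    partial.astype(float)  # None becomes NaN
    .ffill()               # carry the last known value forward
    .fillna(0)             # assumption: any leading gap starts at 0
    .astype(int)
)
print(signal.tolist())
```

On the sample data this yields the signal column requested in the question. Whether it matches the rules on other inputs depends on the partial column being right for every row.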
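Answer 2 (score: 0)
I found something that works, even though I guess it could be optimized. It is based on @user4340135's answer here: Numpy "where" with multiple conditions. I added a global variable to keep the last value: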
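s = 0  # last signal value, kept between calls

def conditions(x):
    global s
    if x > 1:
        s = -1
    elif x < -1:
        s = 1
    else:
        if ((s == -1) & (x < 0)) | ((s == 1) & (x > 0)):
            s = 0
    return s

df['signal'] = [0] * len(df)
# otypes avoids the extra trial call np.vectorize would otherwise make on the first element
func = np.vectorize(conditions, otypes=[int])
df['signal'] = func(df["indicator"])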
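If the global variable is a concern, the same state update can be expressed with itertools.accumulate, which passes the previous result into each call. This is a sketch of an alternative, not the approach used in this answer; the name step is hypothetical, and it still iterates at Python level, so it is not expected to be faster.

```python
from itertools import accumulate


def step(state, x):
    """Return the next signal given the previous signal and the indicator."""
    if x > 1:
        return -1
    if x < -1:
        return 1
    if (state == -1 and x < 0) or (state == 1 and x > 0):
        return 0
    return state


indicator = [-1.073525, -0.375276, 0.643764, 0.863731, 1.022312, 1.125842,
             0.863731, 0.831097, 0.529638, -0.735926, -1.743626, -0.872055]

# initial=0 seeds the state (Python 3.8+); drop it from the output afterwards.
signal = list(accumulate(indicator, step, initial=0))[1:]
print(signal)
```

Since the state lives in the accumulation rather than in a module-level variable, repeated runs do not depend on what an earlier run left behind.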
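Answer 3 (score: 0)
numpy.where() can be used with two or more conditions, as follows:
numpy.where((condition1) | (condition2))
for "or" problems, and
numpy.where((condition1) & (condition2))
for "and" problems.
The output of numpy.where() can then be used like this: df.iloc[out_put_of_numpy_where]. Although I did not fully understand the question, this should solve it.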
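As a small illustration of combining conditions this way, here is a sketch on a few of the question's indicator values; the variable names are made up for the example. Note that this only selects rows and does not by itself carry the signal from one row to the next.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"indicator": [-1.073525, -0.375276, 0.643764, 1.022312]})
x = df["indicator"].to_numpy()

# np.where with a single argument returns a tuple of index arrays; take the first.
either = np.where((x < -1) | (x > 1))[0]  # "or": outside the [-1, 1] band
both = np.where((x > 0) & (x < 1))[0]     # "and": strictly between 0 and 1

rows_outside_band = df.iloc[either]
print(either, both)
print(rows_outside_band)
```

The parentheses around each comparison matter, because | and & bind more tightly than < and > in Python.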