I'm trying to do random sampling in Python as efficiently as possible, but I'm puzzled: NumPy's np.random.choice() is much slower than the standard library's random.choices().
import numpy as np
import random
np.random.seed(12345)
# use gamma distribution
shape, scale = 2.0, 2.0
s = np.random.gamma(shape, scale, 1000000)
meansample = []
samplesize = 500
%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
23.3 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
152 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23 seconds versus 152 milliseconds is a huge difference.
What am I doing wrong?
Answer 0 (score: 2)
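There are two issues here. First, with the pure-Python random library, you probably meant to use sample rather than choices: choices samples with replacement, whereas sample samples without replacement. That changes the benchmark somewhat. Second, np.random.choice has better-performing alternatives for sampling without replacement; this is a known issue related to the random generator API. You can use np.random.Generator to get better performance. My timings: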
%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
# 1 loop, best of 3: 12.4 s per loop
%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
# 10 loops, best of 3: 118 ms per loop
sl = s.tolist()
%timeit meansample = [np.mean(random.sample(sl, k=samplesize)) for x in range(0,500)]
# 1 loop, best of 3: 219 ms per loop
g = np.random.Generator(np.random.PCG64())
%timeit meansample = [ np.mean( g.choice( s, samplesize, replace=False)) for _ in range(500)]
# 10 loops, best of 3: 25 ms per loop
So random.sample outperforms np.random.choice when sampling without replacement, but it is slower than np.random.Generator.choice.
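Outside IPython, where %timeit is not available, the same three-way comparison can be reproduced with the standard timeit module. The following is a minimal sketch rather than a rigorous benchmark: the population size and repeat count are reduced from the question's values so that it finishes quickly, and the constants and function names (POPULATION_SIZE, legacy_choice, time_all, and so on) are illustrative assumptions, not part of the original post.

```python
import random
import timeit

import numpy as np

# Illustrative sizes (assumptions): smaller than the question's 1,000,000
# population and 500 repetitions so the script runs in well under a minute.
POPULATION_SIZE = 100_000
SAMPLE_SIZE = 500
REPEATS = 20

np.random.seed(12345)
population = np.random.gamma(2.0, 2.0, POPULATION_SIZE)
population_list = population.tolist()  # random.sample needs a sequence; a list is fastest

# Same construction as in the answer, with an explicit seed added.
rng = np.random.Generator(np.random.PCG64(12345))


def legacy_choice():
    """Sample means via the legacy np.random.choice, without replacement."""
    return [
        np.mean(np.random.choice(population, SAMPLE_SIZE, replace=False))
        for _ in range(REPEATS)
    ]


def stdlib_sample():
    """Sample means via random.sample, which also samples without replacement."""
    return [
        np.mean(random.sample(population_list, k=SAMPLE_SIZE))
        for _ in range(REPEATS)
    ]


def generator_choice():
    """Sample means via Generator.choice, without replacement."""
    return [
        np.mean(rng.choice(population, SAMPLE_SIZE, replace=False))
        for _ in range(REPEATS)
    ]


def time_all():
    """Return the best-of-3 wall time in seconds for each approach."""
    candidates = {
        "np.random.choice": legacy_choice,
        "random.sample": stdlib_sample,
        "Generator.choice": generator_choice,
    }
    return {
        name: min(timeit.repeat(fn, number=1, repeat=3))
        for name, fn in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in time_all().items():
        print(f"{name}: {seconds * 1e3:.1f} ms")
```

Absolute numbers will vary with the machine and the NumPy version, so only the relative ordering is worth comparing against the timings above.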