Huge speed difference between similar code

Date: 2015-08-18 23:04:27

Tags: python performance numpy

Why is there such a large speed difference between the following L2-norm calculations:

import numpy as np

a = np.arange(1200.0).reshape((-1,3))

%timeit [np.sqrt((a*a).sum(axis=1))]
100000 loops, best of 3: 12 µs per loop

%timeit [np.sqrt(np.dot(x,x)) for x in a]
1000 loops, best of 3: 814 µs per loop

%timeit [np.linalg.norm(x) for x in a]
100 loops, best of 3: 2 ms per loop

As far as I can tell, all three produce the same result.
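The equivalence claim can be checked directly with `np.allclose` (a quick sketch, not part of the original question):

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# The three per-row L2-norm variants from the question
v1 = np.sqrt((a * a).sum(axis=1))
v2 = np.array([np.sqrt(np.dot(x, x)) for x in a])
v3 = np.array([np.linalg.norm(x) for x in a])

print(np.allclose(v1, v2) and np.allclose(v1, v3))  # → True
```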

Here is the relevant part of the numpy.linalg.norm source code:

x = asarray(x)

# Check the default case first and handle it immediately.
if ord is None and axis is None:
    x = x.ravel(order='K')
    if isComplexType(x.dtype.type):
        sqnorm = dot(x.real, x.real) + dot(x.imag, x.imag)
    else:
        sqnorm = dot(x, x)
    return sqrt(sqnorm)
Edit: Someone suggested that one of the versions might be parallelized, but I checked and that is not the case. All three versions consume 12.5% of the CPU (which is typically what Python code does on my 4-physical/8-virtual-core Xeon CPU).

1 Answer:

Answer 0 (score: 4)

np.dot will usually call a BLAS library function, so its speed depends on which BLAS library your build of numpy is linked against. In general, I would expect it to have a larger constant overhead but to scale much better as array size increases. However, the fact that you are calling it from a list comprehension (effectively a plain Python for loop) likely negates any performance benefit of using BLAS.
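The overhead argument can be sketched with the standard `timeit` module: one vectorized pass makes a single call into NumPy's C code, while the comprehension pays Python-level dispatch cost on every one of the 400 rows (timings are machine-dependent; the numbers below are illustrative, not from the original answer):

```python
import timeit
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# One vectorized call: the whole computation runs inside NumPy's C code.
t_vec = timeit.timeit(lambda: np.sqrt((a * a).sum(axis=1)), number=1000)

# 400 separate np.dot calls: Python-loop and per-call dispatch overhead
# dominates; any BLAS advantage on a length-3 vector is negligible.
t_loop = timeit.timeit(
    lambda: [np.sqrt(np.dot(x, x)) for x in a], number=1000)

print(t_loop > t_vec)  # per-call overhead makes the loop version slower
```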

If you get rid of the list comprehension and use the axis= kwarg, np.linalg.norm is comparable to your first example, but np.einsum is significantly faster than both:

In [1]: %timeit np.sqrt((a*a).sum(axis=1))
The slowest run took 10.12 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 11.1 µs per loop

In [2]: %timeit np.linalg.norm(a, axis=1)
The slowest run took 14.63 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 13.5 µs per loop

# this is what np.linalg.norm does internally
In [3]: %timeit np.sqrt(np.add.reduce(a * a, axis=1))
The slowest run took 34.05 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 10.7 µs per loop

In [4]: %timeit np.sqrt(np.einsum('ij,ij->i',a,a))
The slowest run took 5.55 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.42 µs per loop