Why is B = numpy.dot(A, x) so much slower than looping with B[i,:,:] = numpy.dot(A[i,:,:], x)?

Asked: 2015-10-08 00:16:32

Tags: python numpy multidimensional-array product

I am getting some efficiency test results that I can't explain.

I want to assemble a matrix B whose i-th entries are B[i,:,:] = A[i,:,:].dot(x), where each A[i,:,:] is a 2D matrix, and so is x.

I can do this in three ways. To test performance, I generate random (numpy.random.randn) arrays A of shape (10, 1000, 1000) and x of shape (1000, 1200), and get the following timing results:

(1) a single multi-dimensional dot product

B = A.dot(x)

total time: 102.361 s

(2) looping through i and performing 2D dot products

   # initialize B = np.zeros([dim1, dim2, dim3])
   for i in range(A.shape[0]):
       B[i,:,:] = A[i,:,:].dot(x)

total time: 0.826 s

(3) numpy.einsum

B3 = np.einsum("ijk, kl -> ijl", A, x)

total time: 8.289 s

So option (2) is by far the fastest. But looking at just (1) and (2), I don't see why there should be such a big difference between them: how can looping over i and doing 2D dot products be ~124 times faster? Both use numpy.dot. Any insights?

The code I used for the above results is listed below:

import numpy as np
import numpy.random as npr
import time

dim1, dim2, dim3 = 10, 1000, 1200
A = npr.randn(dim1, dim2, dim2)
x = npr.randn(dim2, dim3)

# consider three ways of assembling the same matrix B: B1, B2, B3

t = time.time()
B1 = np.dot(A,x)
td1 = time.time() - t
print "a single dot product of A [shape = (%d, %d, %d)] with x [shape = (%d, %d)] completes in %.3f s" \
  % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td1)


B2 = np.zeros([A.shape[0], x.shape[0], x.shape[1]])
t = time.time()
for i in range(A.shape[0]):
    B2[i,:,:] = np.dot(A[i,:,:], x)
td2 = time.time() - t
print "taking %d dot products of 2D dot products A[i,:,:] [shape = (%d, %d)] with x [shape = (%d, %d)] completes in %.3f s" \
  % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td2)

t = time.time()
B3 = np.einsum("ijk, kl -> ijl", A, x)
td3 = time.time() - t
print "using np.einsum, it completes in %.3f s" % td3

2 answers:

Answer 0 (score: 3)

With smaller dims 10, 100, 200, I get a similar ranking:

In [355]: %%timeit
   .....: B=np.zeros((N,M,L))
   .....: for i in range(N):
   .....:     B[i,:,:]=np.dot(A[i,:,:],x)
   .....: 
10 loops, best of 3: 22.5 ms per loop
In [356]: timeit np.dot(A,x)
10 loops, best of 3: 44.2 ms per loop
In [357]: timeit np.einsum('ijk,km->ijm',A,x)
10 loops, best of 3: 29 ms per loop

In [367]: timeit np.dot(A.reshape(-1,M),x).reshape(N,M,L)
10 loops, best of 3: 22.1 ms per loop

In [375]: timeit np.tensordot(A,x,(2,0))
10 loops, best of 3: 22.2 ms per loop

The iteration is faster, but not by nearly as much as in your case.

This is probably right whenever the iteration dimension is small compared to the other dimensions. In that case, the overhead of iterating (function calls, etc.) is small relative to the computation time, and doing all the values at once uses more memory.
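To make that concrete (a quick sketch, not part of the original answer): with the question's shapes, the Python-level slicing and loop machinery costs microseconds, while each 2D product costs tens of milliseconds, so the per-iteration overhead is negligible.

import time
import numpy as np

A = np.random.randn(10, 1000, 1000)
x = np.random.randn(1000, 1200)

# Cost of one BLAS-backed 2D product.
t = time.time()
np.dot(A[0], x)
print("one 2D dot:         %.4f s" % (time.time() - t))

# Cost of the Python loop machinery alone (slicing views, no multiplication).
t = time.time()
for i in range(A.shape[0]):
    A[i, :, :]
print("loop overhead only: %.6f s" % (time.time() - t))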

I tried a dot version in which I reshaped A to 2d, thinking that dot did that kind of reshaping internally. I was surprised that it was actually the fastest. tensordot is probably doing the same reshaping (its code is readable Python, if you want to check).
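Spelled out with the question's shapes (a sketch added here, not from the original answer), the reshape trick and the tensordot call look like this, and both agree with the looped result:

import numpy as np

N, M, K, L = 10, 1000, 1000, 1200
A = np.random.randn(N, M, K)
x = np.random.randn(K, L)

# Collapse the leading axes so dot sees a plain 2D-by-2D product,
# then restore the 3D shape afterwards.
B_reshape = np.dot(A.reshape(-1, K), x).reshape(N, M, L)

# Contract the last axis of A against the first axis of x.
B_tensordot = np.tensordot(A, x, axes=(2, 0))

# Reference: the explicit loop over 2D dot products.
B_loop = np.array([np.dot(A[i], x) for i in range(N)])

assert np.allclose(B_reshape, B_loop) and np.allclose(B_tensordot, B_loop)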

einsum sets up a 'sum of products' iteration involving 4 variables, i, j, k, m, i.e. dim1*dim2*dim2*dim3 steps with a C-level nditer. So the more indices you have, the larger the iteration space.
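As an aside not in the original answer: later NumPy releases (1.12+) added an optimize keyword to einsum that rewrites contractions like this one as BLAS-backed calls instead of the plain nditer sweep, which closes most of the gap:

import numpy as np

A = np.random.randn(10, 1000, 1000)
x = np.random.randn(1000, 1200)

# Plain einsum: a single C-level nditer sweep over all four indices i, j, k, l.
B_plain = np.einsum("ijk, kl -> ijl", A, x)

# With optimize=True, einsum routes the contraction through BLAS-backed
# tensordot-style calls instead of the element-wise loop.
B_opt = np.einsum("ijk, kl -> ijl", A, x, optimize=True)

assert np.allclose(B_plain, B_opt)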

Answer 1 (score: 1)

numpy.dot only delegates to a BLAS matrix multiply when the inputs each have dimension at most 2:

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

When you stick the whole 3-dimensional A array into dot, NumPy takes a slower path, going through an nditer object. It still tries to get some use out of BLAS on that slow path, but the way the slow path is designed, it can only use vector-vector multiplication rather than matrix-matrix multiplication, which doesn't give BLAS anywhere near as much room to optimize.
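As a side note added here (not part of the original answer), np.matmul, i.e. the @ operator, was designed for exactly this stacked case: it treats the leading axis of A as a stack of 2D matrices and, in recent NumPy versions, dispatches each 2D product to BLAS, so it behaves like the explicit loop rather than like np.dot's slow path.

import numpy as np

A = np.random.randn(10, 1000, 1000)
x = np.random.randn(1000, 1200)

# matmul broadcasts x against the stack of 2D matrices in A and multiplies
# each A[i] by x, giving a (10, 1000, 1200) result.
B = A @ x                      # same as np.matmul(A, x)

# It matches the explicit loop over 2D dot products.
B_loop = np.stack([A[i].dot(x) for i in range(A.shape[0])])
assert np.allclose(B, B_loop)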
