Question

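I followed the instructions at http://docs.cython.org/en/latest/src/tutorial/numpy.html

but ran into a problem when I tried to build my own code:

(The purpose of the code is simply to compute the union area of two triangles.)

My .pyx code: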

cimport cython
import numpy as np
cimport numpy as np

DTYPE = np.float64  # np.float (an alias of the builtin float) was removed in NumPy 1.24
ctypedef np.float_t DTYPE_t

cpdef DTYPE_t union(np.ndarray[DTYPE_t, ndim=1] au, np.ndarray[DTYPE_t, ndim=1] bu, DTYPE_t area_intersection):
    cdef DTYPE_t area_a
    cdef DTYPE_t area_b
    cdef DTYPE_t area_union
    cdef DTYPE_t a = au[2]
    cdef DTYPE_t b = au[0]
    cdef DTYPE_t c = au[3]
    cdef DTYPE_t d = au[1]
    cdef DTYPE_t e = bu[2]
    cdef DTYPE_t f = bu[0]
    cdef DTYPE_t g = bu[3]
    cdef DTYPE_t h = bu[1]
    area_a = (a - b) * (c - d)
    area_b = (e - f) * (g - h)
    area_union = area_a + area_b - area_intersection
    return area_union

My .py code:

import numpy as np
import random


def union(au, bu,area_intersection):
    area_a = (au[2] - au[0]) * (au[3] - au[1])
    area_b = (bu[2] - bu[0]) * (bu[3] - bu[1])
    area_union = area_a + area_b - area_intersection
    return area_union

My setup.py file:

from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize
import numpy

setup(ext_modules = cythonize('union.pyx'),include_dirs=[numpy.get_include()])

I use the following code to test the speed of the Cython version:

from union_py import union as py_speed
from union import union as cy_speed
import numpy as np
import time

np.random.seed(1)
start = time.time()
for i in range (1000000):
    in_a = np.random.rand(4)
    in_b = np.random.rand(4)
    c = cy_speed(au = in_a,bu = in_b,area_intersection = 2.1)

end = time.time()
print (end - start)

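For the Python timing, I simply change cy_speed to py_speed.

The results show that the Cython version takes 2.291128158569336 s while the Python version takes 2.0604214668273926 s, so the Python version is faster. I have verified that the Cython code computes the union area correctly. How can I improve the Cython code so that it runs faster?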

Answer 1

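DavidW's hunch is right: Cython has to check the type of the passed arrays at run time, and because the function itself does so little work, that overhead cannot be recovered.

A numpy array is not the best choice for this task. As we will see, using a cdef class can beat Python by a factor of 10.

For my experiments I use a slightly different setup: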

>>> import numpy as np
>>> a=np.random.rand(4)
>>> b=np.random.rand(4)

>>> %timeit py_union(a,b,2.1)
1.3 µs ± 51.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

>>> %timeit cy_union(a,b,2.1)
1.39 µs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

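So the Cython version is indeed slightly slower. As DavidW pointed out, this is due to Cython's type checking. Looking at the generated C code, the following has to happen before the first line of the function is evaluated: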

...
__Pyx_LocalBuf_ND __pyx_pybuffernd_au;
...
if (unlikely(__Pyx_GetBufferAndValidate(&__pyx_pybuffernd_au.rcbuffer->pybuffer, (PyObject*)__pyx_v_au, &__Pyx_TypeInfo_nn___pyx_t_3foo_DTYPE_t, PyBUF_FORMAT| PyBUF_STRIDES, 1, 0, __pyx_stack) == -1)) __PYX_ERR(0, 7, __pyx_L1_error)

The definition of __Pyx_GetBufferAndValidate can be looked up in the Cython sources, and it is easy to see that it is not free.
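To get a feel for what such a validation has to inspect, the same buffer-protocol fields can be viewed from Python with the built-in memoryview. This is only an illustration of the data involved (format, dimensionality, strides), not of Cython's actual C implementation; the helper name describe_buffer is made up for this sketch.

```python
import numpy as np

def describe_buffer(arr):
    """Return the buffer-protocol fields a consumer of `arr` can inspect."""
    view = memoryview(arr)  # acquires the buffer exported by the array
    return {
        "format": view.format,      # element type code, 'd' means C double
        "ndim": view.ndim,          # number of dimensions
        "shape": view.shape,
        "strides": view.strides,    # byte step between consecutive elements
        "itemsize": view.itemsize,  # bytes per element
    }

a = np.random.rand(4)
info = describe_buffer(a)
print(info)
```

A typed argument such as np.ndarray[DTYPE_t, ndim=1] needs this kind of information to match the declaration before the function body can run, which is work done on every call.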

Let us verify this with two experiments. First, reduce the number of operations in the function:

%%cython
import numpy as np
cimport numpy as np

ctypedef np.float_t DTYPE_t

cpdef DTYPE_t cy_silly1(np.ndarray[DTYPE_t, ndim=1] au, np.ndarray[DTYPE_t, ndim=1] bu, DTYPE_t area_intersection):
    area_union = au[0] + bu[1] - area_intersection
    return area_union

>>> %timeit cy_silly1(a,b,2.1)
1.4 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

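We reduced the number of operations in the function, but this had no effect on the execution time, i.e. this part of the function is not the bottleneck.

What happens if there is only one numpy array to check?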

%%cython
import numpy as np
cimport numpy as np

ctypedef np.float_t DTYPE_t

cpdef DTYPE_t cy_silly2(np.ndarray[DTYPE_t, ndim=1] au, DTYPE_t area_intersection):
    cdef DTYPE_t area_union = au[0] + au[1] - area_intersection
    return area_union

>>> %timeit cy_silly2(a,2.1)
745 ns ± 7.46 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

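This time we get a speed-up of almost 2: __Pyx_GetBufferAndValidate really is the bottleneck.

What can be done? Typed memory views have slightly lower overhead because they use a completely different mechanism: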

%%cython
...
cpdef DTYPE_t cy_union_tmv(DTYPE_t[::1] au, DTYPE_t[::1] bu, DTYPE_t area_intersection):
...#the same as above

%timeit cy_union_tmv(a,b,2.1)
1.09 µs ± 3.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

A better idea is to write a dedicated cdef class, which has far less overhead:

import numpy as np
cimport numpy as np

DTYPE = np.float64  # np.float (an alias of the builtin float) was removed in NumPy 1.24
ctypedef np.float_t DTYPE_t

cdef class Triangle:
   cdef DTYPE_t a
   cdef DTYPE_t b
   cdef DTYPE_t c
   cdef DTYPE_t d
   def __init__(self, a,b,c,d):
      self.a=a
      self.b=b
      self.c=c
      self.d=d
   cdef DTYPE_t get_area(self):
      return (self.a-self.b)*(self.c-self.d)


cpdef DTYPE_t cy_union_cdef(Triangle au, Triangle bu, DTYPE_t area_intersection):
    cdef DTYPE_t area_union = au.get_area() + bu.get_area() - area_intersection 
    return area_union

Now:

>>> tri_a=Triangle(a[0],a[1],a[2],a[3])
>>> tri_b=Triangle(b[0],b[1],b[2],b[3]) 
>>> %timeit cy_union_cdef(tri_a,tri_b,2.1)
106 ns ± 0.668 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

This yields a speed-up of about 10.
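The measurements in this answer come from IPython's %timeit. A rough way to get comparable per-call timings in a plain script is the standard timeit module. The sketch below times only the pure-Python union from the question and assumes nothing about the compiled modules; time_per_call is a hypothetical helper, and the lambda wrapper adds a little overhead of its own, so the numbers will not match %timeit exactly.

```python
import timeit
import numpy as np

def py_union(au, bu, area_intersection):
    # Pure-Python version from the question
    area_a = (au[2] - au[0]) * (au[3] - au[1])
    area_b = (bu[2] - bu[0]) * (bu[3] - bu[1])
    return area_a + area_b - area_intersection

def time_per_call(func, *args, number=20_000, repeat=5):
    """Best-of-`repeat` average time per call, in seconds."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Inputs are created once, outside the timed region
a = np.random.rand(4)
b = np.random.rand(4)
seconds = time_per_call(py_union, a, b, 2.1)
print(f"{seconds * 1e9:.0f} ns per call")
```

A compiled function (for example cy_union_cdef with two Triangle instances) could be passed to time_per_call in the same way.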

Answer 2

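Avoid function call overhead

You are calling a very simple function. If you did this in C, the compiler would hopefully inline this simple function to avoid the function call overhead (which is far lower than the overhead of calling a Python function).

I assume that in a real-world example your in_a, in_b and area_intersection are stored in arrays. In that case you have to pass the whole arrays to the compiled function.

In the following example I show a simple way of using Numba for this kind of task; Numba can also inline simple functions. This is not exactly what you asked for, but it makes the job a lot easier and can serve as a template for an efficient Cython implementation.

As mentioned before, random number generation dominates the runtime of the benchmark. To avoid this, I generate the random numbers outside the benchmark.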

import numpy as np
import numba as nb
import time

# Comment out the decorator to test the pure-Python version.
# Don't use cache=True when pasting the function into an interactive interpreter.
@nb.njit(fastmath=True,cache=True)
def union(au, bu,area_intersection):
  area_a = (au[2] - au[0]) * (au[3] - au[1])
  area_b = (bu[2] - bu[0]) * (bu[3] - bu[1])
  area_union = area_a + area_b - area_intersection
  return area_union

@nb.njit(fastmath=True,cache=True)
def Union_Arr(in_a,in_b,area_intersection):
  c=np.empty(in_a.shape[0],dtype=in_a.dtype)
  for i in range (in_a.shape[0]):
    c[i] = union(in_a[i,:],in_b[i,:],area_intersection[i])

  return c

#generating testdata
np.random.seed(1)
in_a = np.random.rand(1000000,4)
in_b = np.random.rand(1000000,4)
area_intersection = np.random.rand(1000000)

#Warm up
#even loading cached native code takes a while,
#we don't want to measure a constant overhead (about 60ms)
#in a performance critical code segment, that is called many times
c=Union_Arr(in_a,in_b,area_intersection)

start = time.time()
c=Union_Arr(in_a,in_b,area_intersection)
end = time.time()
print (end - start)

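Results for 1 million triangles per call

Pure Python: 1.92 s for 1,000,000 triangles (1.92 µs per triangle intersection)

Numba: 0.007 s for 1,000,000 triangles (7 ns per triangle intersection)

In conclusion, it is essential to avoid calling tiny functions from non-compiled code. Even the optimized function from @ead is much slower than the example above.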
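The same principle, handing whole arrays to compiled code in a single call, can also be sketched with plain NumPy vectorization, without Numba or Cython. This is an alternative that the answer above does not use, and no timing is claimed for it. The name union_vectorized is made up for this sketch, and it assumes in_a and in_b have shape (n, 4) like the test data above.

```python
import numpy as np

def union_vectorized(in_a, in_b, area_intersection):
    """Union area for every row at once; in_a and in_b have shape (n, 4)."""
    area_a = (in_a[:, 2] - in_a[:, 0]) * (in_a[:, 3] - in_a[:, 1])
    area_b = (in_b[:, 2] - in_b[:, 0]) * (in_b[:, 3] - in_b[:, 1])
    return area_a + area_b - area_intersection

# Small test data set generated outside any timed region
rng = np.random.default_rng(1)
in_a = rng.random((1000, 4))
in_b = rng.random((1000, 4))
area_intersection = rng.random(1000)
c = union_vectorized(in_a, in_b, area_intersection)

# Cross-check against the row-by-row formula from the question
expected = np.array([
    (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - s
    for a, b, s in zip(in_a, in_b, area_intersection)
])
assert np.allclose(c, expected)
```

Here the loop over rows runs inside NumPy's compiled routines, so the Python interpreter makes only a handful of calls regardless of n.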

How can I use Cython to speed up numpy?