Question

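Usually, when gluing Python and C code together, one needs to convert a Python list to contiguous memory, e.g. an array.array. It is not unusual for this conversion step to become the bottleneck, so I find myself doing silly things with Cython because it is faster than the built-in Python solutions.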

For example, to convert a Python list lst to int32 contiguous memory, I am aware of two possibilities:

a=array.array('i', lst)

and

a=array.array('i')
a.fromlist(lst)
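
As a quick sanity check, here is a minimal sketch showing that both variants end up with the same array contents. The sample list is made up for illustration, and note that 'i' means a C int, which is 4 bytes on common platforms but is not guaranteed to be.

```python
import array

lst = [3, 1, 4, 1, 5]  # made-up sample data

# Variant 1: pass the list to the constructor.
a1 = array.array('i', lst)

# Variant 2: create an empty array, then append the list's items.
a2 = array.array('i')
a2.fromlist(lst)

# Both hold the same items in one contiguous buffer.
same = (a1 == a2)
nbytes = len(a1) * a1.itemsize  # size of the buffer in bytes
print(same, nbytes)
```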

However, both are slower than the following Cython version:

%%cython
import array
from cpython cimport array
def array_from_list_iter(lst):
    cdef Py_ssize_t n=len(lst)
    cdef array.array res=array.array('i')
    cdef int cnt=0
    array.resize(res, n)  #preallocate memory
    for i in lst:
        res.data.as_ints[cnt]=i  #unbox the Python int into the C buffer
        cnt+=1
    return res

My timings (Linux, Python 3.6, but the results are very similar on Windows and/or Python 2.7) show that the Cython solution is about 6 times faster:

Size       new_array   from_list  cython_iter    factor
1             284ns    347ns        176ns           1.6
10            599ns    621ns        209ns           2.9
10**2         3.7µs    3.5µs        578ns           6.1
10**3        38.5µs    32µs         4.3µs           7.4
10**4         343µs    316µs       40.4µs           7.8
10**5         3.5ms    3.4ms        481µs           7.1
10**6        34.1ms    31.5ms       5.0ms           6.3
10**7         353ms    316ms       53.3ms           5.9

With my limited understanding of CPython, I would say that the from_list solution uses this built-in function:

static PyObject *
array_array_fromlist(arrayobject *self, PyObject *list)
{
    Py_ssize_t n;

    if (!PyList_Check(list)) {
        PyErr_SetString(PyExc_TypeError, "arg must be list");
        return NULL;
    }
    n = PyList_Size(list);
    if (n > 0) {
        Py_ssize_t i, old_size;
        old_size = Py_SIZE(self);
        if (array_resize(self, old_size + n) == -1)
            return NULL;
        for (i = 0; i < n; i++) {
            PyObject *v = PyList_GetItem(list, i);
            if ((*self->ob_descr->setitem)(self,
                            Py_SIZE(self) - n + i, v) != 0) {
                array_resize(self, old_size);
                return NULL;
            }
        }
    }
    Py_RETURN_NONE;
}

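a=array.array('i', lst) grows dynamically and needs reallocations, which could explain some of the slowdown (but, as the measurements show, not much!), whereas array_array_fromlist preallocates the needed memory; it is basically the same algorithm as the Cython code.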
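
To make the structure of the C function easier to follow, here is a rough pure-Python model of it. The name fromlist_model is hypothetical and this is only a sketch of the control flow (grow once, store item by item, roll back on failure), not the real implementation; it says nothing about speed.

```python
import array

def fromlist_model(arr, lst):
    """Hypothetical pure-Python model of the C function shown above."""
    if not isinstance(lst, list):
        raise TypeError("arg must be list")
    n = len(lst)
    if n > 0:
        old_size = len(arr)
        # Grow once by n zeroed slots (the C code calls array_resize here).
        arr.frombytes(bytes(n * arr.itemsize))
        for i, v in enumerate(lst):
            try:
                # Per-item conversion and range check (setitem in C).
                arr[old_size + i] = v
            except (TypeError, OverflowError):
                # Roll back to the old size, as the C code does.
                del arr[old_size:]
                raise

a = array.array('i', [1])
fromlist_model(a, [2, 3])
print(a.tolist())
```

The rollback is observable with the real method too: if one item cannot be converted, fromlist raises and the array keeps its previous contents.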

So the question: why is this Python code 6 times slower than the Cython code? What am I missing?

Here is the code for measuring the timings:

import array
import numpy as np
for n in [1, 10,10**2, 10**3, 10**4, 10**5, 10**6, 10**7]:
    print ("N=",n)
    lst=list(range(n))
    print("python:")
    %timeit array.array('i', lst)
    print("python, from list:")
    %timeit a=array.array('i'); a.fromlist(lst)
    print("numpy:")
    %timeit np.array(lst, dtype=np.int32)
    print("cython_iter:")
    %timeit array_from_list_iter(lst)

The numpy solution is about a factor of 2 slower than the Python versions.
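
For anyone who wants to reproduce the two pure-Python measurements outside IPython, a minimal sketch using the standard timeit module could look like this. The helper name bench and the repeat/number values are my own choices, not part of the measurements above, and the numbers it prints will differ from machine to machine.

```python
import array
import timeit

def bench(n, repeat=5, number=10):
    """Time both pure-Python conversions for a list of n ints (hypothetical helper)."""
    lst = list(range(n))
    variants = {
        "new_array": lambda: array.array('i', lst),
        "from_list": lambda: array.array('i').fromlist(lst),
    }
    results = {}
    for name, fn in variants.items():
        # Take the best of several runs, then average over the inner loop.
        best = min(timeit.repeat(fn, repeat=repeat, number=number))
        results[name] = best / number
    return results

for n in (1, 10, 10**2, 10**3):
    timings = bench(n)
    print(n, {k: "%.2f us" % (v * 1e6) for k, v in timings.items()})
```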

Answer 1

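The biggest difference seems to be the actual int unboxing. The CPython array implementation uses PyArg_Parse, while Cython ends up calling PyLong_AsLong, at least I think so, through several layers of macros.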

%%cython -a
from cpython cimport PyArg_Parse
def arg_parse(obj):
    cdef int i
    for _ in range(100000):
        PyArg_Parse(obj, "i;array item must be integer", &i)
    return i

def cython_parse(obj):
    cdef int i
    for _ in range(100000):
        i = obj
    return i

%timeit arg_parse(1)
# 2.52 ms ± 67.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit cython_parse(1)
# 299 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
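
In case "unboxing" is unclear: it means extracting the raw C integer from a Python int object. A small sketch, assuming CPython (ctypes.pythonapi is CPython-specific), calls PyLong_AsLong directly; the helper name unbox is hypothetical, and this only illustrates the call, it is not a benchmark.

```python
import ctypes

# ctypes.pythonapi exposes the C API of the running CPython interpreter.
_as_long = ctypes.pythonapi.PyLong_AsLong
_as_long.argtypes = [ctypes.py_object]
_as_long.restype = ctypes.c_long

def unbox(obj):
    """Return the C long stored in a Python int (hypothetical helper)."""
    return _as_long(obj)

value = unbox(12345)
print(value)
```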

Why is the built-in array.fromlist() slower than Cython code?
