Fastest way to concatenate slices of numpy arrays

Time: 2020-10-14 09:45:24

Tags: python numpy cython

I have many small numpy arrays (groups) of different sizes, and I want to concatenate arbitrary subsets of these groups as fast as possible. The solution I initially came up with was to store the groups as an np.array of np.arrays and access subsets of groups with list indexing:

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size=size))
groups = np.array(groups)  # dtype=np.object
indices = np.random.randint(len(groups), size=1000)

%%timeit
np.concatenate(groups[indices])
>>> 204 µs ± 395 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, since the groups are small (2 elements on average), this solution is inefficient in terms of memory consumption: I have to store a numpy array structure for every group, which is almost 100 bytes (too much for me).
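
For reference, this per-array overhead can be checked directly; the exact figure depends on the NumPy version and platform, but even a 2-element array reports on the order of 100 bytes:

import sys
import numpy as np

# Each small array carries a fixed object/header overhead on top of its data buffer.
a = np.array([1, 2], dtype=np.int64)
print(sys.getsizeof(a))  # typically a bit over 100 bytes on 64-bit builds
print(a.nbytes)          # only 16 bytes of actual data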

To make the solution more memory-efficient, I decided to concatenate all the groups and store the array boundaries in a separate array:

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
# ith group is data[offsets[i]: offsets[i + 1]]
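
A quick sanity check that this flattened layout reproduces the original groups (the index 5 below is arbitrary):

i = 5
assert np.array_equal(data[offsets[i]: offsets[i + 1]], groups[i])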

However, performing the concatenation is no longer straightforward. Doing it like this:

%%timeit
np.concatenate([data[offsets[i]: offsets[i + 1]] for i in indices])
>>> 1.02 ms ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

works 5 times slower than the original solution. I think this is due to two things. First, iterating over the indices of a numpy array (Python wraps a C int into an object for every index). Second, Python creates a numpy array structure for every slice/index. I don't think the concatenation time can be reduced in pure Python here, so I decided to come up with a Cython solution.

%%cython
import numpy as np
ctypedef long long int64

def concatenate(data, offsets, indices):
    cdef int64[:] data_view = data
    cdef int64[:] indices_view = indices
    cdef int64[:] offsets_view = offsets
    
    size = np.sum(offsets[indices + 1]) - np.sum(offsets[indices])
    res = np.zeros(size, dtype=np.int64)
    cdef int64[:] res_view = res
    
    cdef int64 i, l = 0, r
    for i in indices_view:
        r = l + offsets_view[i + 1] - offsets_view[i]
        res_view[l: r] = data_view[offsets_view[i]: offsets_view[i + 1]]
        l = r
    return res

%%timeit
concatenate(data, offsets, indices)
>>> 277 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This solution is faster than the previous one, but still a bit slower than the original. The biggest problem, though, is that I don't know the data type in advance. I used int64 in the example, but it could be any numeric type, e.g. float32, so I can't use typed memoryviews the way I did above. In theory, I only need to know the size of the type (4/8 bytes), and if I have pointers to the data and result arrays I could use memcpy or something similar to copy the slices. But I don't know how to do that in Cython. Is there a way?
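
As a side note, the "copy raw bytes" idea can at least be expressed in pure numpy by viewing the data as uint8 and scaling the offsets by the item size. This is only a sketch of the dtype-agnostic part (the helper name concat_bytes is mine) and does not by itself address the speed problem:

import numpy as np

def concat_bytes(data, offsets, indices):
    # Reinterpret the contiguous buffer as raw bytes and turn element offsets
    # into byte offsets, so the copy itself never depends on the dtype.
    itemsize = data.dtype.itemsize
    byte_data = data.view(np.uint8)
    byte_offsets = offsets * itemsize
    parts = [byte_data[byte_offsets[i]: byte_offsets[i + 1]] for i in indices]
    return np.concatenate(parts).view(data.dtype)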

1 Answer:

Answer 0 (score: 1)

Here is my pure numpy-only solution, the adv_concatenate() function. It gives a 15x-47x speedup (it differs between machines) over the regular np.concatenate().

Note: there is a second, even faster solution after the first block of code.

To measure time I used the pip module timerit; install it once via python -m pip install timerit. Two kinds of machines were used for the timings: the first is Windows-based and is the same for all tests (my home laptop); the second is Linux-based and is a different machine for each test (so speed differs between tests, but is the same within one run/test), it is the machine of the repl.it site that I also used to test the code.

The idea of the algorithm is to use numpy's cumulative sum function (.cumsum()):

  1. We create an array of just 1s whose size equals the total size of the resulting concatenated data array. This array will hold the indices of all data elements to be fetched to build the resulting data array.
  2. At the starting position of each block (small sub-array) inside this array, we change the value in such a way that, after running cumsum() over the whole array, this starting value is converted into the starting offset into the data array. The remaining values stay 1s.
  3. We apply .cumsum() to this array. Now every value holds the correct index of a data element to be fetched.
  4. We build the fetched data array by simply indexing data with the index array formed above. (A small worked example follows this list.)
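
To make steps 1-4 concrete, here is a small worked toy example (the data values and group layout below are made up purely for illustration):

import numpy as np

# Toy layout: 3 groups stored contiguously, group i is data[offsets[i]: offsets[i + 1]].
data = np.array([10, 11, 20, 21, 22, 30, 31])
offsets = np.array([0, 2, 5, 7])
indices = np.array([2, 0])                             # fetch group 2, then group 0

begs, ends = offsets[indices], offsets[indices + 1]    # [5, 0], [7, 2]
lens = ends - begs                                     # [2, 2]
clens = lens.cumsum()                                  # [2, 4]
ix = np.ones(clens[-1], dtype=offsets.dtype)           # [1, 1, 1, 1]
ix[0] = begs[0]                                        # [5, 1, 1, 1]
ix[clens[:-1]] = begs[1:] - ends[:-1] + 1              # [5, 1, -6, 1]
ix = ix.cumsum()                                       # [5, 6, 0, 1]
print(data[ix])                                        # [30 31 10 11]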

The efficiency of this algorithm could probably be improved further by pre-computing some values, like offsets[1:] - offsets[:-1] and offsets[:-1] + 1, and using them inside the adv_concatenate() function.
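
A minimal sketch of what such a pre-computed variant might look like; it mirrors the adv_concatenate() function shown in the code below, and the name adv_concatenate_pre and the lens_all argument are my own, not taken from the answer:

import numpy as np

def adv_concatenate_pre(data, offsets, indices, lens_all):
    # lens_all = offsets[1:] - offsets[:-1] is computed once, outside the hot path.
    begs = offsets[indices]
    lens = lens_all[indices]               # reuse pre-computed group lengths
    ends = begs + lens
    clens = lens.cumsum()
    ix = np.ones(clens[-1], dtype=offsets.dtype)
    ix[0] = begs[0]
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
    return data[ix.cumsum()]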

Try it online!


# Needs: python -m pip install numpy timerit
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size))
groups = np.array(groups)  # dtype=np.object
indices = np.random.randint(len(groups), size = 1000)

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])

timer = lambda: Timerit(num = 600, verbose = 1)

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

def adv_concatenate(data, offsets, indices):
    begs, ends = offsets[indices], offsets[indices + 1]
    lens = ends - begs
    clens = lens.cumsum()
    ix = np.ones((clens[-1],), dtype = offsets.dtype)
    ix[0] = begs[0]
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
    ix = ix.cumsum()
    return data[ix]
    
print('adv_concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate(data, offsets, indices)
tadv = tim.mean()
assert np.array_equal(ref, adv) # Check that our solution is correct

print('speedup:', round(tref / tadv, 3))

Output:

On the first machine (Windows):

np.concatenate(): Timed best=3.129 ms, mean=3.225 +- 0.1 ms
adv_concatenate(): Timed best=191.137 us, mean=208.012 +- 20.7 us
speedup: 15.504

On the second machine (Linux):

np.concatenate(): Timed best=1.666 ms, mean=2.314 +- 0.4 ms
adv_concatenate(): Timed best=35.596 us, mean=48.680 +- 15.4 us
speedup: 47.532

The second solution is even faster, by 40x-150x (it differs between machines) compared to the regular np.concatenate(). But the second solution uses the Numba JIT (LLVM-based) compiler, which needs to be installed via python -m pip install numba.

Although it uses the extra numba package, the central function adv_concatenate_indexes_numba() is very simple, with about the same number of lines of code as in the first solution. The algorithm is also very simple, just two plain loops.

The current solution works for any data type, because the central function only computes the resulting indices and therefore does not deal with data's dtype at all. It could also be boosted by another 10%-90% if, instead of computing indices, the numba function computed the resulting data array directly, but that only works for the fairly simple data types supported by numba, which include all numeric types. Here is the code (or here) for this improved solution, which achieves up to a 250x speedup! Timings of this improved version on the second machine (Linux):

np.concatenate(): Timed best=1.640 ms, mean=3.403 +- 1.9 ms
adv_concatenate_numba(): Timed best=12.669 us, mean=17.235 +- 6.9 us
speedup: 197.46
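
The linked improved code is not reproduced in this answer. A minimal sketch of that direct-data idea, under the assumption that data is a numba-supported numeric array, might look like the following (the function name adv_concatenate_data_numba is mine):

import numba
import numpy as np

@numba.njit(cache = True)
def adv_concatenate_data_numba(data, offsets, indices):
    # First pass: total length of the result.
    tlen = 0
    for i in range(indices.size):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
    # Second pass: copy the selected slices straight into the result,
    # so no intermediate index array is materialized.
    res = np.empty(tlen, dtype = data.dtype)
    pos = 0
    for i in range(indices.size):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            res[pos] = data[j]
            pos += 1
    return res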

Next comes the code of the more generic (index-computation-only) solution:

Try it online!


# Needs: python -m pip install numpy numba timerit
from timerit import Timerit
import numpy as np, numba
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups)  # dtype=np.object
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)

timer = lambda: Timerit(num = 600, verbose = 1)

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

@numba.njit('i8[:](i8[:], i8[:])', cache = True)
def adv_concatenate_indexes_numba(offsets, indices):
    tlen = 0
    for i in range(indices.size):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
        
    pos, r = 0, np.empty((tlen,), dtype = offsets.dtype)
    for i in range(indices.size):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            r[pos] = j
            pos += 1
            
    return r
    
def adv_concatenate2(data, offsets, indices):
    return data[adv_concatenate_indexes_numba(offsets, indices)]
    
adv_concatenate2(data, offsets, indices) # Once pre-compile Numba
print('adv_concatenate2(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate2(data, offsets, indices)
tadv = tim.mean()
assert np.array_equal(ref, adv) # Check that our solution is correct

print('speedup:', round(tref / tadv, 3))

Output:

On the first machine (Windows):

np.concatenate(): Timed best=3.201 ms, mean=3.356 +- 0.1 ms
adv_concatenate2(): Timed best=79.681 us, mean=82.991 +- 6.7 us
speedup: 40.442

On the second machine (Linux):

np.concatenate(): Timed best=1.541 ms, mean=2.220 +- 0.7 ms
adv_concatenate2(): Timed best=12.012 us, mean=14.830 +- 4.8 us
speedup: 149.716

Inspired by the Cython code from @pavelgramovich's answer, I also decided to implement a simplified version that uses a plain loop (function concatenate1()) instead of the memcpy() version (function concatenate0()). For the current test data, the simplified version appears to be 1.5-2x faster than the memcpy() version. The full code comparing the two versions is below:

Try it online!


# Needs: python -m pip install numpy timerit cython setuptools
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups)  # dtype=np.object
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)

timer = lambda: Timerit(num = 600, verbose = 1)

def compile_cy_cats():
    src = """
import numpy as np
cimport numpy as np
cimport cython
from libc.string cimport memcpy 

@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate0(np.ndarray data, np.ndarray offsets, np.ndarray indices):
    data = np.ascontiguousarray(data)
    start_offsets = np.ascontiguousarray(offsets[indices], dtype=np.int64)
    end_offsets = np.ascontiguousarray(offsets[indices + 1], dtype=np.int64)
    cdef np.int64_t[::1] coffsets = start_offsets
    cdef np.int64_t[::1] csizes = end_offsets - start_offsets
    
    cdef np.int64_t i, total_size = 0
    for i in range(csizes.shape[0]):
        total_size += csizes[i]
    res = np.empty(total_size, dtype=data.dtype)

    cdef np.ndarray cdata = data
    cdef np.ndarray cres = res
    
    cdef np.int64_t itemsize = data.itemsize
    cdef np.int64_t res_offset = 0
    for i in range(csizes.shape[0]):
        memcpy(cres.data + res_offset * itemsize, 
               cdata.data + coffsets[i] * itemsize, 
               csizes[i] * itemsize)
        res_offset += csizes[i]
    
    return res

@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate1(np.int64_t[:] data, np.int64_t[:] offsets, np.int64_t[:] indices):
    cdef np.int64_t tlen = 0, pos = 0, ix = 0, ixs = indices.size, i = 0, j = 0
    
    for i in range(ixs):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
        
    r = np.empty(tlen, dtype = np.int64)
    cdef np.int64_t[:] cr = r, cdata = data

    for i in range(ixs):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            cr[pos] = cdata[j]
            pos += 1
    
    return r
    """
    
    srcb = src.encode('utf-8')
    
    import hashlib, os, glob, importlib
    srch = hashlib.sha256(srcb).hexdigest().upper()[:8]

    if len(glob.glob(f'cy{srch}*')) == 0:
        with open(f'cys{srch}.pyx', 'wb') as f:
            f.write(srcb)

        import sys
        from setuptools import setup, Extension
        from Cython.Build import cythonize
        import numpy as np

        sys.argv += ['build_ext', '--inplace']
        setup(
            ext_modules = cythonize(
                Extension(f'cy{srch}', [f'cys{srch}.pyx']), language_level = 3, annotate = True,
            ),
            include_dirs = [np.get_include()],
        )
        del sys.argv[-2:]

    print('Cython module:', f'cy{srch}')
    return importlib.import_module(f'cy{srch}')

cy_cats = compile_cy_cats()
concatenate0, concatenate1 = cy_cats.concatenate0, cy_cats.concatenate1

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

concatenate0(data, offsets, indices) # Maybe pre-heat
print('cy_concatenate0(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv0 = concatenate0(data, offsets, indices)
tadv0 = tim.mean()
assert np.array_equal(ref, adv0) # Check that our solution is correct

print('speedup:', round(tref / tadv0, 3))

concatenate1(data, offsets, indices) # Maybe pre-heat
print('cy_concatenate1(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv1 = concatenate1(data, offsets, indices)
tadv1 = tim.mean()
assert np.array_equal(ref, adv1) # Check that our solution is correct

print('speedup:', round(tref / tadv1, 3))

Output:

On the first machine (Windows):

Cython module: cy0BEBA0C8
np.concatenate(): Timed best=3.184 ms, mean=3.263 +- 0.1 ms
cy_concatenate0(): Timed best=119.767 us, mean=128.688 +- 10.7 us
speedup: 25.354
cy_concatenate1(): Timed best=86.525 us, mean=93.699 +- 20.5 us
speedup: 34.821