Question

I have a function of the following structure,

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((len(X),N))
    out[:,0]  = X
    for i in range(1,N):
        out[:,i] = X**i+out[:,i-1]
    return out

which I am now trying to run on my GPU. What I have tried so far is writing the function in a non-vectorized form (i.e. treating each entry of X separately) and passing the return array as an input:

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

However, I don't know which decorator to use for this function. If I use

1. @numba.vectorize([(float64,int64,float64[:])],target = 'cuda'), I get TypeError: Buffer dtype cannot be buffer
2. @numba.guvectorize([(float64,int64,float64[:])],'(),()->(n)',target = 'cuda'), I get NameError: undefined output symbols: n

What is the correct decorator to use for my purpose?

I would like to be able to call foo_cuda in roughly the same way as foo, i.e. passing a 1D array X, an integer N and a 2D array out that gets filled with the result.

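Update

The numpy.vectorize version of my function would be

def foo_np(x,N):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out = np.zeros(N)
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]
    return out

foo_ve = np.vectorize(foo_np,signature='(),()->(n)')

However, I am not able to create the output array inside numba (out = np.zeros(N) fails, and so does cuda.local.array(N,dtype=float64)), which prevents me from using @numba.vectorize('void(float64,int64)',target='cuda'). I tried to fix this by passing the output array to the function and adding it to the signature (see my first attempt above), but I get an error.

Update 2

The actual function looks like this:

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2]
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out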

Answer 1

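I dug into this a bit, so I'll share what I have. I'm not sure whether it is a complete answer, but it may address some of your questions.

For this problem, the best numba-based approach is probably to write your own "custom" CUDA kernel using numba CUDA (jit). There are examples of this for a reduction and for matrix multiplication. Doing this properly requires learning something about CUDA programming. However, that does not seem to be the direction you want to go in.

As an alternative, and the focus of your question, you can use numba vectorization to generate GPU code. The numba vectorize decorator is for functions that operate on scalar inputs and outputs; the vectorization applies them across matrix inputs and matrix outputs.

For functions that don't fit this pattern, e.g. those that operate on a vector or a scalar but produce a vector, or those that operate on one or two vectors and produce a vector or scalar output, numba provides the generalized guvectorize.

Starting with your simplest example:

"What I have tried so far is writing the function in a non-vectorized form (i.e. treating each entry of X separately) and passing the return array as an input:"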

def foo_cuda(x,N,out):
    '''
    :param x: Scalar
    :param N: Integer
    :rtype: 1D numpy array of length N
    '''
    out[0] = x
    for i in range(1,N):
        out[i] = x**i+out[i-1]

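This intent, taking a scalar (per function call) and returning a vector (per function call), is within the capability of guvectorize (with some limitations/caveats; see the notes at the bottom).

Here is a worked example, derived from existing example code: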

# cat t2.py
from __future__ import print_function

import sys

import numpy as np

from numba import guvectorize, cuda

if sys.version_info[0] == 2:
    range = xrange


#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
        tmp1 = 0.
        tmp = inp[0]
        for i in range(out.shape[0]):
            tmp1 += tmp
            out[i] = tmp1
            tmp *= inp[0]

# set up input data
rows = 1280000 # shape[0]
cols = 4   # shape[1]
inp = np.zeros(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,0] = (i%4)+1
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('result'.center(80, '-'))
print(inp)

# nvprof --print-gpu-trace python t2.py
==4773== NVPROF is profiling process 4773, command: python t2.py
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4773== Profiling application: python t2.py
==4773== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
994.08ms  5.5731ms                    -               -         -         -         -  19.531MB  3.4224GB/s    Pageable      Device  Tesla P100-PCIE         1         7  [CUDA memcpy HtoD]
1.00083s  159.20us          (20000 1 1)        (64 1 1)        22        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>) [48]
1.00100s  4.8017ms                    -               -         -         -         -  19.531MB  3.9722GB/s      Device    Pageable  Tesla P100-PCIE         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

# nvprof --metrics gst_efficiency python t2.py
==4787== NVPROF is profiling process 4787, command: python t2.py
==4787== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)" (done)
-------------------------------------result-------------------------------------
[[  1.   2.   3.   4.]
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 ...
 [  2.   6.  14.  30.]
 [  3.  12.  39. 120.]
 [  4.  20.  84. 340.]]
==4787== Profiling application: python t2.py
==4787== Profiling result:
==4787== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla P100-PCIE-16GB (0)"
    Kernel: cudapy::__main__::__gufunc_my_func$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=2, A, mutable, aligned>)
          1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%
#

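To answer your specific question in the context of guvectorize:

"What is the correct decorator to use for my purpose?"

The function type specification looks like this: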

['<return-type>(<parameter 0 type>, <parameter 1 type>, ...)']

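For guvectorize, the return type should always be void. The parameter types should match the types you intend to call the function with. With guvectorize, we think of the types as describing the "slice" of the actual input and output data that we will pass, where a "slice" is what an individual function call operates on. The vectorization then applies individual function calls to each "slice" of the input/output data, so as to cover the entire input/output data set. For my example, then, I suggest passing an input vector (float32[:]) and an output vector (float32[:]).

The function signature shows the dimensions of the inputs, followed by the dimensions of the output: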

(x)...->(x)

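Each of these can be multidimensional (although it should still represent only the "slice" of the input/output to be vectorized), and a scalar can be represented by (). A wrinkle arises here, because we would like the output "slice" to be a vector of results of some length, say n. numba guvectorize does not seem to allow specifying an output dimension (n) that is not one of the already-specified input dimensions. So although this function really only needs a scalar input, I have chosen to work around this by passing a vector "slice" for both input and output. In fact this function, the way I have written it, can use the same data for input and output, so for this particular case the workaround carries no real "overhead".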
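To make the "slice" idea concrete without a GPU, here is a minimal CPU-only sketch that uses numpy.vectorize with the same (n)->(n) layout. The helper name running_power_sum is made up for illustration, and numpy.vectorize is only a stand-in for the way guvectorize hands one row to each call; it says nothing about how numba implements this on the device.

```python
import numpy as np

def running_power_sum(row):
    """Process one slice: out[i] = x + x**2 + ... + x**(i+1), with x = row[0]."""
    out = np.empty_like(row)
    x = row[0]
    acc = 0.0      # running sum of the powers of x
    term = x       # current power of x
    for i in range(row.shape[0]):
        acc += term
        out[i] = acc
        term *= x
    return out

# '(n)->(n)': each call receives a length-n slice and returns a length-n slice
apply_rows = np.vectorize(running_power_sum, signature='(n)->(n)')

data = np.zeros((3, 4), dtype=np.float32)
data[:, 0] = [1, 2, 3]          # only the first column carries the scalar input
result = apply_rows(data)
print(result)
```

As in the GPU version, only element 0 of each input row is read; the rest of the row exists purely so that the output length n is defined by an input dimension.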

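Some notes about implementation/performance:

1. This "parallelizes" across the first array dimension (shape[0]). All computation is performed on the GPU. The computation for each vector slice of the output is performed by a single GPU thread. For large data sets (in the first dimension), however, this exposes a large amount of parallel work (threads) to the GPU. The work in the second dimension, i.e. the loop, runs in the context of each thread. Parallelizing in the second dimension (creating, e.g., a prefix sum) is almost certainly not possible with vectorize or guvectorize, but should be possible with numba cuda (jit). We can confirm the dimension of parallelization by studying the (first) nvprof --print-gpu-trace output in the worked example above and noting that it consists of 20000 blocks of 64 threads each, for a total of 1280000 threads, matching our first array dimension.
2. As already mentioned, this implementation is somewhat hacky, because I pass a vector input even though I only need and use a scalar. This works around what appears to be a limitation in numba: as far as I can tell, you cannot specify a dimension signature like ()->(n). (Note: after further study, I believe the correct thing to do here is to define an input-only ufunc, passing the input vector/matrix once as a single argument rather than twice, and to use (n) as the dimension signature instead of (n)->(n).)
3. From the standpoint of memory access patterns, this implementation is not optimal. This is evident from the (second) nvprof --metrics gst_efficiency output. I believe the reason is again a (current) limitation of numba's design. When numba is presented with an array, in this example, for distribution/parallelization via guvectorize, it slices the array into rows and passes each row to a function call for processing. For this particular example, a more efficient implementation would transpose our array storage, slice the array into columns, and let each function call work on a column. The reasons have to do with GPU behavior and design, which I won't go over here. However, the metric shows that with this per-thread access pattern we achieve only 25% efficiency. A transposed-data, column-per-thread approach could easily achieve 100% efficiency, but I don't know how to do that with numba guvectorize. The only option I know of for fixing the inefficient access is to revert to writing a numba CUDA kernel (i.e. CUDA jit). (Another option I tried to investigate was whether we could instruct the ufunc to take a column vector in its signature, e.g. something like void(float32[[:]],float32[[:]]), in the hope that numba would slice by columns, but I had no luck.)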
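The two figures quoted in these notes (20000 blocks and 25% store efficiency) can be reproduced with simple arithmetic. The sketch below rests on a simplified model that the answer itself does not state: each thread stores one 4-byte float at a time, neighbouring threads are one row apart in memory, and stores are serviced in 32-byte segments. Both function names are hypothetical.

```python
import math

def blocks_needed(n_rows, threads_per_block):
    """Number of blocks required when one thread handles one row."""
    return math.ceil(n_rows / threads_per_block)

def store_efficiency(stride_bytes, elem_bytes=4, segment_bytes=32):
    """Fraction of each written segment that holds requested data (simplified model)."""
    # number of neighbouring threads whose stores land in the same segment
    stores_per_segment = max(1, segment_bytes // stride_bytes)
    return min(1.0, stores_per_segment * elem_bytes / segment_bytes)

rows, cols = 1280000, 4
print(blocks_needed(rows, 64))       # one thread per row, 64 threads per block
print(store_efficiency(cols * 4))    # row-per-thread layout: stride of 16 bytes
print(store_efficiency(4))           # transposed layout: neighbouring threads are adjacent
```

Under these assumptions the row-per-thread layout uses a quarter of every segment it writes, while the transposed layout uses all of it, which is consistent with the measured gst_efficiency and with the remark that a column-per-thread approach could reach 100%.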

Using the previous example as a model, we can create something similar to address your "actual function" from Update 2, using guvectorize. I have reduced the array sizes here to 5 (= len(X)) by 4 (= N):

# cat t3.py
from __future__ import print_function

import sys

import numpy as np
import numba
from numba import guvectorize, cuda
import math

if sys.version_info[0] == 2:
    range = xrange

@numba.jit(nopython = True)
def foo(X,N):
    '''
    :param X: 1D numpy array
    :param N: Integer >= 2
    :rtype: 2D numpy array of shape len(X) x N
    '''
    out = np.ones((X.shape[0],N))
    out[:,1] = X
    for i in range(2,N):
        out[:,i] = X*out[:,i-1] - (i-1)*out[:,i-2]
    c = 1
    for i in range(2,N):#Note that this loop cannot be combined with the one above!
        c *= i
        out[:,i] /= math.sqrt(c)
    return out

#    function type:
#        - has void return type
#
#    signature: (n)->(n)
#        - the function takes an array of n-element and output same.

@guvectorize(['void(float32[:], float32[:])'], '(n) ->(n)', target='cuda')
def my_func(inp, out):
    for i in range(2,out.shape[0]):
        out[i] = out[1]*out[i-1] - (i-1)*out[i-2]
    c = 1.
    for i in range(2,out.shape[0]):
        c *= i
        out[i] /= math.sqrt(c)

# set up input data
rows = 5   # shape[0]
cols = 4   # shape[1]
inp = np.ones(rows*cols, dtype=np.float32).reshape(rows, cols)
for i in range(inp.shape[0]):
    inp[i,1] = i
# invoke on CUDA with manually managed memory

dev_inp = cuda.to_device(inp)             # alloc and copy input data

my_func(dev_inp, dev_inp)             # invoke the gufunc

dev_inp.copy_to_host(inp)                 # retrieve the result

# print out
print('gpu result'.center(80, '-'))
print(inp)
rrows = rows
rcols = cols
rin = np.zeros(rrows, dtype=np.float32)
for i in range(rin.shape[0]):
    rin[i] = i
rout = foo(rin, rcols)
print('cpu result'.center(80, '-'))
print(rout)
# python t3.py
-----------------------------------gpu result-----------------------------------
[[ 1.          0.         -0.70710677 -0.        ]
 [ 1.          1.          0.         -0.8164966 ]
 [ 1.          2.          2.1213202   0.8164966 ]
 [ 1.          3.          5.656854    7.3484693 ]
 [ 1.          4.         10.606602   21.22891   ]]
-----------------------------------cpu result-----------------------------------
[[ 1.          0.         -0.70710678 -0.        ]
 [ 1.          1.          0.         -0.81649658]
 [ 1.          2.          2.12132034  0.81649658]
 [ 1.          3.          5.65685425  7.34846923]
 [ 1.          4.         10.60660172 21.2289111 ]]
#

This answer further discusses "slices" as they relate to guvectorize and vectorize, and walks through some examples.
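As a cross-check of the printed values, the per-row recurrence can also be written in plain Python, one call per entry of X, mirroring the work each GPU thread does for its row. The name foo_row is hypothetical; this is a CPU sketch for checking the numbers, not part of the original answer.

```python
import math

def foo_row(x, n):
    """Return the n values of the Update 2 recurrence for a single scalar x (n >= 2)."""
    out = [1.0] * n
    out[1] = float(x)
    # three-term recurrence on the unnormalized values
    for i in range(2, n):
        out[i] = x * out[i - 1] - (i - 1) * out[i - 2]
    # normalize by sqrt(i!) in a second pass, because the recurrence
    # above needs the unnormalized terms
    c = 1.0
    for i in range(2, n):
        c *= i
        out[i] /= math.sqrt(c)
    return out

for x in range(5):
    print(foo_row(x, 4))
```

Keeping the two loops separate is what the comment in the original function insists on: dividing inside the first loop would feed already-normalized values back into the recurrence.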

Numba.vectorize for CUDA: What is the correct signature to return arrays?