Numba and guvectorize for CUDA target: code running slower than expected

Date: 2018-08-27 19:49:59

Tags: python performance cuda numba nvprof

Key details

  • Large datasets (10 million x 5), (200 x 10 million x 5)
  • Mostly numpy
  • Takes longer after every run
  • Using Spyder3
  • Windows 10

The first thing I tried was to use guvectorize with the following function. I am passing in a bunch of numpy arrays and attempting to use them to multiply across two of the arrays. This works if run with a target other than cuda. However, switching to cuda results in an unknown error:

File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", line 82, in jitwrapper
    debug=debug)

TypeError: __init__() got an unexpected keyword argument 'debug'

Following everything I could find on this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing, but oh well. It should also be said that this error only occurs after running the code once and having it crash due to memory overload.

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy arrays

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'],
             '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Attempting to run the code with nvprofiler from the command line results in the following warning:

Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, the system falls back to using zero-copy memory. It can cause kernels that access unified memory to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I am using SLI-enabled graphics cards (both cards are identical, evga gtx 1080ti, and have the same device ID), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to limit it to one card, but I am left with the same results.

I can still run the code with nvprof, but the cuda function is slower than njit(parallel=True) with prange. Using a smaller data size we can run the code, but it is slower than target='parallel' and target='cpu'.
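For reference, a minimal sketch of the kind of njit(parallel=True)/prange CPU version being compared against (this is an assumption about its shape based on the description above, not the actual code used; array shapes follow the working example further down):

import numpy as np
from numba import njit, prange

# Hypothetical CPU version: parallelize the outer date loop across CPU cores.
@njit(parallel=True)
def cVestDiscount_cpu(multBy, discount, cv):
    for as_of_date in prange(discount.shape[0]):
        for ID in range(discount.shape[1]):
            for num in range(discount.shape[2]):
                cv[as_of_date, ID, num] = multBy[ID, num] * discount[as_of_date, ID, num]

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))
cv = np.zeros_like(discount)
cVestDiscount_cpu(multBy, discount, cv)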

Why is cuda so much slower, and what do these errors mean?

Thanks for any help!

Edit: Here is a working example of the code:

import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

I was able to run the code in cuda with the gtx 1080ti; however, it is much slower than running it in parallel or on the cpu. I've looked at other posts related to guvectorize, but none of them helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code "cuda friendly", or is just doing a multiplication across arrays too simple to see any benefit?

2 answers:

Answer 0 (score: 2)

First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and then pass that 3rd array back to the host.

It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it is possible to exceed the performance of well-written host code (e.g. using some parallelization method such as guvectorize) doing the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is too simple.
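As a rough illustration of that claim, here is a back-of-the-envelope arithmetic-intensity estimate using the array sizes from the worked example below (my own sketch, not part of the original answer):

# Two float64 inputs and one float64 result cross the PCIe bus, and each
# output element costs exactly one multiply.
elements = 100 * 4000 * 5            # 2,000,000 output elements
bytes_moved = 3 * elements * 8       # three float64 arrays, roughly 48 MB
flops = elements                     # one multiply per element
print(flops / bytes_moved)           # ~0.04 FLOP per byte transferred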

Secondly, I believe it is important to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic premise is to write the ufunc definition from the standpoint of "what will one worker do?" and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signature you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets its parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data but runs your entire ufunc definition on that slice. As some additional reading, I have covered some of this ground here as well.
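To illustrate that premise, here is a minimal CPU-target sketch (a hypothetical ufunc, not code from the question): the signature '(n)->(n)' describes what a single worker sees (one 1-D row), and passing a larger 2-D array is what spins up one worker per row:

import numpy as np
from numba import guvectorize

@guvectorize(['void(float64[:], float64[:])'], '(n)->(n)', target='cpu')
def scale_row(row, out):
    # This body is what ONE worker does to ONE slice; numba does not
    # parallelize this loop itself.
    for i in range(row.shape[0]):
        out[i] = row[i] * 2.0

data = np.arange(12, dtype=np.float64).reshape(4, 3)   # 4 slices -> 4 workers
print(scale_row(data))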

So the problem with your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even setting aside the statements above about arithmetic intensity), but I suspect that 64 is really just a minimum numba threadblock size, and that in fact only 1 thread in that threadblock is doing any useful computational work. That one thread is executing your entire ufunc, including all the for-loops, in serial fashion.

That is obviously not what anyone would intend for rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a 3rd array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here is a worked example, where t14.py is your original code run under the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of the multBy array to match cv and discount:

$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

We see that your application reported a run time of about 1.244 seconds, whereas the vectorize version reports a run time of about 0.375 seconds. There is python overhead in both of these numbers. If we look at the durations of the generated CUDA kernels in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the computation kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds in total), and that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2,000,000, exactly matching the total number of multiply operations to be performed and far more than the paltry 64 threads (and possibly, really only 1 thread) in action with the original code.

Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it is dimensionally matched to the other two arrays, and be done with it.
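For example, one way to do that replication (a sketch assuming the shapes from the worked example; np.broadcast_to plus a copy is just one option):

import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# Expand multBy along the leading axis so all three arrays have shape
# (100, 4000, 5); the copy materializes it so it can be transferred as-is.
multBy3d = np.broadcast_to(multBy, discount.shape).copy()
# cv = cVestDiscount(multBy3d, discount)   # the vectorize ufunc from t15.py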

But the question remains: how do we handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).

To tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch machinery to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. So this affects how we "slice" the arrays, so as to give multiple workers something to act on. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or countRow, dimension, we would have 100 slices, so a maximum of 100 workers. Ideally we would slice along the 2nd, or countCol, dimension. However, for simplicity, I will slice along the first, or countRow, dimension. This is clearly non-optimal, but see the worked example below for how you might approach the slicing-by-second-dimension problem. Slicing by the first dimension means we remove the first for-loop from the guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look something like this:

$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Observations:

  1. The code changes were related to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also changed the 3-dimensional arrays in the signature to two-dimensional. After all, we are taking a two-dimensional "slice" of the 3-dimensional data and letting each worker work on a slice.

  2. The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, and therefore only 1 worker, and therefore only 1 thread was needed (but numba spun up a single threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads each, to provide the needed 100 workers/threads.

  3. The kernel performance reported by the profiler, 47.4ms, is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (~0.001s). So going from 1 worker to 100 workers has sped things up considerably, but more performance gains are possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional cost of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.

Some additional comments on the python timing: I believe the exact behavior of how python compiles the necessary kernels differs between the original, vectorize, and improved guvectorize versions. If we modify the t15.py code to do a "warm-up" run, then at least the python timing is consistent, trend-wise, with the overall wall time and the kernel-only timing:

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

Now, effectively responding to a question in the comments: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"

We can be guided by what worked when slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, then remove it, similar to the previous guvectorize treatment. Here is a worked example:

$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
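If the data already exists in the original (100, 4000, 5) layout, one possible way to put the 4000 dimension first before calling the t17.py ufunc is a transpose plus a contiguous copy (my own sketch, not part of the answer):

import numpy as np

discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))
# Reorder to (4000, 100, 5) and make it contiguous so the guvectorize
# slicing sees the 4000 dimension as the leading one.
discount_t = np.ascontiguousarray(discount.transpose(1, 0, 2))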

Predictably, we observe that the execution time drops from ~47ms when slicing into 100 workers to ~9ms when slicing into 4000 workers. Similarly, we observe that numba chooses to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".

Still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for its workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall wall time of the python code is about 2x faster, even with all the python overhead.

As a final observation, let's revisit the statement I made earlier (which is similar to statements made in the comments and in the other answer):

"I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method such as guvectorize) doing the same thing."

We now have convenient test cases in t16.py and t17.py that we can use to test this. For simplicity I'll choose t16.py. We can "convert it back to CPU code" simply by removing the target designation from the guvectorize ufunc.
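A minimal sketch of that CPU-only variant, assuming it is literally t16.py with the target='cuda' designation dropped (the ~6 ms figure quoted below comes from the discussion, not from running this sketch):

import numpy as np
from numba import guvectorize
from timeit import default_timer as timer


# Same kernel as t16.py, but with no target designation, so it compiles for the CPU.
@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', nopython=True)
def cVestDiscount(multBy, discount, n, countCol, cv):
    for ID in range(0, countCol):
        for num in range(0, n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100, 4000, 5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
print("Function: discount factor cumVest duration (seconds):" + str(timer() - func_start))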

So we see that this CPU-only version runs the function in about 6 milliseconds, and it has none of the GPU "overhead" such as CUDA initialization and copying data to/from the GPU. The overall wall time is also our best measurement, at about 0.5s compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, is probably not well suited to GPU computation.

Answer 1 (score: 1)

The reason why the gufunc that Numba emits and runs is so slow becomes immediately obvious on profiling (numba 0.38.1 with CUDA 8.0):

==24691== Profiling application: python slowvec.py
==24691== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

The kernel launch that actually runs the code uses a single block of 64 threads. On a GPU which can theoretically have up to 2048 threads per MP, and which has 23 MPs, that means roughly 99.9% of the theoretical processing capacity of your GPU is not being used. This looks like a ridiculous design choice by the numba developers, and I would report it as a bug if you are being hindered by it (and it seems you are).
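The arithmetic behind that figure, using the numbers quoted above (my own sketch):

launched = 64                       # threads in the single 64-thread block
capacity = 2048 * 23                # 2048 threads/MP x 23 MPs = 47,104 threads
print(1.0 - launched / capacity)    # ~0.9986, i.e. roughly 99.9% unused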

The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect and take explicit control of the execution parameters. That way you can at least ensure that the code runs with enough threads to potentially use all the capacity of your hardware. It is still a very memory-bound operation, so the achievable speed-up may be constrained to much less than the ratio of your GPU's memory bandwidth to your CPU's. And that might well not be enough to amortize the cost of host-to-device memory transfers, so there may be no performance gain even in the best possible case, and this is far from that.
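A minimal sketch of what such a @cuda.jit rewrite might look like, with one thread per output element and explicit launch dimensions (an illustration of that suggestion, not code from the answer; shapes follow the question's working example):

import numpy as np
from numba import cuda

@cuda.jit
def cVestDiscount_kernel(multBy, discount, cv):
    # One thread per (as_of_date, ID, num) output element.
    as_of_date, ID, num = cuda.grid(3)
    if as_of_date < cv.shape[0] and ID < cv.shape[1] and num < cv.shape[2]:
        cv[as_of_date, ID, num] = multBy[ID, num] * discount[as_of_date, ID, num]

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))
cv = np.zeros_like(discount)

threads = (4, 16, 4)   # 256 threads per block
blocks = tuple((cv.shape[i] + threads[i] - 1) // threads[i] for i in range(3))
cVestDiscount_kernel[blocks, threads](multBy, discount, cv)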

In short, beware the perils of automagic, compiler-generated parallelism....

As a postscript, I managed to work out how to get the PTX of the code numba emits, and suffice to say it is absolutely dreadful (and so long that I can't actually post all of it):

{
    .reg .pred  %p<9>;
    .reg .b32   %r<8>;
    .reg .f64   %fd<4>;
    .reg .b64   %rd<137>;


    ld.param.u64    %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
    ld.param.u64    %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
    ld.param.u64    %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    ld.param.u64    %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
    ld.param.u64    %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
    ld.param.u64    %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
    ld.param.u64    %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
    ld.param.u64    %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
    ld.param.u64    %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
    ld.param.u64    %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
    ld.param.u64    %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
    ld.param.u64    %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
    ld.param.u64    %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
    ld.param.u64    %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
    ld.param.u64    %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
    ld.param.u64    %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
    ld.param.u64    %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
    ld.param.u64    %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
    ld.param.u64    %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
    ld.param.u64    %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
    ld.param.u64    %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
    ld.param.u64    %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
    ld.param.u64    %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
    ld.param.u64    %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
    ld.param.u64    %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
    ld.param.u64    %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
    ld.param.u64    %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
    mov.u32     %r1, %tid.x;
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r2, %ntid.x;
    mad.lo.s32  %r4, %r3, %r2, %r1;
    min.s64     %rd62, %rd32, %rd29;
    min.s64     %rd63, %rd39, %rd62;
    min.s64     %rd64, %rd48, %rd63;
    min.s64     %rd65, %rd51, %rd64;
    min.s64     %rd66, %rd54, %rd65;
    cvt.s64.s32 %rd1, %r4;
    setp.le.s64 %p2, %rd66, %rd1;
    @%p2 bra    BB0_8;

    ld.param.u64    %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
    ld.param.u64    %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
    ld.param.u64    %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
    ld.param.u64    %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
    ld.param.u64    %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
    ld.param.u64    %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
    cvt.u32.u64 %r5, %rd1;
    setp.lt.s32 %p1, %r5, 0;
    selp.b64    %rd67, %rd29, 0, %p1;
    add.s64     %rd68, %rd67, %rd1;
    mul.lo.s64  %rd69, %rd68, %rd121;
    add.s64     %rd70, %rd69, %rd122;
    selp.b64    %rd71, %rd48, 0, %p1;
    add.s64     %rd72, %rd71, %rd1;
    mul.lo.s64  %rd73, %rd72, %rd123;
    add.s64     %rd74, %rd73, %rd124;
    ld.u64  %rd2, [%rd74];
    selp.b64    %rd75, %rd51, 0, %p1;
    add.s64     %rd76, %rd75, %rd1;
    mul.lo.s64  %rd77, %rd76, %rd125;
    add.s64     %rd78, %rd77, %rd126;
    ld.u64  %rd3, [%rd78];
    ld.u64  %rd4, [%rd70];
    setp.lt.s64 %p3, %rd4, 1;
    @%p3 bra    BB0_8;

    ld.param.u64    %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
    ld.param.u64    %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    selp.b64    %rd80, %rd127, 0, %p1;
    mov.u64     %rd79, 0;
    min.s64     %rd81, %rd128, %rd79;
    min.s64     %rd82, %rd34, %rd79;
    selp.b64    %rd83, %rd39, 0, %p1;
    min.s64     %rd84, %rd40, %rd79;
    min.s64     %rd85, %rd41, %rd79;
    min.s64     %rd86, %rd42, %rd79;
    selp.b64    %rd87, %rd54, 0, %p1;
    min.s64     %rd88, %rd55, %rd79;
    min.s64     %rd89, %rd56, %rd79;
    min.s64     %rd90, %rd57, %rd79;
    mul.lo.s64  %rd91, %rd90, %rd61;
    add.s64     %rd92, %rd53, %rd91;
    mul.lo.s64  %rd93, %rd89, %rd60;
    add.s64     %rd94, %rd92, %rd93;
    mul.lo.s64  %rd95, %rd88, %rd59;
    add.s64     %rd96, %rd94, %rd95;
    add.s64     %rd98, %rd87, %rd1;
    mul.lo.s64  %rd99, %rd58, %rd98;
    add.s64     %rd5, %rd96, %rd99;
    mul.lo.s64  %rd100, %rd86, %rd46;
    add.s64     %rd101, %rd38, %rd100;
    mul.lo.s64  %rd102, %rd85, %rd45;
    add.s64     %rd103, %rd101, %rd102;
    mul.lo.s64  %rd104, %rd84, %rd44;
    add.s64     %rd105, %rd103, %rd104;
    add.s64     %rd106, %rd83, %rd1;
    mul.lo.s64  %rd107, %rd43, %rd106;
    add.s64     %rd6, %rd105, %rd107;
    mul.lo.s64  %rd108, %rd82, %rd37;
    add.s64     %rd109, %rd31, %rd108;
    mul.lo.s64  %rd110, %rd81, %rd36;
    add.s64     %rd111, %rd109, %rd110;
    add.s64     %rd112, %rd80, %rd1;
    mul.lo.s64  %rd113, %rd35, %rd112;
    add.s64     %rd7, %rd111, %rd113;
    add.s64     %rd8, %rd2, 1;
    mov.u64     %rd131, %rd79;

BB0_3:
    mul.lo.s64  %rd115, %rd59, %rd131;
    add.s64     %rd10, %rd5, %rd115;
    mul.lo.s64  %rd116, %rd44, %rd131;
    add.s64     %rd11, %rd6, %rd116;
    setp.lt.s64 %p4, %rd3, 1;
    mov.u64     %rd130, %rd79;
    mov.u64     %rd132, %rd3;
    @%p4 bra    BB0_7;

BB0_4:
    mov.u64     %rd13, %rd132;
    mov.u64     %rd12, %rd130;
    mul.lo.s64  %rd117, %rd60, %rd12;
    add.s64     %rd136, %rd10, %rd117;
    mul.lo.s64  %rd118, %rd45, %rd12;
    add.s64     %rd135, %rd11, %rd118;
    mul.lo.s64  %rd119, %rd36, %rd12;
    add.s64     %rd134, %rd7, %rd119;
    setp.lt.s64 %p5, %rd2, 1;
    mov.u64     %rd133, %rd8;
    @%p5 bra    BB0_6;

BB0_5:
    mov.u64     %rd17, %rd133;
    ld.f64  %fd1, [%rd135];
    ld.f64  %fd2, [%rd134];
    mul.f64     %fd3, %fd2, %fd1;
    st.f64  [%rd136], %fd3;
    add.s64     %rd136, %rd136, %rd61;
    add.s64     %rd135, %rd135, %rd46;
    add.s64     %rd134, %rd134, %rd37;
    add.s64     %rd24, %rd17, -1;
    setp.gt.s64 %p6, %rd24, 1;
    mov.u64     %rd133, %rd24;
    @%p6 bra    BB0_5;

BB0_6:
    add.s64     %rd25, %rd13, -1;
    add.s64     %rd26, %rd12, 1;
    setp.gt.s64 %p7, %rd13, 1;
    mov.u64     %rd130, %rd26;
    mov.u64     %rd132, %rd25;
    @%p7 bra    BB0_4;

BB0_7:
    sub.s64     %rd120, %rd4, %rd131;
    add.s64     %rd131, %rd131, 1;
    setp.gt.s64 %p8, %rd120, 1;
    @%p8 bra    BB0_3;

BB0_8:
    ret;
}

All of those integer operations to perform exactly one double precision multiply!