Question

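I'm running some GPU benchmarks to understand how to maximize memory bandwidth to and from global memory. I have a 128 MB array (32 * 1024 * 1024 single-precision floating-point numbers) aligned to a 128-byte boundary, with three halo values before and after the actual data. So the first element of the array is aligned to a 128-byte boundary.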

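In the following, n refers to the number of elements in my array (excluding the halo): n = 32*1024*1024. m refers to the number of 128-byte words in the array: m = 1024*1024 = 1048576.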

*array     // Aligned to a 128-bytes boundary
*(array-3) // Start of the (unaligned) halo region
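The layout described above can be sketched on the host roughly as follows. This is only an illustration under stated assumptions: the names `kN`, `kM`, `kHalo`, `kPadFloats`, `HaloLayout` and `place_array` are hypothetical, and reserving one full 128-byte line in front of the data to hold the leading halo is an assumption, since the question does not say how the buffer was actually allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "the sizes below assume 4-byte floats");

constexpr std::size_t kLineBytes = 128;                         // alignment boundary
constexpr std::size_t kN = 32u * 1024u * 1024u;                 // elements, halo excluded
constexpr std::size_t kM = kN * sizeof(float) / kLineBytes;     // number of 128-byte words
constexpr std::size_t kHalo = 3;                                // halo values on each side
constexpr std::size_t kPadFloats = kLineBytes / sizeof(float);  // one padding line (assumption)

// Floats to allocate: leading padding line + data + trailing halo.
constexpr std::size_t kTotalFloats = kPadFloats + kN + kHalo;

struct HaloLayout {
    float* halo_begin;  // corresponds to (array-3)
    float* array;       // corresponds to array, aligned to 128 bytes
};

// Given a 128-byte-aligned base pointer, place the data one line in,
// so the leading halo values sit just before an aligned address.
inline HaloLayout place_array(float* base) {
    assert(reinterpret_cast<std::uintptr_t>(base) % kLineBytes == 0);
    float* array = base + kPadFloats;
    return HaloLayout{array - kHalo, array};
}
```

With these constants, `kM` evaluates to 1048576, which is the value of m used throughout the question.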

I also have a similar output array, which is aligned to the same boundary and does not contain a halo.

I have several kernels that implement similar computations with different access patterns:

P1: *(output+i) = *(array+i) // for i in 0..n
P2: *(output+i) = *(array+i) + *(array+i+1)
P3: *(output+i) = *(array+i-1) + *(array+i+1)

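All of these computations are obviously bandwidth-limited. I'm trying to optimize the global memory transactions. The code I'm using is very simple: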

__global__ void P1(const float* input, float* output)
{
    const int i = threadIdx.x + blockDim.x*blockIdx.x;
    *(output+i) = *(input+i);
}

__global__ void P2(const float* input, float* output)
{
    const int i = threadIdx.x + blockDim.x*blockIdx.x;
    *(output+i) = *(input+i) + *(input+i+1);
}

__global__ void P3(const float* input, float* output)
{
    const int i = threadIdx.x + blockDim.x*blockIdx.x;
    *(output+i) = *(input+i-1) + *(input+i+1);
}
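For checking the kernels' output on a small input, a host-side reference of the three patterns could look like the sketch below. The function names `p1_ref`, `p2_ref` and `p3_ref` are hypothetical and not part of the benchmark; the sketch assumes `array` points at the first real element, so `array[-1]` and `array[n]` fall in the halo.

```cpp
#include <cassert>
#include <cstddef>

// P1: plain copy.
inline void p1_ref(const float* array, float* output, std::ptrdiff_t n) {
    assert(array != nullptr && output != nullptr);
    for (std::ptrdiff_t i = 0; i < n; ++i) output[i] = array[i];
}

// P2: each element plus its right neighbour (reads one halo value at i = n-1).
inline void p2_ref(const float* array, float* output, std::ptrdiff_t n) {
    assert(array != nullptr && output != nullptr);
    for (std::ptrdiff_t i = 0; i < n; ++i) output[i] = array[i] + array[i + 1];
}

// P3: left plus right neighbour (reads one halo value on each side).
// A signed index is used so that i - 1 is a valid negative offset at i = 0.
inline void p3_ref(const float* array, float* output, std::ptrdiff_t n) {
    assert(array != nullptr && output != nullptr);
    for (std::ptrdiff_t i = 0; i < n; ++i) output[i] = array[i - 1] + array[i + 1];
}
```

Comparing the device output against these loops on a few hundred elements is usually enough to catch an off-by-one in the halo handling.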

I use 1024 threads per block and the right number of blocks, so that each thread is assigned exactly one value of the output array.
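The launch configuration implied here can be worked out as in the following sketch. The helper `blocks_for` and the constant names are hypothetical; the launch line in the trailing comment is only an illustration of how the values would be used, not the exact code from the benchmark.

```cpp
#include <cstddef>

// Number of blocks needed so that threads_per_block * blocks >= n.
constexpr std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

constexpr std::size_t kN = 32u * 1024u * 1024u;   // elements, halo excluded
constexpr std::size_t kThreadsPerBlock = 1024;
constexpr std::size_t kBlocks = blocks_for(kN, kThreadsPerBlock);

// n is an exact multiple of the block size, so no thread falls outside the
// array and the kernels need no bounds check.
static_assert(kBlocks * kThreadsPerBlock == kN,
              "each thread maps to exactly one output element");

// In CUDA the launch would then look something like:
//   P1<<<kBlocks, kThreadsPerBlock>>>(input, output);
```

For n = 32*1024*1024 this gives 32768 blocks of 1024 threads.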

I compiled with both the caching and non-caching options (-Xptxas -dlcm={ca,cg}) and benchmarked with nvprof, extracting the following metrics:

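ldst_issued: issued load/store instructions
ldst_executed: executed load/store instructions
gld_transactions: global load transactions
gst_transactions: global store transactions
dram_read_throughput: device memory read throughput
dram_write_throughput: device memory write throughput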

The GPU I'm using is an Nvidia K20X.

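I expect ldst_executed to be exactly (k+1) * m, where k is 1 for P1, 2 for P2 and 3 for P3, and represents the number of values read by each thread. I also expect gst_transactions to be m (coalesced access: each warp writes 128 bytes), and the load transactions to be m for P1, somewhere between m and 2m for P2, and somewhere between m and 3m for P3, because each warp has to access its "assigned" part of memory as in P1, plus the following 128 bytes for P2, plus the previous 128 bytes for P3. I'm not sure the warp is the right unit here, though. I expect some threads to be able to avoid a global memory access because the data has already been fetched into the L1 cache by a previous thread.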
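The (k+1) * m expectation can be written out as a small calculation. This is a sketch that assumes the counter is incremented once per warp for each load or store instruction; the names `kWarpSize` and `expected_ldst_executed` are hypothetical. Note that n / 32 equals m here only because a warp of 32 floats covers exactly 128 bytes.

```cpp
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // threads per warp

// Warp-level instruction count: every warp executes k loads and one store.
constexpr std::size_t expected_ldst_executed(std::size_t n, std::size_t k) {
    return (n / kWarpSize) * (k + 1);
}

constexpr std::size_t kN = 32u * 1024u * 1024u;  // elements, halo excluded

// k = 1, 2, 3 give 2m, 3m and 4m respectively, with m = 1048576.
static_assert(expected_ldst_executed(kN, 1) == 2097152, "2m");
static_assert(expected_ldst_executed(kN, 2) == 3145728, "3m");
static_assert(expected_ldst_executed(kN, 3) == 4194304, "4m");
```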

The results are as follows:

P1:

     gld_transactions   1048576
     gst_transactions   1048576
          ldst_issued   2097152
        ldst_executed   2097152
 dram_read_throughput   92.552 GB/s
dram_write_throughput   93.067 GB/s

P2:

     gld_transactions   3145728
     gst_transactions   1048576
          ldst_issued   5242880
        ldst_executed   3145728
 dram_read_throughput   80.748 GB/s
dram_write_throughput   79.878 GB/s

P3:

     gld_transactions   5242880
     gst_transactions   1048576
          ldst_issued   8052318
        ldst_executed   4194304
 dram_read_throughput   79.693 GB/s
dram_write_throughput   78.510 GB/s

I already see some discrepancies:

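The number of load transactions increases considerably from P1 to P2 and P3.
The number of issued load/store instructions is also very high in P2 and P3, beyond what I can explain. I'm not sure I understand what this number represents.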
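One way to quantify the gap between the two instruction counters is the ratio of issued to executed instructions, sketched below with a hypothetical helper `issue_ratio`. A ratio above 1 is commonly read as instructions being issued more than once (replays), but whether that is what happens here is an assumption the measurements alone do not confirm.

```cpp
#include <cassert>

// Ratio of issued to executed load/store instructions.
// 1.0 means every executed instruction was issued exactly once.
inline double issue_ratio(unsigned long long issued, unsigned long long executed) {
    assert(executed > 0);
    return static_cast<double>(issued) / static_cast<double>(executed);
}
```

Applied to the cached results above, `issue_ratio(2097152, 2097152)` is 1.0 for P1, `issue_ratio(5242880, 3145728)` is about 1.67 for P2, and `issue_ratio(8052318, 4194304)` is about 1.92 for P3.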

When I switch to the non-cached tests, these are the results:

P1:

     gld_transactions   1048576
     gst_transactions   1048576
          ldst_issued   2097152
        ldst_executed   2097152
 dram_read_throughput   92.577 GB/s
dram_write_throughput   93.079 GB/s

P2:

     gld_transactions   3145728
     gst_transactions   1048576
          ldst_issued   5242880
        ldst_executed   3145728
 dram_read_throughput   80.857 GB/s
dram_write_throughput   79.959 GB/s

P3:

     gld_transactions   5242880
     gst_transactions   1048576
          ldst_issued   8053556
        ldst_executed   4194304
 dram_read_throughput   79.661 GB/s
dram_write_throughput   78.484 GB/s

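As you can see, nothing changes. I was expecting to see some differences, since in the non-cached case the L1 cache is bypassed, but transactions happen in 32-byte words.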
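The difference between 128-byte and 32-byte transactions can be made concrete with a small counting model. This is a sketch under stated assumptions, not a description of how nvprof actually counts: it assumes a warp loads 32 consecutive 4-byte floats, that the warp's base address is aligned to the transaction size, and that one transaction is counted per distinct segment touched. The names `floor_div` and `segments_touched` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Floor division for a positive divisor (plain '/' truncates toward zero).
inline long floor_div(long a, long b) {
    return a >= 0 ? a / b : -((-a + b - 1) / b);
}

// Distinct `granularity`-byte segments touched by one warp-level load of
// 32 consecutive floats that starts `offset` elements from an aligned base.
inline std::size_t segments_touched(long offset, long granularity) {
    assert(granularity > 0);
    const long elem_bytes = 4;   // sizeof(float) on the device
    const long warp_size = 32;   // threads per warp
    const long first_byte = offset * elem_bytes;
    const long last_byte = (offset + warp_size - 1) * elem_bytes + (elem_bytes - 1);
    return static_cast<std::size_t>(
        floor_div(last_byte, granularity) - floor_div(first_byte, granularity) + 1);
}
```

Under this model, a load at offset 0 touches one 128-byte line, while loads at offsets -1 and +1 each touch two. For P2 that is 1 + 2 = 3 lines per warp, or 3m in total, which equals the reported gld_transactions of 3145728. With 32-byte segments the same loads would touch 4 and 5 segments respectively, so if the counter followed the 32-byte model the cached and non-cached numbers would be expected to differ.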

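Questions:

Does my approach sound completely off?
Can shared memory help me reduce the amount of transfers?
Why don't I see a substantial difference between the cached and the non-cached cases?
Why is P3 not slower than P2 in the same way that P2 is slower than P1?
What other metrics would help me understand what is actually happening?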

How to interpret nvprof results for CUDA bandwidth-limited kernels?

0 answers: