Why does cudaMemcpy take so much time?

Date: 2014-06-04 12:36:00

Tags: c++ cuda

I am writing a CUDA program. After profiling a function that essentially does dot products over large matrices, I got the following API call summary:

==27530== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 64.90%  2.25369s        23  97.986ms  9.5590us  1.79533s  cudaMemcpy
 21.04%  730.65ms      1422  513.82us  3.0050us  21.028ms  cudaLaunch
  8.72%  302.72ms         5  60.543ms     477ns  170.92ms  cudaFree
  3.64%  126.54ms        18  7.0298ms  4.8882ms  35.518ms  cudaMallocHost
  1.39%  48.292ms        16  3.0182ms  3.0076ms  3.0601ms  cudaFreeHost
  0.11%  3.9026ms        23  169.68us  64.314us  1.7771ms  cudaMalloc
  0.09%  3.0171ms     17661     170ns     144ns  3.1750us  cudaSetupArgument
  0.04%  1.3514ms       810  1.6680us  1.4000us  9.9270us  cudaBindTexture
  0.02%  569.60us       810     703ns     596ns  4.8010us  cudaUnbindTexture
  0.02%  556.24us       945     588ns     484ns  4.2560us  cudaFuncSetCacheConfig
  0.01%  499.67us      1422     351ns     163ns  198.52us  cudaConfigureCall
  0.01%  256.21us      1310     195ns     150ns     335ns  cudaGetLastError
  0.01%  238.26us       166  1.4350us     165ns  49.141us  cuDeviceGetAttribute
  0.01%  175.44us       945     185ns     157ns     755ns  cudaPeekAtLastError
  0.00%  50.787us         2  25.393us  16.700us  34.087us  cuDeviceGetName
  0.00%  45.330us         2  22.665us  19.024us  26.306us  cuDeviceTotalMem
  0.00%  43.289us         2  21.644us  13.641us  29.648us  cudaMemset
  0.00%  43.029us         2  21.514us  14.059us  28.970us  cudaGetDeviceProperties
  0.00%  13.931us        12  1.1600us     339ns  5.5310us  cudaGetDevice
  0.00%  3.4750us         1  3.4750us  3.4750us  3.4750us  cudaDeviceSynchronize
  0.00%  1.5320us         1  1.5320us  1.5320us  1.5320us  cuDriverGetVersion
  0.00%  1.2690us         3     423ns     241ns     753ns  cuDeviceGetCount
  0.00%  1.0080us         1  1.0080us  1.0080us  1.0080us  cuInit
  0.00%  1.0060us         3     335ns     314ns     377ns  cuDeviceGet

It shows that `cudaMemcpy` took more than two seconds. But there are only a few cudaMemcpy calls in my code, and all of the D->H and H->D copies use pinned memory. I don't see why my cudaMemcpy calls should take that much time.
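For reference, the copies follow the usual pinned-memory pattern, something like this (a simplified sketch; the buffer names and size calculation are illustrative, not taken from my actual code):

// Illustrative sketch: pinned (page-locked) host buffer plus a plain H->D copy.
// Names and sizes are examples only.
float *h_feature, *d_feature;
size_t bytes = width * height * cell_size * sizeof(float);
cudaMallocHost((void**)&h_feature, bytes);   // pinned host allocation
cudaMalloc((void**)&d_feature, bytes);
/* ... fill h_feature on the host ... */
cudaMemcpy(d_feature, h_feature, bytes, cudaMemcpyHostToDevice);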

The function that consumes most of the time:

==27530== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 74.35%  2.34598s       112  20.946ms  20.743ms  21.161ms  knl_convolve_filter(float*, float*, int, int, int, float*)

And the function itself:

__global__ void knl_convolve_filter(float *feature, float *filter, int width, int height, int cell_size, float *convolution) {
    int x =  blockDim.x * blockIdx.x + threadIdx.x;
    int y =  blockDim.y * blockIdx.y + threadIdx.y;

    if( x < width && y < height) {
        if( x & 1) {
            //odd, imaginary part
            float sum = 0.0f;
            size_t offset = (y * width + x - 1) * cell_size ;
            for(int i = 0, total_cell_size = cell_size * 2; i < total_cell_size ; i += 2) {
                float y = *(feature + offset + i) * *(filter + offset + i + 1) + *(feature + offset + i + 1) * *(filter + offset + i);
                sum += y;
            }
            *(convolution + y * width + x) = sum;
        } else {
            //even, real part
            float sum = 0.0f;
            size_t offset = (y * width + x) * cell_size ;
            for(int i = 0, total_cell_size = cell_size * 2; i < total_cell_size ; i += 2) {
                float x = *(feature + offset + i) * *(filter + offset + i) - *(feature + offset + i + 1) * *(filter + offset + i + 1);
                sum += x;
            }
            *(convolution + y * width + x) = sum;
        }

    }
}
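For completeness, the kernel is launched over a 2D grid covering width x height threads, roughly like this (the 16x16 block size is illustrative rather than my exact configuration, and the d_* names are placeholder device pointers):

// Illustrative launch configuration for the kernel above.
dim3 block(16, 16);                                   // example block size
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
knl_convolve_filter<<<grid, block>>>(d_feature, d_filter, width, height, cell_size, d_convolution);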

I am using a GTX 760 (CC 3.0) on 64-bit Fedora 19 with CUDA 6.0. Am I making some big mistake here?

1 Answer:

Answer 0 (score: 3)

It is hard to give a definitive answer because no host code is shown, but the profile you posted appears to contain a single very slow cudaMemcpy call that consumed 1.79533 seconds on its own. The other 22 calls average only about 20 ms each. So the real question is "why did that particular cudaMemcpy call take 1.79533 seconds?", and I suspect the answer is that it absorbed a large amount of the lazy setup latency inside the CUDA runtime API.
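If that is what is happening, it is straightforward to check: force the context to be established with a cheap runtime call at startup, then time an individual copy yourself. A minimal sketch of the idea (the buffer names and size are placeholders):

// Pay the lazy-initialisation cost up front so it is not charged to the first copy.
cudaFree(0);                                  // any runtime API call will trigger context creation

// Time a single copy in isolation with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // placeholder buffers
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);       // should now reflect only the transfer itself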

The nvprof profiling utility shipped with modern versions of the CUDA toolkit has options for emitting a detailed API timeline. Analysing that timeline would answer your question definitively, but without host code or an API trace, this is about as specific an answer as can be given.
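For example, with the nvprof that ships with CUDA 6.0, something along these lines should print a per-call API trace (the executable name is a placeholder, and the exact options may vary between toolkit versions):

nvprof --print-api-trace ./your_program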