JCuda: Copying a multidimensional array from device to host

Time: 2013-08-27 15:04:33

Tags: cuda jcuda

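I have been working with JCuda for several months, and I cannot copy a multidimensional array from device memory to host memory. Interestingly, I have no problems in the opposite direction (I can invoke my kernel with multidimensional arrays, and everything works with the correct values).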

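In short, I put the kernel's results in a two-dimensional array of shorts, where the first dimension is the number of threads, so that each thread can write to a different location.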

Here is an example:

CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter

// Invoke kernel with pointer_dev as parameter. Now it should contain some results

CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel

cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // It seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction

What am I doing wrong?

Edit

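Apparently, there are limitations that do not allow device-allocated memory to be copied back to the host, as described in this thread (and many others): link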

Is there any workaround? I am using CUDA Toolkit v5.0.

1 answer:

Answer 0 (score: 3)

Here we copy a two-dimensional array of integers from the device to the host.

  1. First, create a host-side array of device pointers with one entry per row (here blockSizeX), and allocate device memory for each row.

    CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
    for (int i = 0; i < blockSizeX; i++)
    {
        hostDevicePointers[i] = new CUdeviceptr();
        cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
    }
    
  2. Allocate device memory for the array of pointers that point to the rows, and copy the row pointers from the host to the device.

    CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
    cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
    cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);
    
  3. Launch the kernel.

    kernelLauncher.call(........, hostDevicePointersArray);
    
  4. Transfer the output from the device to the host.

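    // Assumes each row holds size ints, matching the per-row allocation in step 1
    int hostOutputData[][] = new int[blockSizeX][size];
    int sum = 0;
    for (int i = 0; i < blockSizeX; i++)
    {
        // Each row has its own device allocation, so copy it back row by row
        cuMemcpyDtoH(Pointer.to(hostOutputData[i]), hostDevicePointers[i], size * Sizeof.INT);
        for (int j = 0; j < size; j++)
        {
            sum = sum + hostOutputData[i][j];
        }
    }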
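
One possible workaround that avoids the array of device pointers altogether is to keep all rows in a single contiguous buffer and address element (row, col) as row * cols + col. With that layout, a single cuMemAlloc of rows * cols * Sizeof.INT bytes and a single cuMemcpyDtoH should be enough to bring back the whole result. The sketch below shows only the host-side index arithmetic in plain Java. FlatLayout and its methods are hypothetical names, not part of JCuda, and the sketch assumes that every row has the same length.

```java
import java.util.Arrays;

// Hypothetical helper (not part of JCuda): row-major layout for a rectangular 2D int array.
class FlatLayout {
    // Position of element (row, col) in a flat buffer that stores `cols` elements per row.
    static int index(int row, int col, int cols) {
        return row * cols + col;
    }

    // Pack a rectangular 2D array into one contiguous array, e.g. before a single cuMemcpyHtoD.
    static int[] flatten(int[][] data) {
        int rows = data.length;
        int cols = rows == 0 ? 0 : data[0].length;
        int[] flat = new int[rows * cols];
        for (int r = 0; r < rows; r++) {
            System.arraycopy(data[r], 0, flat, r * cols, cols);
        }
        return flat;
    }

    // Rebuild the 2D view, e.g. after a single cuMemcpyDtoH has filled the flat host array.
    static int[][] unflatten(int[] flat, int rows, int cols) {
        int[][] data = new int[rows][];
        for (int r = 0; r < rows; r++) {
            data[r] = Arrays.copyOfRange(flat, r * cols, (r + 1) * cols);
        }
        return data;
    }

    public static void main(String[] args) {
        int[][] perThread = { {1, 2, 3}, {4, 5, 6} }; // 2 threads, 3 values each
        int[] flat = flatten(perThread);
        System.out.println(Arrays.toString(flat));  // prints [1, 2, 3, 4, 5, 6]
        System.out.println(flat[index(1, 2, 3)]);   // prints 6
    }
}
```

In the kernel, thread t would then write its k-th value to position t * cols + k of the one buffer instead of following a per-thread pointer.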