Reducing matrix rows in OpenCL

Date: 2012-11-14 18:01:36

Tags: matrix sum opencl gpgpu reduction

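I have a matrix stored as a one-dimensional array on the GPU, and I am trying to write an OpenCL kernel that performs a reduction over each row of that matrix. For example:

Suppose my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6]. What I want is: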

[1, 2, 3] = [ 6]
[4, 5, 6]   [15]

Of course, since this is a reduction, each row may come back as more than one element:

[1, 2, 3] = [3, 3]
[4, 5, 6]   [9, 6]

I can then do the final calculation in another kernel or on the CPU.
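For the CPU variant of that last step, a minimal sketch in plain C might look like the following. The function name finish_row_sums and the row-major layout of the partial sums (partialsPerRow values per row) are assumptions made for illustration; they do not come from the code in this question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: collapses the partial sums of each row into one value.
   partials holds rows * partialsPerRow floats, stored row by row (assumed layout). */
void finish_row_sums(const float *partials, size_t rows,
                     size_t partialsPerRow, float *rowSums)
{
    assert(partials != NULL && rowSums != NULL);

    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        /* Add up every partial sum that belongs to row r. */
        for (size_t p = 0; p < partialsPerRow; ++p)
            acc += partials[r * partialsPerRow + p];
        rowSums[r] = acc;
    }
}
```

With the partial results [3, 3] and [9, 6] from the example above, this would produce 6 and 15.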

So far, what I have is a kernel that reduces over all the elements of the array, like this:

[1, 2, 3] = [21]
[4, 5, 6]

The kernel that does the actual reduction is this one (I actually got it from Stack Overflow):

__kernel void
sum2(__global float *inVector, __global float *outVector,
     const unsigned int inVectorSize, __local float *resultScratch)
{
  const unsigned int localId = get_local_id(0);
  const unsigned int workGroupSize = get_local_size(0);

  // Each work-item loads one element; work-items past the end contribute 0.
  if (get_global_id(0) < inVectorSize)
    resultScratch[localId] = inVector[get_global_id(0)];
  else
    resultScratch[localId] = 0;

  // Tree reduction in local memory. The halving scheme assumes that the
  // work-group size is a power of two.
  for (unsigned int a = workGroupSize >> 1; a > 0; a >>= 1)
  {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (a > localId)
      resultScratch[localId] += resultScratch[localId + a];
  }

  // Work-item 0 writes one partial sum per work-group.
  if (localId == 0)
    outVector[get_group_id(0)] = resultScratch[0];
  barrier(CLK_LOCAL_MEM_FENCE);
}

1 Answer:

Answer 0 (score: 0)

I think one solution is to modify the reduction kernel so that it can reduce just a part of the array.

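__kernel void
sum2(__global float *inVector,
     __global float *outVector,
     unsigned int   inVectorOffset,
     unsigned int   inVectorSize,
     __local float  *resultScratch)
{
  const unsigned int localId = get_local_id(0);
  const unsigned int workGroupSize = get_local_size(0);

  // Load one element of the selected part, starting at inVectorOffset;
  // work-items past the end of the part contribute 0.
  if (get_global_id(0) < inVectorSize)
    resultScratch[localId] = inVector[inVectorOffset + get_global_id(0)];
  else
    resultScratch[localId] = 0;

  // Tree reduction in local memory. The halving scheme assumes that the
  // work-group size is a power of two.
  for (unsigned int a = workGroupSize >> 1; a > 0; a >>= 1)
  {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (a > localId)
      resultScratch[localId] += resultScratch[localId + a];
  }

  // Work-item 0 writes one partial sum per work-group. The index is the
  // group id only, so every launch writes starting at outVector[0].
  if (localId == 0)
    outVector[get_group_id(0)] = resultScratch[0];
  barrier(CLK_LOCAL_MEM_FENCE);
}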

You can then reduce one row of the matrix by passing the start of the row as inVectorOffset and the number of elements in the row as inVectorSize.
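To make the per-row use of those two parameters concrete, here is a rough CPU emulation in plain C. It is only a sketch: it is not OpenCL host code, and the names sum2_group, reduce_rows and MAX_GROUP_SIZE are invented for illustration. It also assumes that the partial sums of each row go to their own region of the output (row-major); the kernel above always writes from outVector[0], so a real host program would presumably need a separate output buffer or region per launch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUP_SIZE 256  /* assumed upper bound, only for this sketch */

/* Emulates what one work-group of sum2 computes for one part of inVector. */
float sum2_group(const float *inVector, size_t inVectorOffset,
                 size_t inVectorSize, size_t groupId, size_t groupSize)
{
    float scratch[MAX_GROUP_SIZE];
    assert(groupSize > 0 && groupSize <= MAX_GROUP_SIZE);
    assert((groupSize & (groupSize - 1)) == 0);  /* halving needs a power of two */

    /* Load phase: one element per work-item, zero past the end of the part. */
    for (size_t localId = 0; localId < groupSize; ++localId) {
        size_t globalId = groupId * groupSize + localId;
        scratch[localId] = (globalId < inVectorSize)
                         ? inVector[inVectorOffset + globalId] : 0.0f;
    }

    /* Tree reduction: each pass folds the upper half onto the lower half. */
    for (size_t a = groupSize >> 1; a > 0; a >>= 1)
        for (size_t localId = 0; localId < a; ++localId)
            scratch[localId] += scratch[localId + a];

    return scratch[0];
}

/* One emulated launch per row: offset = start of the row, size = row length.
   partials must hold rows * ceil(cols / groupSize) floats (assumed layout). */
void reduce_rows(const float *matrix, size_t rows, size_t cols,
                 size_t groupSize, float *partials)
{
    size_t groupsPerRow = (cols + groupSize - 1) / groupSize;
    for (size_t r = 0; r < rows; ++r)
        for (size_t g = 0; g < groupsPerRow; ++g)
            partials[r * groupsPerRow + g] =
                sum2_group(matrix, r * cols, cols, g, groupSize);
}
```

For the 2x3 matrix from the question and a group size of 2, this yields the partial sums [3, 3] and [9, 6]; with a group size of 4, each row fits in a single group and the results are 6 and 15.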