Question

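I am trying to optimize the computation of the mean of each row of a 512w x 1024h image, and then subtract that mean from the row it was computed from. I wrote a kernel that runs in 1.86 ms, but I would like to make it faster. The code works correctly, but it does not use shared memory and it relies on for loops, which I would like to get rid of.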

__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {

  // height = 1024, width = 512

  int tidy = threadIdx.x + blockDim.x * blockIdx.x; 

  float sum = 0.0f; 
  float sumDiv = 0.0f; 

  if(tidy < height) { 

      for(int c = 0; c < width; c++) { 

          sum += img[tidy*width + c];
      }
      sumDiv = (sum/width)/2;

      //__syncthreads(); 

      for(int cc = 0; cc < width; cc++) { 

          lineImg[tidy*width + cc] = img[tidy*width + cc] - sumDiv;
      }

  }

  __syncthreads();
}

I launch the kernel above with:

subtractMean <<< 2, 512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);

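I also wrote the following version, which uses shared memory as an optimization. However, it does not work as expected. Any ideas about what the problem might be?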

__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {

  extern __shared__ float perRow[];

  int idx = threadIdx.x;    // set idx along x
  int stride = width/2; 

  while(idx < width) { 
      perRow[idx] = 0; 
      idx += stride; 
  }

  __syncthreads(); 

  int tidx = threadIdx.x;   // set idx along x
  int tidy = blockIdx.x;    // set idx along y

  if(tidy < height) { 
      while(tidx < width) { 
          perRow[tidx] = img[tidy*width + tidx];
          tidx += stride; 
      }
  }

  __syncthreads(); 

  tidx = threadIdx.x;   // reset idx along x
  tidy = blockIdx.x;    // reset idx along y

  if(tidy < height) { 

      float sumAllPixelsInRow = 0.0f; 
      float sumDiv = 0.0f; 

      while(tidx < width) { 
          sumAllPixelsInRow += perRow[tidx];
          tidx += stride;
      }
      sumDiv = (sumAllPixelsInRow/width)/2;

      tidx = threadIdx.x;   // reset idx along x

      while(tidx < width) { 

          lineImg[tidy*width + tidx] = img[tidy*width + tidx] - sumDiv; 
          tidx += stride;
      }
  }

  __syncthreads();  
}

The shared-memory kernel is launched with:

subtractMean <<< 1024, 256, sizeof(float)*512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);

Answer 1

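Two blocks are nowhere near enough to saturate the GPU. You are heading in the right direction by using more blocks. However, since you are on Kepler, I would like to offer an option that does not use shared memory at all.

Start with 32 threads per block (this can be changed later by using 2D blocks). With those 32 threads, you should do something along these lines: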

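int rowID  = blockIdx.x;    // one block (a single warp) per row
int stride = blockDim.x;    // 32 threads
int index  = threadIdx.x;
float sum  = 0.0f;
// each thread accumulates a strided partial sum of its row
while (index < width) {
    sum += img[width * rowID + index];
    index += stride;
}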

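At this point you have 32 threads, each holding a partial sum. Next, these need to be added together. Because we are inside a single warp, you can do this without shared memory by using a shuffle reduction. See here for details: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ What you want is the shuffle warp reduce, but you need to change it to use the full 32 threads.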
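As a rough illustration of the shuffle reduction described above, here is a minimal sketch. It assumes CUDA 9 or later, where `__shfl_down_sync` replaces the older `__shfl_down` intrinsic available on Kepler-era toolkits. The names `warpReduceSum`, `warpSumDemo` and `sumOf32OnGpu` are made up for this example, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sum `val` across the 32 lanes of a warp; lane 0 ends up with the total.
// Assumption: all 32 lanes of the warp are active (full mask).
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Hypothetical demo kernel: a single warp sums 32 input values.
__global__ void warpSumDemo(const float *in, float *out) {
    float total = warpReduceSum(in[threadIdx.x]);
    if (threadIdx.x == 0) {
        *out = total;  // only lane 0 holds the complete sum
    }
}

// Hypothetical host helper: returns the sum of exactly 32 floats.
float sumOf32OnGpu(const std::vector<float> &values) {
    float *dIn = nullptr;
    float *dOut = nullptr;
    float result = 0.0f;
    cudaMalloc(&dIn, 32 * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, values.data(), 32 * sizeof(float), cudaMemcpyHostToDevice);
    warpSumDemo<<<1, 32>>>(dIn, dOut);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

After the loop, only lane 0 is guaranteed to hold the full total; the other lanes hold partial results, which is why a broadcast is still needed afterwards.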

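Now that thread 0 of each warp holds the sum of its row, you can divide it by the width (cast to float) and then broadcast the result to the rest of the warp with shfl, using shfl(average, 0);. http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-description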

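With the average found and the warp synchronized, both implicitly and explicitly (through shfl), you can carry on with the subtraction in a similar fashion.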
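Putting the steps so far together, a kernel could look roughly like the sketch below. It is only one possible reading of the approach, assuming CUDA 9 or later (`__shfl_down_sync` and `__shfl_sync` stand in for the older `__shfl_down` and `__shfl`) and a launch with exactly 32 threads per block. The names `subtractMeanWarp` and `subtractRowMeansOnGpu` are hypothetical. Note that it subtracts the plain row mean; the kernel in the question additionally divides the mean by 2.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One warp (32 threads) per row. Assumed launch: <<<height, 32>>>.
__global__ void subtractMeanWarp(const float *__restrict__ img, float *lineImg,
                                 int height, int width) {
    int row = blockIdx.x;
    if (row >= height) return;  // the whole warp exits together

    // Step 1: each lane accumulates a strided partial sum of the row.
    float sum = 0.0f;
    for (int col = threadIdx.x; col < width; col += blockDim.x) {
        sum += img[row * width + col];
    }

    // Step 2: shuffle reduction; lane 0 ends up with the row total.
    for (int offset = 16; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    }

    // Step 3: broadcast lane 0's mean to every lane of the warp.
    float mean = __shfl_sync(0xffffffffu, sum / (float)width, 0);

    // Step 4: subtract the mean from every pixel of the row.
    for (int col = threadIdx.x; col < width; col += blockDim.x) {
        lineImg[row * width + col] = img[row * width + col] - mean;
    }
}

// Hypothetical host helper; error checking is omitted for brevity.
std::vector<float> subtractRowMeansOnGpu(const std::vector<float> &img,
                                         int height, int width) {
    size_t bytes = img.size() * sizeof(float);
    std::vector<float> out(img.size());
    float *dImg = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dImg, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dImg, img.data(), bytes, cudaMemcpyHostToDevice);
    subtractMeanWarp<<<height, 32>>>(dImg, dOut, height, width);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dOut);
    return out;
}
```

For the 512 x 1024 image in the question this would be called as `subtractRowMeansOnGpu(image, 1024, 512)`, giving 1024 blocks of one warp each.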

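Possible further optimizations would be to include more than one warp in a block to improve occupancy, and to manually unroll the loop over the width to improve instruction-level parallelism.

Good luck.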
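To make those two optimizations concrete, here is a hedged sketch that uses a 2D block of 32 x 8 threads (one warp per row, eight rows per block) and a summation loop unrolled by hand with four independent accumulators. The value 8, the unroll factor of 4 and the names `ROWS_PER_BLOCK`, `subtractMeanMultiWarp` and `subtractRowMeansMultiWarp` are assumptions for illustration, not tuned choices. It again assumes CUDA 9 or later for the `_sync` shuffle intrinsics and omits error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define ROWS_PER_BLOCK 8  // hypothetical: warps (rows) handled per block

// Assumed launch: block = dim3(32, ROWS_PER_BLOCK), so every threadIdx.y
// value maps to exactly one warp, and every warp handles one row.
__global__ void subtractMeanMultiWarp(const float *__restrict__ img,
                                      float *lineImg, int height, int width) {
    int row = blockIdx.x * blockDim.y + threadIdx.y;
    if (row >= height) return;  // a warp exits as a whole

    const float *rowIn = img + row * width;
    float *rowOut = lineImg + row * width;

    // Manually unrolled strided sum: four independent accumulators.
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int col = threadIdx.x;
    for (; col + 96 < width; col += 128) {
        s0 += rowIn[col];
        s1 += rowIn[col + 32];
        s2 += rowIn[col + 64];
        s3 += rowIn[col + 96];
    }
    for (; col < width; col += 32) {  // leftover columns
        s0 += rowIn[col];
    }
    float sum = (s0 + s1) + (s2 + s3);

    // Warp shuffle reduction, then broadcast of the mean from lane 0.
    for (int offset = 16; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    }
    float mean = __shfl_sync(0xffffffffu, sum / (float)width, 0);

    for (col = threadIdx.x; col < width; col += 32) {
        rowOut[col] = rowIn[col] - mean;
    }
}

// Hypothetical host helper that sizes the grid to cover all rows.
std::vector<float> subtractRowMeansMultiWarp(const std::vector<float> &img,
                                             int height, int width) {
    size_t bytes = img.size() * sizeof(float);
    std::vector<float> out(img.size());
    float *dImg = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dImg, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dImg, img.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(32, ROWS_PER_BLOCK);
    int blocks = (height + ROWS_PER_BLOCK - 1) / ROWS_PER_BLOCK;
    subtractMeanMultiWarp<<<blocks, block>>>(dImg, dOut, height, width);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dOut);
    return out;
}
```

Since no data is shared between warps, no `__syncthreads()` is needed. Whether 8 warps per block is a good choice depends on the GPU, so it is worth profiling a few values.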

CUDA - Optimizing the mean calculation of matrix rows using shared memory
