Question

I wrote the following simple C++ code.

#include <iostream>
#include <omp.h>

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    std::cout << "Enter my Number Value" << std::endl;
    std::cin >> myNumber;

    // The outermost loop is split across OpenMP threads; each thread keeps a
    // private numOfHits that is summed when the loop ends.
    #pragma omp parallel for reduction(+:numOfHits)
    for(int i = 0; i <= 100000; ++i)
    {
        for(int j = 0; j <= 100000; ++j)
        {
            for(int k = 0; k <= 100000; ++k)
            {
                if(i + j + k == myNumber)
                    numOfHits++;
            }
        }
    }

    std::cout << "Number of Hits " << numOfHits << std::endl;

    return 0;
}

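As you can see, I use OpenMP to parallelize the outermost loop. What I would like to do is rewrite this small piece of code in CUDA. Any help is greatly appreciated.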

Answer 1

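Well, I can give you a quick tutorial, but I won't necessarily write it all for you.

First, you need to set up MS Visual Studio with CUDA, which is easy if you follow this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/

Next, you will want to read the NVIDIA CUDA Programming Guide (a free PDF), the documentation, and the CUDA examples (which I highly recommend for learning CUDA).

But let's say you haven't done that yet, and definitely will later.

This is a very arithmetic-heavy, data-light computation. In fact, the result can be computed fairly simply without this brute-force approach, but that isn't the answer you're looking for. I suggest something like this for the kernel: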

__global__ void kernel(const int* myNumber, unsigned long long* numOfHits){

    //a shared value is stored on-chip and is visible to every thread in the block
    //shared variables cannot have an initializer, so one thread zeroes it here
    //(64-bit counters, because the total can exceed the range of int)
    __shared__ unsigned long long s_hits;
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();

    //this identifies the current thread uniquely
    int i0 = (threadIdx.x + blockIdx.x*blockDim.x);
    int j0 = (threadIdx.y + blockIdx.y*blockDim.y);
    int target = *myNumber;

    //each thread counts privately first, so the loops themselves need no atomics
    unsigned long long localHits = 0;

    //we increment i and j by an amount equal to the number of threads in one dimension of the block, 16 usually, times the number of blocks in one dimension, which can be quite large (but not 100,000)
    //bounds are inclusive to match the question; j restarts at j0 for every i
    for(int i = i0; i <= 100000; i += blockDim.x*gridDim.x){
        for(int j = j0; j <= 100000; j += blockDim.y*gridDim.y){
            //Thanks to talonmies for this simplification:
            //k = target-i-j is the only k that can match, so only its range is checked
            int k = target - i - j;
            if(0 <= k && k <= 100000)
                localHits++;
        }
    }

    //atomics are needed here, otherwise the value may change during the 'read, modify, write' process
    atomicAdd(&s_hits, localHits);

    //synchronize threads, so we know s_hits is completely updated
    __syncthreads();

    //again, atomics
    //we make sure only one thread per threadblock actually adds in s_hits
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}
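
The range check credited to talonmies in the kernel removes the k loop. The same argument can be pushed one step further, which is roughly what the earlier remark about avoiding brute force points at. The sketch below is plain C++ for the CPU, not part of the answer's code; the names countPairs and countTriples are made up for illustration, and the inclusive bound n matches the question's loops.

```cpp
#include <algorithm>
#include <cstdint>

// Number of pairs (j, k) with 0 <= j, k <= n and j + k == s.
// j must lie in [max(0, s - n), min(n, s)], and each such j fixes k = s - j.
std::int64_t countPairs(std::int64_t s, std::int64_t n) {
    if (s < 0 || s > 2 * n) return 0;
    return std::min(n, s) - std::max<std::int64_t>(0, s - n) + 1;
}

// Number of triples (i, j, k) with 0 <= i, j, k <= n and i + j + k == target.
// One loop over i replaces the three nested loops of the brute-force version.
std::int64_t countTriples(std::int64_t target, std::int64_t n) {
    std::int64_t hits = 0;
    for (std::int64_t i = 0; i <= n; ++i)
        hits += countPairs(target - i, n);
    return hits;
}
```

Calling countTriples(myNumber, 100000) should give the same count as the triple loop in the question, using about 100,001 iterations. The result is 64-bit because the count can be larger than a 32-bit int can hold.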

To launch the kernel, you'll need something along these lines:

dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
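
One way to get a feel for how a blocks/threads configuration divides the (i, j) space is to emulate the indexing on the CPU. The sketch below is ordinary C++, not CUDA. The parameters gridDimX, gridDimY, blockDimX and blockDimY are plain ints standing in for CUDA's built-in variables, and both function names are hypothetical. It assumes a strided scheme in which each thread starts at its own global index and the inner index restarts from that starting value on every outer iteration.

```cpp
#include <vector>

// For every (i, j) in [0, n] x [0, n], count how many emulated threads visit it.
std::vector<int> visitCounts(int n, int gridDimX, int gridDimY,
                             int blockDimX, int blockDimY) {
    std::vector<int> visits((n + 1) * (n + 1), 0);
    for (int bx = 0; bx < gridDimX; ++bx)
    for (int by = 0; by < gridDimY; ++by)
    for (int tx = 0; tx < blockDimX; ++tx)
    for (int ty = 0; ty < blockDimY; ++ty) {
        // Same arithmetic a CUDA thread would use for its global index.
        int i0 = tx + bx * blockDimX;
        int j0 = ty + by * blockDimY;
        // Stride by the total number of threads along each dimension.
        for (int i = i0; i <= n; i += blockDimX * gridDimX)
            for (int j = j0; j <= n; j += blockDimY * gridDimY)
                ++visits[i * (n + 1) + j];
    }
    return visits;
}

// True when each (i, j) pair is handled by exactly one emulated thread.
bool everyPairVisitedOnce(int n, int gridDimX, int gridDimY,
                          int blockDimX, int blockDimY) {
    for (int v : visitCounts(n, gridDimX, gridDimY, blockDimX, blockDimY))
        if (v != 1) return false;
    return true;
}
```

With a small range such as n = 50 and a 2 x 3 grid of 4 x 4 blocks, everyPairVisitedOnce reports that no pair is skipped or counted twice, which is the property the launch configuration has to preserve whatever value is chosen for some_number.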

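I know you probably want a quick way to do this, but getting into CUDA isn't really a "quick" thing. You will need to do some reading and some setup to get it working; past that, the learning curve isn't too steep. I haven't told you anything about memory allocation yet, so you'll need to handle that yourself (although it's simple). If you followed my code, my aim is that you had to read up a bit on shared memory and CUDA, so you've already made a start. Good luck!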

Disclaimer: I haven't tested my code, and I'm not an expert, so it could be idiotic.
