Question

I have been trying to improve the execution time of my PSO (particle swarm optimization) implementation for image template matching.

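Basically, the algorithm tries to locate a "sub-image" (the template) inside the original image. It does this with an array of "particles" that keep updating a "fitness" value on every iteration (it works similarly to a genetic algorithm). The fitness is the result of this function (normalized cross-correlation):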

double ncc(const CImg<int> &image, const CImg<int> &templateImg, int x, int y){
    double numerator = 0, imgSum = 0, tempSum = 0;

    cimg_forXY(templateImg, i, j){
        numerator += image(x+i, y+j) * templateImg(i, j);

        imgSum += pow(image(x+i, y+j), 2);
        tempSum += pow(templateImg(i, j), 2);
    }

    return numerator / (sqrt(imgSum) * sqrt(tempSum));
}
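Written out as a formula, the function above appears to compute the following. Here I stands for image and T for templateImg; both symbols are shorthand introduced for this note, not names from the code. The sums run over every template pixel (i, j). Note that this variant does not subtract the mean from either patch.

```latex
\[
\mathrm{ncc}(x, y) =
\frac{\sum_{i,j} I(x+i,\, y+j)\, T(i, j)}
     {\sqrt{\sum_{i,j} I(x+i,\, y+j)^{2}}\;\sqrt{\sum_{i,j} T(i, j)^{2}}}
\]
```

By the Cauchy-Schwarz inequality the value never exceeds 1, and it reaches 1 when the image patch is a positive multiple of the template. The expression is undefined when either sum of squares is zero.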

I'm using the CImg library and its predefined macros to loop over the images.

I parallelized the algorithm by giving each thread a subset of the particle array to process (I'm using the threading facilities from <future>):

auto updateSubset = [&](int nThread, int from, int to){
    double score;
    Particle localgbest = gbest, currentParticle;

    for(int i = from; i < to; i++){
        currentParticle = particles[i];

        score = ncc(subsetImgs[nThread], subsetTemplates[nThread], 
            currentParticle.getPosition().getX(), 
            currentParticle.getPosition().getY()
        );

        if(score > currentParticle.getFitness()){
            currentParticle.setFitness(score);
            currentParticle.setPBest(currentParticle.getPosition());

            if(currentParticle.getFitness() > localgbest.getFitness()){
                localgbest = currentParticle;
            }
        }

        particles[i] = currentParticle;
    }

    return localgbest;
};
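The from and to arguments come from a bounds[] array whose construction is not shown in the post. As a minimal sketch of one way such an array could be filled, assuming contiguous chunks of nearly equal size (makeBounds is a hypothetical helper, not part of the original code):

```cpp
#include <vector>

// Hypothetical helper (not from the original post): splits the index range
// [0, nParticles) into nThreads contiguous chunks whose sizes differ by at
// most one. Thread j would process the half-open range
// [bounds[j], bounds[j + 1]).
std::vector<int> makeBounds(int nParticles, int nThreads) {
    std::vector<int> bounds(nThreads + 1);
    const int base = nParticles / nThreads;   // minimum chunk size
    const int extra = nParticles % nThreads;  // first `extra` chunks get one more

    bounds[0] = 0;
    for (int j = 0; j < nThreads; ++j) {
        bounds[j + 1] = bounds[j] + base + (j < extra ? 1 : 0);
    }
    return bounds;
}
```

With 10 particles and 3 threads this yields the boundaries 0, 4, 7, 10, so no particle is processed by two threads and none is skipped.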

(gbest, particles[], subsetImgs[] and subsetTemplates[] are passed by reference, through the lambda's [&] capture.) Finally, the main loop of the program:

for(int i = 0; i < nIterations; i++){
    for(int j = 0; j < nThreads; j++){
        threads[j] = async(launch::async, updateSubset, j, bounds[j], bounds[j + 1]);
    }

    for(int j = 0; j < nThreads; j++){            
        currentThreadGbest = threads[j].get();

        if(currentThreadGbest.getFitness() > gbest.getFitness())
            gbest = currentThreadGbest;
    }

    //velocity and position update
    for(int j = 0; j < nParticles; j++){
        particles[j].updatePosition(minPos, maxPos, gbest.getPosition());
    }
}

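As you can see, I tried to avoid false sharing by using as few by-reference variables as possible inside the thread function. That is why I create the arrays subsetImgs[] and subsetTemplates[], both filled with copies of the same image, so that the threads do not read the same image at the same time. As far as I have been able to find out, the CImg macros do not block, but I still get poor performance with 2 or more threads (my CPU is an AMD A6-4400M with 2 cores, and I measure the time with clock):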

Non-parallel version of the algorithm: [timing missing]
Parallel version with 1 thread: [timing missing]
Parallel version with 2 threads: time = 3.09362
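Since the timings above were taken with clock, one thing that may be worth checking is which kind of time is being reported. std::clock returns processor time used by the whole process, and on POSIX systems that adds up the time spent on every thread, so it may not reflect elapsed time for multi-threaded code. The sketch below is not a diagnosis of this program; it only shows how both values could be captured side by side. Timing and measure are hypothetical names introduced here.

```cpp
#include <chrono>
#include <ctime>

// Hypothetical helper types (not from the original post).
struct Timing {
    double cpuSeconds;   // processor time used by the process, per std::clock
    double wallSeconds;  // elapsed real time, per std::chrono::steady_clock
};

// Runs `work` once and reports both processor time and wall-clock time.
template <typename Work>
Timing measure(Work work) {
    const std::clock_t cpuStart = std::clock();
    const auto wallStart = std::chrono::steady_clock::now();

    work();

    const auto wallEnd = std::chrono::steady_clock::now();
    const std::clock_t cpuEnd = std::clock();

    Timing t;
    t.cpuSeconds = static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    t.wallSeconds = std::chrono::duration<double>(wallEnd - wallStart).count();
    return t;
}
```

Wrapping the main loop in a lambda and passing it to measure would give both numbers for the same run. If cpuSeconds grows with the thread count while wallSeconds does not, the measurement method rather than the algorithm would be the first thing to look at.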

Any clue why this is happening? Thanks in advance! (Sorry for my English.)

Worse performance using multiple threads in C++
