OpenCL kernel is slower than a plain Java loop

Time: 2016-01-02 18:26:03

Tags: java performance opencl gpu lwjgl

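I have been looking into OpenCL as a way to optimize code and run tasks in parallel, to get more speed than pure Java. Now I have run into a problem.

I have put together a Java program using LWJGL which, as far as I can tell, should perform nearly identical tasks (in this case, adding the elements of two arrays together and storing the result in a third array) in two different ways: one in pure Java, and the other in an OpenCL kernel. I am using System.currentTimeMillis() to track how long each one takes for arrays with a large number of elements (~10,000,000). For whatever reason, the pure Java loop seems to run about 3 to 10 times faster than the CL program, depending on array size. My code is below (imports omitted):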

public class TestCL {

    private static final int SIZE = 9999999; //Size of arrays to test, this value is changed sometimes in between tests

    private static CLContext context; //CL Context
    private static CLPlatform platform; //CL platform
    private static List<CLDevice> devices; //List of CL devices
    private static CLCommandQueue queue; //Command Queue for context
    private static float[] aData, bData, rData; //float arrays to store test data

    //---Kernel Code---
    //The actual kernel script is here:
    //-----------------
    private static String kernel = "kernel void sum(global const float* a, global const float* b, global float* result, int const size){\n" + 
            "const int itemId = get_global_id(0);\n" + 
            "if(itemId < size){\n" + 
            "result[itemId] = a[itemId] + b[itemId];\n" +
            "}\n" +
            "}";;

    public static void main(String[] args){

        aData = new float[SIZE];
        bData = new float[SIZE];
        rData = new float[SIZE]; //Only used for CPU testing

        //arbitrary testing data
        for(int i=0; i<SIZE; i++){
            aData[i] = i;
            bData[i] = SIZE - i;
        }

        try {
            testCPU(); //How long does it take running in traditional Java code on the CPU?
            testGPU(); //How long does the GPU take to run it w/ CL?
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    /**
     * Test the CPU with pure Java code
     */
    private static void testCPU(){
        long time = System.currentTimeMillis();
        for(int i=0; i<SIZE; i++){
            rData[i] = aData[i] + bData[i];
        }
        //Print the time FROM THE START OF THE testCPU() FUNCTION UNTIL NOW
        System.out.println("CPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));
    }

    /**
     * Test the GPU with OpenCL
     * @throws LWJGLException
     */
    private static void testGPU() throws LWJGLException {
        CLInit(); //Initialize CL and CL Objects

        //Create the CL Program
        CLProgram program = CL10.clCreateProgramWithSource(context, kernel, null);

        int error = CL10.clBuildProgram(program, devices.get(0), "", null);
        Util.checkCLError(error);

        //Create the Kernel
        CLKernel sum = CL10.clCreateKernel(program, "sum", null);

        //Error checker
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        //Floatbuffer for the first array of floats
        FloatBuffer aBuf = BufferUtils.createFloatBuffer(SIZE);
        aBuf.put(aData);
        aBuf.rewind();
        //Input buffers are only read by the kernel, so they are created CL_MEM_READ_ONLY
        CLMem aMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //And the second
        FloatBuffer bBuf = BufferUtils.createFloatBuffer(SIZE);
        bBuf.put(bData);
        bBuf.rewind();
        CLMem bMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Memory object to store the result; the kernel only writes to it, so CL_MEM_WRITE_ONLY
        CLMem rMem = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, SIZE * 4, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Get time before setting kernel arguments
        long time = System.currentTimeMillis();

        sum.setArg(0, aMem);
        sum.setArg(1, bMem);
        sum.setArg(2, rMem);
        sum.setArg(3, SIZE);

        final int dim = 1;
        PointerBuffer workSize = BufferUtils.createPointerBuffer(dim);
        workSize.put(0, SIZE);

        //Actually running the program
        CL10.clEnqueueNDRangeKernel(queue, sum, dim, null, workSize, null, null, null);
        CL10.clFinish(queue);

        //Write results to a FloatBuffer
        FloatBuffer res = BufferUtils.createFloatBuffer(SIZE);
        CL10.clEnqueueReadBuffer(queue, rMem, CL10.CL_TRUE, 0, res, null, null);

        //How long did it take?
        //Print the time FROM THE SETTING OF KERNEL ARGUMENTS UNTIL NOW
        System.out.println("GPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));

        //Cleanup objects
        CL10.clReleaseKernel(sum);
        CL10.clReleaseProgram(program);
        CL10.clReleaseMemObject(aMem);
        CL10.clReleaseMemObject(bMem);
        CL10.clReleaseMemObject(rMem);

        CLCleanup();
    }

    /**
     * Initialize CL objects
     * @throws LWJGLException
     */
    private static void CLInit() throws LWJGLException {
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        CL.create();

        platform = CLPlatform.getPlatforms().get(0);
        devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
        context = CLContext.create(platform, devices, eBuf);
        queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, eBuf);

        Util.checkCLError(eBuf.get(0));
    }

    /**
     * Cleanup after CL completion
     */
    private static void CLCleanup(){
        CL10.clReleaseCommandQueue(queue);
        CL10.clReleaseContext(context);
        CL.destroy();
    }

}

Here are some sample console results from various tests:

CPU processing time for 10000000 elements: 24
GPU processing time for 10000000 elements: 88

CPU processing time for 1000000 elements: 7
GPU processing time for 1000000 elements: 10

CPU processing time for 100000000 elements: 193
GPU processing time for 100000000 elements: 943

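Is there something wrong with my code that is causing the CPU to be faster, or is this actually to be expected in a case like this? If it is the latter, when is CL preferable?

1 answer:

Answer 0 (score: 0)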

I modified my test to do something I believe is more computationally expensive than simple addition.

In the CPU test, the line:

rData[i] = aData[i] + bData[i];

was changed to:

rData[i] = (float)(Math.sin(aData[i]) * Math.cos(bData[i]));

In the CL kernel, the line:

result[itemId] = a[itemId] + b[itemId];

was changed to:

result[itemId] = sin(a[itemId]) * cos(b[itemId]);

I am now getting console results such as:

CPU processing time for 1000000 elements: 154
GPU processing time for 1000000 elements: 11

CPU processing time for 10000000 elements: 8699
GPU processing time for 10000000 elements: 98

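(The CPU was taking longer than I cared to wait for the test with 100000000 elements.)

To check accuracy, I added checks that compare arbitrary elements of rData and res to make sure they are the same. I have omitted the results here, as it should suffice to say that they were equal.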
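For anyone repeating that check, here is a minimal sketch of one way to compare the two result arrays. It is not the code used above: the class name ResultCheck, the method names and the tolerance values are all hypothetical, and it assumes the contents of res have first been copied into a float[] (for example with FloatBuffer.get(float[])). A tolerance is used rather than ==, on the assumption that the double-precision Math.sin/Math.cos on the CPU and the single-precision sin/cos in the kernel may round slightly differently.

```java
import java.util.Random;

final class ResultCheck {

    /** True when a and b differ by at most absTol, or by at most relTol relative to the larger magnitude. */
    static boolean nearlyEqual(float a, float b, float relTol, float absTol) {
        float diff = Math.abs(a - b);
        if (diff <= absTol) {
            return true;
        }
        return diff <= relTol * Math.max(Math.abs(a), Math.abs(b));
    }

    /**
     * Compares `samples` randomly chosen indices of the two arrays.
     * Returns the first sampled index that mismatches, or -1 if none does.
     * The tolerances here are assumed values, not taken from the original post.
     */
    static int firstMismatch(float[] expected, float[] actual, int samples, long seed) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("arrays differ in length");
        }
        if (expected.length == 0) {
            return -1;
        }
        Random rnd = new Random(seed);
        for (int s = 0; s < samples; s++) {
            int i = rnd.nextInt(expected.length);
            if (!nearlyEqual(expected[i], actual[i], 1e-4f, 1e-5f)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int n = 1000;
        float[] cpu = new float[n];
        float[] gpu = new float[n]; // stand-in for the values read back from the device
        for (int i = 0; i < n; i++) {
            cpu[i] = (float) (Math.sin(i) * Math.cos(n - i));
            gpu[i] = cpu[i];
        }
        System.out.println(firstMismatch(cpu, gpu, 100, 42L)); // prints -1: the arrays are identical
    }
}
```

Sampling only catches widespread errors; a full pass over every index is cheap compared with the benchmark itself if a stricter check is wanted.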

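Now that the function is more complicated (two trigonometric functions multiplied together), it appears that the CL kernel is far more efficient than the pure Java loop.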
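One caveat about the CPU numbers: a single cold run timed with System.currentTimeMillis() also includes JIT warm-up and has coarse resolution. The sketch below is one possible way to time the CPU loop more carefully, not the code that produced the figures above. The class CpuTiming, its method names and the warm-up/run counts are hypothetical choices; it runs the loop a few times untimed, then reports the best of several runs using System.nanoTime().

```java
final class CpuTiming {

    /** The same per-element work as the modified CPU test: sin(a) * cos(b). */
    static void sinCos(float[] a, float[] b, float[] r) {
        for (int i = 0; i < r.length; i++) {
            r[i] = (float) (Math.sin(a[i]) * Math.cos(b[i]));
        }
    }

    /** Runs the loop `warmup` times untimed, then returns the fastest of `runs` timed runs in milliseconds. */
    static double bestMillis(float[] a, float[] b, float[] r, int warmup, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int w = 0; w < warmup; w++) {
            sinCos(a, b, r); // give the JIT a chance to compile the hot loop
        }
        long best = Long.MAX_VALUE;
        for (int k = 0; k < runs; k++) {
            long start = System.nanoTime();
            sinCos(a, b, r);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1e6;
    }

    public static void main(String[] args) {
        int size = 1_000_000; // assumed size; adjust to match the test being compared
        float[] a = new float[size];
        float[] b = new float[size];
        float[] r = new float[size];
        for (int i = 0; i < size; i++) { // same arbitrary test data as the question
            a[i] = i;
            b[i] = size - i;
        }
        System.out.println("CPU best time for " + size + " elements: "
                + bestMillis(a, b, r, 3, 5) + " ms");
    }
}
```

The GPU side could be wrapped the same way, repeating only the enqueue, clFinish and read-back steps, so that both measurements exclude one-time start-up costs.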