Multi-threaded rendering (command buffer generation) in Vulkan is slower than single-threaded

Time: 2018-06-29 15:35:16

Tags: multithreading c++11 vulkan

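I am trying to implement multi-threaded command buffer generation (using a per-thread command pool and secondary command buffers), but using multiple threads yields almost no performance gain.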

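First, I thought my thread pool code was written incorrectly, but I tried Sascha Willems's thread pool implementation and nothing changed (so I don't think that is the problem).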

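Second, I searched for multi-threading performance issues and found that accessing the same variables/resources from different threads can cause a performance drop, but I still couldn't solve the problem.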

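I also downloaded Sascha Willems's multi-threading code and ran it, and it works well. I changed the number of worker threads, and using more threads clearly improves performance.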

Here are some FPS results for rendering 600 objects (all the same model). You can see what my problem is:

Core count    Sascha Willems's result    My result, validation layers enabled    My result, validation layers disabled
              (avg. FPS)                 (avg. FPS)                              (avg. FPS)

1             45                         30                                      55
2             83                         33                                      72
4             110                        40                                      84
6             155                        42                                      103
8             162                        42                                      104
10            173                        40                                      111
12            175                        40                                      119
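To make the scaling easier to compare, the numbers in the table can be expressed as speedups relative to the single-core run. The sketch below only restates the table; the names `FpsRow`, `kResults`, and `speedup` are made up for illustration.

```cpp
#include <array>

// One row of the FPS table above (all values are average FPS).
struct FpsRow {
    int cores;
    double sascha;         // Sascha Willems's example
    double validationOn;   // my code, validation layers enabled
    double validationOff;  // my code, validation layers disabled
};

constexpr std::array<FpsRow, 7> kResults = {{
    {1, 45.0, 30.0, 55.0},
    {2, 83.0, 33.0, 72.0},
    {4, 110.0, 40.0, 84.0},
    {6, 155.0, 42.0, 103.0},
    {8, 162.0, 42.0, 104.0},
    {10, 173.0, 40.0, 111.0},
    {12, 175.0, 40.0, 119.0},
}};

// Speedup of a run relative to the single-core baseline.
constexpr double speedup(double fps, double baselineFps) {
    return fps / baselineFps;
}
```

At 12 cores this gives roughly 3.9x for Sascha Willems's example (175 / 45), 1.3x for my code with validation layers enabled (40 / 30), and 2.2x with them disabled (119 / 55).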

This is where I prepare the thread data:

void prepareThreadData()
{
    // Command pool and primary command buffer used by the main thread.
    primaryCommandPool = m_device.createCommandPool(
        vk::CommandPoolCreateInfo(
            vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
            graphicsQueueIdx
        )
    );

    primaryCommandBuffer = m_device.allocateCommandBuffers(
        vk::CommandBufferAllocateInfo(
            primaryCommandPool,
            vk::CommandBufferLevel::ePrimary,
            1
        )
    )[0];

    threadData.resize(numberOfThreads);

    for (int i = 0; i < numberOfThreads; ++i)
    {
        // Each worker thread gets its own command pool.
        threadData[i].commandPool = m_device.createCommandPool(
            vk::CommandPoolCreateInfo(
                vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
                graphicsQueueIdx
            )
        );

        // One secondary command buffer per object handled by this thread.
        threadData[i].commandBuffer = m_device.allocateCommandBuffers(
            vk::CommandBufferAllocateInfo(
                threadData[i].commandPool,
                vk::CommandBufferLevel::eSecondary,
                numberOfObjectsPerThread
            )
        );

        // Per-object push constants (a random position for each object).
        for (int j = 0; j < numberOfObjectsPerThread; ++j)
        {
            VertexPushConstant pushConstant = { someRandomPosition() };
            threadData[i].pushConstBlock.push_back(pushConstant);
        }
    }
}

This is my render loop code, where I assign work to each thread:

while (!display.IsWindowClosed())
{
    display.PollEvents();

    m_device.acquireNextImageKHR(m_swapChain, std::numeric_limits<uint64_t>::max(), presentCompleteSemaphore, nullptr, &currentBuffer);

    // The primary command buffer only begins the render pass and executes the secondary buffers.
    primaryCommandBuffer.begin(vk::CommandBufferBeginInfo());
    primaryCommandBuffer.beginRenderPass(
        vk::RenderPassBeginInfo(m_renderPass, m_swapChainBuffers[currentBuffer].frameBuffer, m_renderArea, clearValues.size(), clearValues.data()),
        vk::SubpassContents::eSecondaryCommandBuffers);

    vk::CommandBufferInheritanceInfo inheritanceInfo = {};
    inheritanceInfo.renderPass = m_renderPass;
    inheritanceInfo.framebuffer = m_swapChainBuffers[currentBuffer].frameBuffer;

    // Queue one job per object on the thread that owns that object's command buffer.
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            threadPool.threads[t]->addJob([=]
            {
                std::array<vk::DeviceSize, 1> offsets = { 0 };
                vk::Viewport viewport = vk::Viewport(0.0f, 0.0f, WIDTH, HEIGHT, 0.0f, 1.0f);
                vk::Rect2D renderArea = vk::Rect2D(vk::Offset2D(), vk::Extent2D(WIDTH, HEIGHT));

                threadData[t].commandBuffer[i].begin(vk::CommandBufferBeginInfo(vk::CommandBufferUsageFlagBits::eRenderPassContinue, &inheritanceInfo));
                threadData[t].commandBuffer[i].setViewport(0, viewport);
                threadData[t].commandBuffer[i].setScissor(0, renderArea);
                threadData[t].commandBuffer[i].bindPipeline(vk::PipelineBindPoint::eGraphics, m_graphicsPipeline);
                threadData[t].commandBuffer[i].bindVertexBuffers(VERTEX_BUFFER_BIND, 1, &model.vertexBuffer, offsets.data());
                threadData[t].commandBuffer[i].bindIndexBuffer(model.indexBuffer, 0, vk::IndexType::eUint32);
                threadData[t].commandBuffer[i].pushConstants(pipelineLayout, vk::ShaderStageFlagBits::eVertex, 0, sizeof(VertexPushConstant), &threadData[t].pushConstBlock[i]);
                threadData[t].commandBuffer[i].drawIndexed(model.indexCount, 1, 0, 0, 0);
                threadData[t].commandBuffer[i].end();
            });
        }
    }

    // Block until every worker has finished recording.
    threadPool.wait();

    // Collect all secondary command buffers and execute them from the primary one.
    std::vector<vk::CommandBuffer> commandBuffers;
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            commandBuffers.push_back(threadData[t].commandBuffer[i]);
        }
    }

    primaryCommandBuffer.executeCommands(commandBuffers.size(), commandBuffers.data());
    primaryCommandBuffer.endRenderPass();
    primaryCommandBuffer.end();

    submitQueue(presentCompleteSemaphore, primaryCommandBuffer);
}
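The loop above queues one job per object, which is 600 jobs per frame. A variation that may be worth measuring is to queue a single job per thread that loops over all of that thread's objects; nothing in this post confirms that job granularity is the bottleneck, so treat it as an experiment. The sketch below uses plain `std::thread` instead of the thread pool, and `recordObject` is a hypothetical stand-in for the begin/bind/draw/end sequence.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for recording one object's secondary command buffer.
// It only marks the object as recorded so the result can be checked.
inline void recordObject(std::vector<int>& recorded, std::size_t index) {
    recorded[index] = 1;
}

// Starts one job per thread; each job records every object owned by that thread.
// Returns one flag per object (thread-major order) so coverage can be verified.
inline std::vector<int> recordAllObjects(std::size_t numberOfThreads,
                                         std::size_t numberOfObjectsPerThread) {
    std::vector<int> recorded(numberOfThreads * numberOfObjectsPerThread, 0);
    std::vector<std::thread> workers;
    workers.reserve(numberOfThreads);

    for (std::size_t t = 0; t < numberOfThreads; ++t) {
        workers.emplace_back([&recorded, t, numberOfObjectsPerThread] {
            // Each thread writes only to its own disjoint slice of the vector.
            const std::size_t first = t * numberOfObjectsPerThread;
            for (std::size_t i = 0; i < numberOfObjectsPerThread; ++i) {
                recordObject(recorded, first + i);
            }
        });
    }

    // Equivalent of threadPool.wait(): join before the results are used.
    for (auto& worker : workers) {
        worker.join();
    }
    return recorded;
}
```

With 4 threads and 150 objects per thread, `recordAllObjects(4, 150)` covers all 600 objects while starting only 4 jobs.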

If you have any idea what I am missing or doing wrong, please let me know.

Here is the complete VS 2017 project, if anyone wants to play with it :D

I know it is a MESS, but I am just learning Vulkan.

1 answer:

Answer 0 (score: 1)

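It seems I found the problem: I had not disabled the validation layers. After I disabled them, performance improved a lot; I added a fourth column to the table in the question for comparison. Who knew the validation layers cost so much run time. If anyone wants to measure Vulkan performance, don't forget to disable them!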