Question

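I have a very performance-critical piece of code that needs optimizing. It converts a byte array to a word array and vice versa. The operation is used to convert between 8-bit and 16-bit image data.

The array is qword-aligned and large enough to store the result.

Converting from bytes to words requires multiplying by 257 (so 0 converts to 0 and 255 yields 65535).

A simple solution might be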
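The factor 257 (0x101) works because multiplying a byte by it is the same as copying the byte into both halves of a 16-bit word. A minimal sketch of that identity follows; the helper name `expandByte` is hypothetical and not part of the original code:

```cpp
#include <cstdint>

// Hypothetical helper: widen an 8-bit sample to 16 bits by multiplying by 257.
constexpr std::uint16_t expandByte(std::uint8_t b)
{
    return static_cast<std::uint16_t>(b * 0x101);
}

// b * 257 == b * 256 + b == (b << 8) | b, so the byte lands in both halves.
static_assert(expandByte(0x00) == 0x0000, "0 maps to 0");
static_assert(expandByte(0xFF) == 0xFFFF, "255 maps to 65535");
static_assert(expandByte(0x12) == 0x1212, "the byte is replicated");
static_assert(expandByte(0x80) == ((0x80 << 8) | 0x80), "same as shift-or");
```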

void simpleBytesToWords(void *ptr, int pixelCount)
{
    for (int i = pixelCount - 1; i >= 0; --i)
        reinterpret_cast<uint16_t*>(ptr)[i] = reinterpret_cast<uint8_t*>(ptr)[i] * 0x101;
}

I also tried to improve performance by converting 4 bytes at a time, to make use of 64-bit registers:

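void bytesToWords(void *ptr, int pixelCount)
{
    const auto fastCount = pixelCount / 4;

    // Convert the leftover (pixelCount % 4) pixels first, back to front.
    // The grouped loop below overwrites the bytes they are read from, and a
    // forward pass would clobber unread source bytes when pixelCount < 4.
    if (pixelCount % 4)
    {
        auto source = reinterpret_cast<const uint8_t*>(ptr);
        auto target = reinterpret_cast<uint16_t*>(ptr);

        for (int i = pixelCount - 1; i >= fastCount * 4; --i)
        {
            target[i] = (source[i] << 8) | source[i];
        }
    }

    // Expand 4 bytes into 4 words per iteration (assumes a little-endian layout),
    // back to front so that unread input is never overwritten.
    for (int f = fastCount-1; f >= 0; --f)
    {
        auto bytes = uint64_t{ reinterpret_cast<const uint32_t*>(ptr)[f] };

        // Spread the 4 bytes into the low byte of each 16-bit lane.
        auto r2 = uint64_t{ bytes & 0xFF };
        bytes <<= 8;
        r2 |= bytes & 0xFF0000;
        bytes <<= 8;
        r2 |= bytes & 0xFF00000000ull;
        bytes <<= 8;
        r2 |= bytes & 0xFF000000000000ull;

        // Multiply all four lanes by 257 at once; no lane can overflow into the next.
        r2 *= 0x101;

        reinterpret_cast<uint64_t*>(ptr)[f] = r2;
    }
}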

It works, and it is slightly faster than the simple solution.

The other direction (words to bytes) is done by the following code:

for (int i = 0; i < pixelCount; ++i)
    reinterpret_cast<uint8_t*>(bufferPtr)[i] = reinterpret_cast<uint16_t*>(bufferPtr)[i] / 256;
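Dividing by 256 keeps the high byte of each word, which exactly undoes the multiplication by 257. A small sketch of a round-trip check follows; `wordsToBytes` and `roundTripIsLossless` are hypothetical names, and the separate source and destination buffers are an assumption for illustration rather than the in-place layout used above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical out-of-place variant: keep the high byte of every 16-bit sample.
std::vector<std::uint8_t> wordsToBytes(const std::vector<std::uint16_t>& words)
{
    std::vector<std::uint8_t> bytes(words.size());
    for (std::size_t i = 0; i < words.size(); ++i)
        bytes[i] = static_cast<std::uint8_t>(words[i] >> 8);  // same as / 256 for unsigned values
    return bytes;
}

// Returns true if every byte value survives bytes -> words -> bytes.
bool roundTripIsLossless()
{
    std::vector<std::uint16_t> words(256);
    for (int b = 0; b < 256; ++b)
        words[b] = static_cast<std::uint16_t>(b * 0x101);

    const std::vector<std::uint8_t> bytes = wordsToBytes(words);
    for (int b = 0; b < 256; ++b)
        if (bytes[b] != b)
            return false;
    return true;
}
```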

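I have been looking for compiler intrinsics to speed up this conversion, but did not find anything useful. Are there other ways to improve the performance of this conversion?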

Answer 1

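After compiling your code, I tried two things (I just renamed bytesToWords() to groupedBytesToWords() for now):

Testing your two functions: they do not produce the same results. With simpleBytesToWords(), I get an array filled with zeros. With groupedBytesToWords(), I end up with an alternation of valid results and zeros.
Without changing them, and assuming that a bug fix would not change their complexity, I tried a third function that I wrote myself. It uses a precalculated uint8_t -> uint16_t table that has to be built once up front: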

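Here is the table. It is small, since it has only 256 entries, one for each possible uint8_t value:

// Build a precalculation table for each possible uint8_t -> uint16_t conversion.
// max() is 255, so the table needs max() + 1 entries to cover 0..255.
const size_t sizeTable(std::numeric_limits<uint8_t>::max() + 1);

uint16_t * precalc_table = new uint16_t[sizeTable];

for (size_t i = 0; i < sizeTable; ++i)
{
    precalc_table[i] = static_cast<uint16_t>(i * 0x101);
}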

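The third function I tried is the following:

// Note: every ptr[i] must already hold a value in [0, 255] (one 8-bit sample per
// uint16_t element); anything larger would index past the end of the table.
void hopefullyFastBytesToWords(uint16_t *ptr, size_t pixelCount, uint16_t const * precalc_table)
{
    for (size_t i = 0; i < pixelCount; ++i)
    {
        ptr[i] = precalc_table[ptr[i]];
    }
}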

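I tested it, of course, and it produces results that match the description in your original post. The function is called with the same arguments we passed to the other two functions, plus the precalculated conversion table: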

hopefullyFastBytesToWords(buffer, sizeBuf, precalc_table);
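hopefullyFastBytesToWords() indexes the table with values that are already stored one per uint16_t element. If the buffer instead holds packed bytes, as described in the question, the same table idea can be applied in place by walking backwards. A minimal sketch under that assumption follows; `makeTable` and `tableBytesToWords` are hypothetical names, and this variant was not part of the timings in this answer:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: 256-entry table, one entry per possible uint8_t value.
std::array<std::uint16_t, 256> makeTable()
{
    std::array<std::uint16_t, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = static_cast<std::uint16_t>(i * 0x101);
    return table;
}

// Expands pixelCount packed bytes at the start of buffer into 16-bit words, in place.
// The buffer must have room for pixelCount * 2 bytes.
void tableBytesToWords(void* buffer, std::size_t pixelCount,
                       const std::array<std::uint16_t, 256>& table)
{
    auto* bytes = static_cast<unsigned char*>(buffer);
    // Walk from the end so each word is written only after its source byte was read.
    for (std::size_t i = pixelCount; i-- > 0;)
    {
        const std::uint16_t word = table[bytes[i]];
        std::memcpy(bytes + 2 * i, &word, sizeof word);  // memcpy sidesteps aliasing concerns
    }
}
```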

I then ran some comparisons using an array of 500000000 uint16_t elements, initially filled with random uint8_t values. Here is an example using the simpleBytesToWords() you wrote:

fillBuffer(buffer, sizeBuf);
begin = clock();
simpleBytesToWords(buffer, sizeBuf);
end = clock();
std::cout << "simpleBytesToWords(): " << (double(end - begin) / CLOCKS_PER_SEC) << std::endl;
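The snippet relies on a fillBuffer() helper and timing variables that are not shown. A rough sketch of what such a harness might look like follows, assuming std::chrono::steady_clock for wall-clock timing instead of clock(); the `fillBuffer` shown here and `timeSeconds` are hypothetical stand-ins, not the code used for the measurements:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in: fill the buffer with random values in [0, 255].
void fillBuffer(std::vector<std::uint16_t>& buffer)
{
    std::mt19937 rng(12345);  // fixed seed so that runs are comparable
    std::uniform_int_distribution<int> dist(0, 255);
    for (auto& v : buffer)
        v = static_cast<std::uint16_t>(dist(rng));
}

// Runs fn once and returns the elapsed wall-clock time in seconds.
template <typename Fn>
double timeSeconds(Fn&& fn)
{
    const auto begin = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}
```

Refilling the buffer before each timed call, as the snippet above does, keeps every function working on fresh 8-bit input.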

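I obtained the following results (as you can see, I used a small, slow laptop). These are three sample runs, but they consistently produce values of similar magnitude: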

$ Sandbox.exe
simpleBytesToWords(): 0.681
groupedBytesToWords(): 1.2
hopefullyFastBytesToWords(): 0.461

$ Sandbox.exe
simpleBytesToWords(): 0.737
groupedBytesToWords(): 1.251
hopefullyFastBytesToWords(): 0.414

$ Sandbox.exe
simpleBytesToWords(): 0.582
groupedBytesToWords(): 1.173
hopefullyFastBytesToWords(): 0.436

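This is certainly not a real, representative benchmark, but it shows that your "grouped" function is slower on my machine, which is inconsistent with the results you obtained. It also shows that precalculating the multiplications, rather than doing the conversion/multiplication on the fly, helps.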
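On the question's point about intrinsics: on x86 with SSE2, interleaving a byte vector with itself via _mm_unpacklo_epi8 / _mm_unpackhi_epi8 yields byte * 257 in each 16-bit lane, and _mm_srli_epi16 followed by _mm_packus_epi16 goes the other way. The following is a minimal, unbenchmarked sketch, assuming x86 with SSE2 and separate source and destination buffers (not the in-place layout of the question). The function names and the HAVE_SSE2 macro are hypothetical, and a scalar loop is used where SSE2 is unavailable:

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1  // hypothetical feature macro for this sketch
#endif

// Hypothetical out-of-place variant: dst[i] = src[i] * 257.
void sseBytesToWords(const std::uint8_t* src, std::uint16_t* dst, std::size_t count)
{
    std::size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 16 <= count; i += 16)
    {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // Interleaving v with itself puts each byte in both halves of a 16-bit lane.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_unpacklo_epi8(v, v));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i + 8), _mm_unpackhi_epi8(v, v));
    }
#endif
    for (; i < count; ++i)  // scalar tail (or the whole array without SSE2)
        dst[i] = static_cast<std::uint16_t>(src[i] * 0x101);
}

// Hypothetical out-of-place variant: dst[i] = src[i] / 256.
void sseWordsToBytes(const std::uint16_t* src, std::uint8_t* dst, std::size_t count)
{
    std::size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 16 <= count; i += 16)
    {
        __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 8));
        lo = _mm_srli_epi16(lo, 8);  // keep the high byte of each word
        hi = _mm_srli_epi16(hi, 8);
        // Every lane is now <= 255, so the saturating pack cannot clip anything.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_packus_epi16(lo, hi));
    }
#endif
    for (; i < count; ++i)  // scalar tail (or the whole array without SSE2)
        dst[i] = static_cast<std::uint8_t>(src[i] >> 8);
}
```

Whether this beats the table lookup would have to be measured on the target machine.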

Converting an array of bytes (uint8_t) to an array of words (uint16_t) and vice versa