Question

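According to cachegrind, this checksum routine is one of the largest contributors to instruction-cache loads and instruction-cache misses in the entire application: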

#include <stdint.h>

namespace {

uint32_t OnesComplementSum(const uint16_t * b16, int len)  {
    uint32_t sum = 0;

    uint32_t a = 0;
    uint32_t b = 0;
    uint32_t c = 0;
    uint32_t d = 0;

    // helper for the loop unrolling
    auto run8 = [&] {
        a += b16[0];
        b += b16[1];
        c += b16[2];
        d += b16[3];
        b16 += 4;
    };

    for (;;) {
        if (len > 32) {
            run8();
            run8();
            run8();
            run8();
            len -= 32;
            continue;
        }

        if (len > 8) {
            run8();
            len -= 8;
            continue;
        }
        break;
    }

    sum += (a + b) + (c + d);

    auto reduce = [&]() {
        sum = (sum & 0xFFFF) + (sum >> 16);
        if (sum > 0xFFFF) sum -= 0xFFFF;
    };

    reduce();

    while ((len -= 2) >= 0) sum += *b16++;

    if (len == -1) sum += *(const uint8_t *)b16; // add the last byte

    reduce();

    return sum;
}    

} // anonymous namespace     

uint32_t get(const uint16_t* data, int length)
{
    return OnesComplementSum(data, length);
}

See asm output here.

It may be caused by the loop unrolling, but the generated object code doesn't look excessive.

How can I improve the code?

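Update

Because the checksum function is in an anonymous namespace, it was inlined and duplicated by two functions that reside in the same cpp file.
The loop unrolling is still beneficial. Removing it slowed the code down.
Improving the infinite loop speeds up the code (but for some reason I get the opposite result on my Mac).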
- Before fixes: here you can see the two checksums and 17210 L1 IR misses
- After fixes: after fixing the inlining problem and fixing the infinite loop the L1 instruction cache misses dropped to 8324.
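- "InstructionFetch" is higher in the fixed example. I'm not sure how to interpret that. Does it simply mean that this is where most of the activity occurs? Or does it hint at a problem?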
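The update does not say how the inlining problem was fixed. One possible way, shown here only as a sketch, is to mark the helper as non-inlinable so that both callers share a single copy of its machine code. The macro `CHECKSUM_NOINLINE`, the stand-in `SumWords`, and the two callers `HeaderChecksum` and `PayloadChecksum` are hypothetical names, not taken from the question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical macro wrapping the compiler-specific "do not inline" marker.
#if defined(__GNUC__) || defined(__clang__)
#  define CHECKSUM_NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define CHECKSUM_NOINLINE __declspec(noinline)
#else
#  define CHECKSUM_NOINLINE
#endif

namespace {

// Simplified stand-in for the question's OnesComplementSum (assumption:
// the length is given in 16-bit words, not bytes).
CHECKSUM_NOINLINE uint32_t SumWords(const uint16_t* p, int words) {
    assert(words >= 0);
    uint32_t sum = 0;
    for (int i = 0; i < words; ++i) sum += p[i];
    sum = (sum & 0xFFFF) + (sum >> 16);  // fold the carries back in
    if (sum > 0xFFFF) sum -= 0xFFFF;
    return sum;
}

} // anonymous namespace

// Two hypothetical callers in the same translation unit. Without the
// attribute, the optimizer is free to inline SumWords into both of them.
uint32_t HeaderChecksum(const uint16_t* p, int words) { return SumWords(p, words); }
uint32_t PayloadChecksum(const uint16_t* p, int words) { return SumWords(p, words); }
```

Whether this helps depends on the call pattern: a shared copy reduces code size, but each call then pays the call overhead, so it is worth measuring both variants.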

Answer 1

Replace the main loop with the following code:

const int quick_len=len/8;
const uint16_t * const the_end=b16+quick_len*4;
len -= quick_len*8;
for (; b16+4 <= the_end; b16+=4)
{
    a += b16[0];
    b += b16[1];
    c += b16[2];
    d += b16[3];
}

With -O3, manual loop unrolling does not seem to be necessary.
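For context, this is roughly how the replacement loop fits into the rest of the question's function. It is a sketch, not the answer's exact code: the name `OnesComplementSumClean` is hypothetical, the loop condition is written as `b16 < the_end` (equivalent here, because the distance is a multiple of 4), and the tail handling is copied from the question.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: the question's routine with the hand-unrolled main loop replaced
// by a plain four-accumulator loop. `len` is a byte count, as in the question.
uint32_t OnesComplementSumClean(const uint16_t* b16, int len) {
    assert(len >= 0);
    uint32_t a = 0, b = 0, c = 0, d = 0;

    const int quick_len = len / 8;                       // whole 8-byte blocks
    const uint16_t* const the_end = b16 + quick_len * 4; // 4 words per block
    len -= quick_len * 8;                                // 0..7 bytes remain
    for (; b16 < the_end; b16 += 4) {
        a += b16[0];
        b += b16[1];
        c += b16[2];
        d += b16[3];
    }

    uint32_t sum = (a + b) + (c + d);
    auto reduce = [&] {
        sum = (sum & 0xFFFF) + (sum >> 16);
        if (sum > 0xFFFF) sum -= 0xFFFF;
    };
    reduce();

    while ((len -= 2) >= 0) sum += *b16++;               // remaining whole words
    if (len == -1) sum += *reinterpret_cast<const uint8_t*>(b16); // last byte
    reduce();
    return sum;
}
```

Like the original, the 32-bit accumulators can wrap for very large inputs, so this sketch assumes buffers of modest size.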

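Also, the test case allows too much optimization, because the input is static and the result is unused. Printing the result also helps verify that the optimized version doesn't break anything.

The full test I used: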

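#include <chrono>
#include <iostream>

// `s` (the test string), `loop` (the label printed below) and the
// three-argument OnesComplementSum are defined elsewhere in the test program.
int main(int argc, char *argv[])
{
    using namespace std::chrono;
    auto start_time = steady_clock::now();
    // Offsetting by argc makes the input depend on a run-time value.
    int ret=OnesComplementSum((const uint8_t*)(s.data()+argc), s.size()-argc, 0);
    auto elapsed_ns = duration_cast<nanoseconds>(steady_clock::now() - start_time).count();

    // Printing the result keeps it from being optimized away.
    std::cout << "loop=" << loop << " elapsed_ns=" << elapsed_ns << " = " << ret<< std::endl;

    return ret;
}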

Comparing this version (CLEAN LOOP) with your improved version (UGLY LOOP), using a longer test string:

loop=CLEAN_LOOP  elapsed_ns=8365  =  14031
loop=CLEAN_LOOP  elapsed_ns=5793  =  14031
loop=CLEAN_LOOP  elapsed_ns=5623  =  14031
loop=CLEAN_LOOP  elapsed_ns=5585  =  14031
loop=UGLY_LOOP   elapsed_ns=9365  =  14031
loop=UGLY_LOOP   elapsed_ns=8957  =  14031
loop=UGLY_LOOP   elapsed_ns=8877  =  14031
loop=UGLY_LOOP   elapsed_ns=8873  =  14031

Verified here: http://coliru.stacked-crooked.com/a/52d670039de17943

Edit

In fact, the whole function can be simplified to:

uint32_t OnesComplementSum(const uint8_t* inData, int len, uint32_t sum)
{
    const uint16_t * b16 = reinterpret_cast<const uint16_t *>(inData);
    const uint16_t * const the_end=b16+len/2;
    for (; b16 < the_end; ++b16)
    {
        sum += *b16;
    }
    sum = (sum & uint16_t(-1)) + (sum >> 16);
    return (sum > uint16_t(-1)) ? sum - uint16_t(-1) : sum;
}

This is better than the OP's version with -O3, but worse with -O2:
http://coliru.stacked-crooked.com/a/bcca1e94c2f394c7

loop=CLEAN_LOOP  elapsed_ns=5825  =  14031
loop=CLEAN_LOOP  elapsed_ns=5717  =  14031
loop=CLEAN_LOOP  elapsed_ns=5681  =  14031
loop=CLEAN_LOOP  elapsed_ns=5646  =  14031
loop=UGLY_LOOP   elapsed_ns=9201  =  14031
loop=UGLY_LOOP   elapsed_ns=8826  =  14031
loop=UGLY_LOOP   elapsed_ns=8859  =  14031
loop=UGLY_LOOP   elapsed_ns=9582  =  14031

So your mileage may vary; unless the exact architecture is known, I would go with the simpler version.
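Note that the simplified function above ignores a trailing odd byte and reads 16-bit words through a cast from a byte pointer. If the buffer may be unaligned or of odd length, a variant along the following lines could be used. This is a sketch under those assumptions: the name `OnesComplementSumBytes` is hypothetical, words are read in host byte order as in the original, and the last byte is added the same way the question's code does it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch: simple one's-complement sum over a byte buffer that makes no
// alignment assumption and includes a trailing odd byte.
uint32_t OnesComplementSumBytes(const uint8_t* data, int len, uint32_t sum) {
    assert(len >= 0);
    const uint8_t* const end = data + (len & ~1);  // end of the whole words
    for (; data < end; data += 2) {
        uint16_t word;
        std::memcpy(&word, data, sizeof word);     // safe unaligned load
        sum += word;
    }
    if (len & 1) sum += *data;                     // add the last byte
    sum = (sum & 0xFFFF) + (sum >> 16);
    return (sum > 0xFFFF) ? sum - 0xFFFF : sum;
}
```

Compilers typically turn the fixed-size `std::memcpy` into a single load, so this form is not expected to cost much, but that is worth confirming in the generated assembly for the target in question.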

Why does my code cause instruction-cache misses?
