Question

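In the program below, I expect test1 to run slower because of the dependent instructions. A test run with -O2 seemed to confirm this. But then I tried -O3, and now the timings are more or less equal. How can this be?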

#include <iostream>
#include <vector>
#include <cstring>
#include <chrono>
#include <cstdint> // int64_t, used as the return type of benchmark()

volatile int x = 0; // used for preventing certain optimizations


enum { size = 60 * 1000 * 1000 };
std::vector<unsigned> a(size + x); // `size + x` makes the vector size unknown by compiler 
std::vector<unsigned> b(size + x);


void test1()
{
    for (auto i = 1u; i != size; ++i)
    {
        a[i] = a[i] + a[i-1]; // data dependency hinders pipelining(?)
    }
}


void test2()
{
    for (auto i = 0u; i != size; ++i)
    {
        a[i] = a[i] + b[i]; // no data dependencies
    }
}


template<typename F>
int64_t benchmark(F&& f)
{
    auto start_time = std::chrono::high_resolution_clock::now();
    f();
    auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start_time);
    return elapsed_ms.count();
}


int main(int argc, char**)
{   
    // make sure the optimizer cannot make any assumptions
    // about the contents of the vectors:
    for (auto& el : a) el = x;
    for (auto& el : b) el = x;

    test1(); // warmup
    std::cout << "test1: " << benchmark(&test1) << '\n';

    test2(); // warmup        
    std::cout << "\ntest2: " << benchmark(&test2) << '\n';

    return a[x] * x; // prevent optimization and exit with code 0
}

I get these results (elapsed time in milliseconds):

g++-4.8 -std=c++11 -O2 main.cpp && ./a.out
test1: 115
test2: 48

g++-4.8 -std=c++11 -O3 main.cpp && ./a.out
test1: 29
test2: 38

Answer 1

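Because at -O3, gcc effectively eliminates the data dependency through memory: it keeps the value of a[i] in a register and reuses it in the next iteration instead of loading a[i-1] again. The additions still depend on one another, but the chain now runs through a register rather than through a store followed by a reload.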

The result is more or less equivalent to:

void test1()
{
    auto x = a[0];
    auto end = a.begin() + size;
    for (auto it = std::next(a.begin()); it != end; ++it)
    {
        auto y = *it; // Load
        x = y + x;
        *it = x; // Store
    }
}

Compiled with -O2, this version yields exactly the same code as your original test1 compiled with -O3.
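As a sanity check on that rewrite, here is a minimal sketch showing that the reload form and the register-carried form compute the same running sum. The helper names prefix_sum_reload and prefix_sum_carried are made up for illustration and are not part of the original program. Whether a given compiler really keeps the running sum in a register depends on the compiler and its flags.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: mirrors the source-level loop of test1,
// reading v[i-1] back from the vector on every iteration.
std::vector<unsigned> prefix_sum_reload(std::vector<unsigned> v)
{
    for (std::size_t i = 1; i < v.size(); ++i)
    {
        v[i] = v[i] + v[i - 1];
    }
    return v;
}

// Hypothetical helper: mirrors the rewritten loop, carrying the running
// sum in a local variable that the compiler can hold in a register.
std::vector<unsigned> prefix_sum_carried(std::vector<unsigned> v)
{
    if (v.empty()) return v;
    unsigned sum = v[0];
    for (std::size_t i = 1; i < v.size(); ++i)
    {
        sum += v[i]; // one load per element
        v[i] = sum;  // one store per element
    }
    return v;
}
```

For the input {1, 2, 3, 4, 5}, both helpers return {1, 3, 6, 10, 15}.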

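The second loop in your question is unrolled at -O3, hence the speedup. The two optimizations seem unrelated to me: the first case is faster because gcc removed a load instruction, the second because the loop is unrolled.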

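In both cases, I don't think the optimizer does anything in particular to improve cache behavior; the memory access pattern is easy for the CPU to predict.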

Answer 2

Optimizers are very complex pieces of software and are not always predictable.

With g++ 5.2.0 and -O2, test1 and test2 compile to similar machine code:

  ;;;; test1 inner loop
  400c28:   8b 50 fc                mov    -0x4(%rax),%edx
  400c2b:   01 10                   add    %edx,(%rax)
  400c2d:   48 83 c0 04             add    $0x4,%rax
  400c31:   48 39 c1                cmp    %rax,%rcx
  400c34:   75 f2                   jne    400c28 <_Z5test1v+0x18>

  ;;;; test2 inner loop
  400c50:   8b 0c 06                mov    (%rsi,%rax,1),%ecx
  400c53:   01 0c 02                add    %ecx,(%rdx,%rax,1)
  400c56:   48 83 c0 04             add    $0x4,%rax
  400c5a:   48 3d 00 1c 4e 0e       cmp    $0xe4e1c00,%rax
  400c60:   75 ee                   jne    400c50 <_Z5test2v+0x10>

With -O3, however, test1 remains more or less similar:

  ;;;; test1 inner loop
  400d88:   03 10                   add    (%rax),%edx
  400d8a:   48 83 c0 04             add    $0x4,%rax
  400d8e:   89 50 fc                mov    %edx,-0x4(%rax)
  400d91:   48 39 c1                cmp    %rax,%rcx
  400d94:   75 f2                   jne    400d88 <_Z5test1v+0x18>

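test2, on the other hand, explodes into a seemingly unrolled version that uses xmm registers, with completely different machine code. The inner loop becomes: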

  ;;;; test2 inner loop (after a lot of preprocessing)
  400e30:   f3 41 0f 6f 04 00       movdqu (%r8,%rax,1),%xmm0
  400e36:   83 c1 01                add    $0x1,%ecx
  400e39:   66 0f fe 04 07          paddd  (%rdi,%rax,1),%xmm0
  400e3e:   0f 29 04 07             movaps %xmm0,(%rdi,%rax,1)
  400e42:   48 83 c0 10             add    $0x10,%rax
  400e46:   44 39 c9                cmp    %r9d,%ecx
  400e49:   72 e5                   jb     400e30 <_Z5test2v+0x90>

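This loop performs multiple additions per iteration: paddd adds four packed 32-bit integers at once, and the index advances by 0x10 (16) bytes each time.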
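To make "multiple additions per iteration" concrete, here is a rough scalar model of what the four-wide loop does. It is only a sketch: the names add_scalar and add_four_wide are hypothetical, and the code the compiler actually generates uses SSE registers and extra setup code rather than four separate scalar additions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: one addition per iteration, as test2 is written.
void add_scalar(std::vector<unsigned>& a, const std::vector<unsigned>& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i)
    {
        a[i] += b[i];
    }
}

// Hypothetical helper: four independent additions per iteration,
// a scalar stand-in for one 128-bit paddd on four 32-bit lanes.
void add_four_wide(std::vector<unsigned>& a, const std::vector<unsigned>& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    // Tail loop for the remaining 0 to 3 elements.
    for (; i < n; ++i)
    {
        a[i] += b[i];
    }
}
```

This grouping is only possible because no element of test2 depends on another. The same transformation cannot be applied directly to test1, where each result feeds the next addition.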

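If you want to test specific processor behavior, writing directly in assembly is probably a better idea, because a C++ compiler may heavily rewrite what your original source code does.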

Why do these two loops run at the same speed when compiled with -O3, but not when compiled with -O2?
