Are writes faster than reads on x86?

Time: 2016-03-21 17:55:16

Tags: c performance x86 intel

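I observed some very strange read and write access performance on an Intel machine.

I wrote a C program that first allocates an array. The code is at [1]; you can compile it by running make. (I did not use any compiler optimizations.)

The program performs the following operations, in order: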

allocate a char array;
init each element of array to be 1;
use clflush to flush the whole array from cache;
read each cache line of the array by using tmp = array[i];
(Do simple calculation after reading each cache line)
use clflush to flush the whole array from cache;
write each cache line of the array by using array[i] = tmp;
(Do the same simple calculation after writing each cache line)
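The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from [1]: the names `flush_array`, `read_lines`, and `write_lines` are hypothetical, the 64-byte line size is an assumption taken from the `0x40` stride in the disassembly, and the modulo-65535 calculation is inferred from that disassembly. `volatile` keeps the accesses in place even if the sketch is compiled with optimization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

#define CACHE_LINE 64    /* assumed line size in bytes */

/* Evict every cache line of buf[0..len) from the cache hierarchy. */
static void flush_array(char *buf, size_t len)
{
    assert(buf != NULL);
#if HAVE_CLFLUSH
    for (size_t i = 0; i < len; i += CACHE_LINE)
        _mm_clflush(buf + i);
    _mm_mfence();        /* order the flushes before the accesses that follow */
#else
    (void)len;           /* no portable equivalent: flushing is skipped here */
#endif
}

/* Read one byte per cache line and fold it into a running sum. */
static int read_lines(const volatile char *buf, size_t len)
{
    int sum = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE) {
        char tmp = buf[i];            /* tmp = array[i] */
        sum = (sum + tmp) % 65535;    /* the "simple calculation" */
    }
    return sum;
}

/* Write one byte per cache line, then read it back for the same calculation,
 * which is what the write_array disassembly appears to do. */
static int write_lines(volatile char *buf, size_t len, char tmp)
{
    int sum = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE) {
        buf[i] = tmp;                 /* array[i] = tmp */
        sum = (sum + buf[i]) % 65535;
    }
    return sum;
}
```

A caller would allocate the array, fill it with 1 using `memset`, and call `flush_array` before each of `read_lines` and `write_lines`, timing each pass separately.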

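I ran the program on an Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz (Haswell arch.), with Turbo Boost disabled.

The command I used to run the program is: sudo ./rw-latency-test-compute 5210 10 1

I measured a read latency of 6670us for the whole array and a write latency of 3518us for the whole array.

The interesting part is that if I do not do any calculation after reading/writing a cache line, the read latency of the whole array is 2175us, while the write latency of the whole array is 3687us.

So adding the calculation seems to speed up the writes (3687us down to 3518us), even though it slows the reads considerably... :-(

Do you have any suggestions or explanations for this strange behavior?

The full assembly code of the program can be found at [2].

The assembly code of the inner loops is as follows: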

0000000000400898 <read_array>:
  400898:   55                      push   %rbp
  400899:   48 89 e5                mov    %rsp,%rbp
  40089c:   53                      push   %rbx
  40089d:   48 83 ec 28             sub    $0x28,%rsp
  4008a1:   48 89 7d d8             mov    %rdi,-0x28(%rbp)
  4008a5:   48 89 75 d0             mov    %rsi,-0x30(%rbp)
  4008a9:   c7 45 e8 00 00 00 00    movl   $0x0,-0x18(%rbp)
  4008b0:   c7 45 e4 00 00 00 00    movl   $0x0,-0x1c(%rbp)
  4008b7:   eb 58                   jmp    400911 <read_array+0x79>
  4008b9:   b8 00 00 00 00          mov    $0x0,%eax
  4008be:   e8 38 ff ff ff          callq  4007fb <sw_barrier>
  4008c3:   8b 45 e4                mov    -0x1c(%rbp),%eax
  4008c6:   48 98                   cltq   
  4008c8:   48 03 45 d8             add    -0x28(%rbp),%rax
  4008cc:   0f b6 00                movzbl (%rax),%eax
  4008cf:   88 45 ef                mov    %al,-0x11(%rbp)
  4008d2:   0f be 45 ef             movsbl -0x11(%rbp),%eax
  4008d6:   89 c1                   mov    %eax,%ecx
  4008d8:   03 4d e8                add    -0x18(%rbp),%ecx
  4008db:   ba 01 80 00 80          mov    $0x80008001,%edx
  4008e0:   89 c8                   mov    %ecx,%eax
  4008e2:   f7 ea                   imul   %edx
  4008e4:   8d 04 0a                lea    (%rdx,%rcx,1),%eax
  4008e7:   89 c2                   mov    %eax,%edx
  4008e9:   c1 fa 0f                sar    $0xf,%edx
  4008ec:   89 c8                   mov    %ecx,%eax
  4008ee:   c1 f8 1f                sar    $0x1f,%eax
  4008f1:   89 d3                   mov    %edx,%ebx
  4008f3:   29 c3                   sub    %eax,%ebx
  4008f5:   89 d8                   mov    %ebx,%eax
  4008f7:   89 45 e8                mov    %eax,-0x18(%rbp)
  4008fa:   8b 55 e8                mov    -0x18(%rbp),%edx
  4008fd:   89 d0                   mov    %edx,%eax
  4008ff:   c1 e0 10                shl    $0x10,%eax
  400902:   29 d0                   sub    %edx,%eax
  400904:   89 ca                   mov    %ecx,%edx
  400906:   29 c2                   sub    %eax,%edx
  400908:   89 d0                   mov    %edx,%eax
  40090a:   89 45 e8                mov    %eax,-0x18(%rbp)
  40090d:   83 45 e4 40             addl   $0x40,-0x1c(%rbp)
  400911:   8b 45 e4                mov    -0x1c(%rbp),%eax
  400914:   48 98                   cltq   
  400916:   48 3b 45 d0             cmp    -0x30(%rbp),%rax
  40091a:   7c 9d                   jl     4008b9 <read_array+0x21>
  40091c:   b8 e1 0f 40 00          mov    $0x400fe1,%eax
  400921:   8b 55 e8                mov    -0x18(%rbp),%edx
  400924:   89 d6                   mov    %edx,%esi
  400926:   48 89 c7                mov    %rax,%rdi
  400929:   b8 00 00 00 00          mov    $0x0,%eax
  40092e:   e8 3d fd ff ff          callq  400670 <printf@plt>
  400933:   48 83 c4 28             add    $0x28,%rsp
  400937:   5b                      pop    %rbx
  400938:   5d                      pop    %rbp
  400939:   c3                      retq   

000000000040093a <write_array>:
  40093a:   55                      push   %rbp
  40093b:   48 89 e5                mov    %rsp,%rbp
  40093e:   53                      push   %rbx
  40093f:   48 83 ec 28             sub    $0x28,%rsp
  400943:   48 89 7d d8             mov    %rdi,-0x28(%rbp)
  400947:   48 89 75 d0             mov    %rsi,-0x30(%rbp)
  40094b:   c6 45 ef 01             movb   $0x1,-0x11(%rbp)
  40094f:   c7 45 e8 00 00 00 00    movl   $0x0,-0x18(%rbp)
  400956:   c7 45 e4 00 00 00 00    movl   $0x0,-0x1c(%rbp)
  40095d:   eb 63                   jmp    4009c2 <write_array+0x88>
  40095f:   b8 00 00 00 00          mov    $0x0,%eax
  400964:   e8 92 fe ff ff          callq  4007fb <sw_barrier>
  400969:   8b 45 e4                mov    -0x1c(%rbp),%eax
  40096c:   48 98                   cltq   
  40096e:   48 03 45 d8             add    -0x28(%rbp),%rax
  400972:   0f b6 55 ef             movzbl -0x11(%rbp),%edx
  400976:   88 10                   mov    %dl,(%rax)
  400978:   8b 45 e4                mov    -0x1c(%rbp),%eax
  40097b:   48 98                   cltq   
  40097d:   48 03 45 d8             add    -0x28(%rbp),%rax
  400981:   0f b6 00                movzbl (%rax),%eax
  400984:   0f be c0                movsbl %al,%eax
  400987:   89 c1                   mov    %eax,%ecx
  400989:   03 4d e8                add    -0x18(%rbp),%ecx
  40098c:   ba 01 80 00 80          mov    $0x80008001,%edx
  400991:   89 c8                   mov    %ecx,%eax
  400993:   f7 ea                   imul   %edx
  400995:   8d 04 0a                lea    (%rdx,%rcx,1),%eax
  400998:   89 c2                   mov    %eax,%edx
  40099a:   c1 fa 0f                sar    $0xf,%edx
  40099d:   89 c8                   mov    %ecx,%eax
  40099f:   c1 f8 1f                sar    $0x1f,%eax
  4009a2:   89 d3                   mov    %edx,%ebx
  4009a4:   29 c3                   sub    %eax,%ebx
  4009a6:   89 d8                   mov    %ebx,%eax
  4009a8:   89 45 e8                mov    %eax,-0x18(%rbp)
  4009ab:   8b 55 e8                mov    -0x18(%rbp),%edx
  4009ae:   89 d0                   mov    %edx,%eax
  4009b0:   c1 e0 10                shl    $0x10,%eax
  4009b3:   29 d0                   sub    %edx,%eax
  4009b5:   89 ca                   mov    %ecx,%edx
  4009b7:   29 c2                   sub    %eax,%edx
  4009b9:   89 d0                   mov    %edx,%eax
  4009bb:   89 45 e8                mov    %eax,-0x18(%rbp)
  4009be:   83 45 e4 40             addl   $0x40,-0x1c(%rbp)
  4009c2:   8b 45 e4                mov    -0x1c(%rbp),%eax
  4009c5:   48 98                   cltq   
  4009c7:   48 3b 45 d0             cmp    -0x30(%rbp),%rax
  4009cb:   7c 92                   jl     40095f <write_array+0x25>
  4009cd:   b8 ee 0f 40 00          mov    $0x400fee,%eax
  4009d2:   8b 55 e8                mov    -0x18(%rbp),%edx
  4009d5:   89 d6                   mov    %edx,%esi
  4009d7:   48 89 c7                mov    %rax,%rdi
  4009da:   b8 00 00 00 00          mov    $0x0,%eax
  4009df:   e8 8c fc ff ff          callq  400670 <printf@plt>
  4009e4:   48 83 c4 28             add    $0x28,%rsp
  4009e8:   5b                      pop    %rbx
  4009e9:   5d                      pop    %rbp
  4009ea:   c3                      retq   
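For readers following the listing: the `imul` by `0x80008001`, the `sar $0xf`, and the `shl $0x10` / `sub` sequence in both loops look like the compiler's usual multiply-by-reciprocal replacement for a signed `% 65535`. The sketch below mirrors those instructions in C to check that reading; `mod65535_magic` and `check_mod65535` are hypothetical names, and the code assumes that `>>` on a negative value is an arithmetic shift, as it is with GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* n % 65535 computed the way the disassembly does it, without a divide. */
static int32_t mod65535_magic(int32_t n)
{
    /* imul %edx: signed 32x32 -> 64-bit product; the high half lands in %edx.
     * -2147450879 is 0x80008001 interpreted as a signed 32-bit value. */
    int64_t prod = (int64_t)n * INT64_C(-2147450879);
    int32_t hi = (int32_t)(prod >> 32);

    /* lea (%rdx,%rcx,1),%eax: add n back to compensate for the negative constant. */
    int32_t t = hi + n;

    /* sar $0xf, sar $0x1f, sub: quotient, rounded toward zero. */
    int32_t q = (t >> 15) - (n >> 31);

    /* shl $0x10, sub, sub: remainder = n - q * 65535 (the asm uses (q << 16) - q). */
    return n - q * 65535;
}

/* Compare the instruction-level version against the plain % operator. */
static void check_mod65535(int32_t lo, int32_t hi)
{
    for (int64_t n = lo; n <= hi; n++)
        assert(mod65535_magic((int32_t)n) == (int32_t)n % 65535);
}
```

If this reading is right, the "simple calculation" in each loop iteration is `sum = (sum + value) % 65535`, a dependent chain of roughly a dozen integer instructions plus two stores of `sum` to the stack.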

[1] https://github.com/PennPanda/rw-latency-test/blob/master/rw-latency-test-compute.c

[2] https://github.com/PennPanda/rw-latency-test/blob/2da88f1cccba40aba155317567199028b28bd250/rw-latency-test-compute.asm

1 answer:

Answer 0 (score: 4)

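Writes are faster than reads because if you read from RAM and actually use the value (that is, you do not just read it and discard it), the processor has to stall until the read completes before it can use the value. Writes, by contrast, are performed asynchronously: the store goes into the store buffer and the core keeps executing, so it does not stall unless that buffer fills up.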
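One way to probe this explanation is to time a loop whose loads feed a calculation against a loop that only issues stores. The sketch below is a hedged illustration, not the asker's benchmark: `load_and_use`, `store_only`, and `time_us` are hypothetical names, the 64-byte stride is an assumed cache-line size, and `clock()` measures CPU time with limited resolution, so meaningful numbers need a large array that has been flushed from the cache first.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Load one byte per line; each loaded value feeds a dependent calculation,
 * so the loop cannot make progress until the data arrives. */
static int load_and_use(volatile char *buf, size_t len)
{
    int sum = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE)
        sum = (sum + buf[i]) % 65535;
    return sum;
}

/* Store one byte per line; nothing in the loop consumes the store's result. */
static int store_only(volatile char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += CACHE_LINE)
        buf[i] = 1;
    return 0;
}

/* Run fn once over buf and return the elapsed CPU time in microseconds. */
static double time_us(int (*fn)(volatile char *, size_t),
                      volatile char *buf, size_t len, int *result)
{
    assert(buf != NULL && result != NULL);
    clock_t start = clock();
    *result = fn(buf, len);
    clock_t end = clock();
    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}
```

Calling `time_us(load_and_use, buf, len, &r)` and `time_us(store_only, buf, len, &r)` on the same flushed buffer gives the two figures to compare; the result is returned through `r` so the compiler cannot discard the loop.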