Question

I've seen many core dumps in my life, but this one has me stumped.

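Context:

The program runs on AMD Barcelona CPUs
The code that crashes is executed a lot
Running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
The crashes happen on different machines (but the machines themselves are practically identical)
The crashes all look the same (same address, same call stack)

Here are the details of the crash: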

Program terminated with signal 11, Segmentation fault.
#0  0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)

(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>:    mov    (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>:    mov    %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>:    callq  *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>:    cmp    %eax,%r12d
0x17bd9fc <_Z3Foov+348>:    mov    %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>:    jge    0x17bd97e <_Z3Foov+222>

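You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, which comes after the return from the virtual function call at 0x17bd9f6.

When I examine the virtual table, I see that it is not corrupted in any way: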

(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>:  0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected from looking at the source):

(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
   0x0000000002d3d7b0 <+0>: push   %rbp
   0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
   0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
   0x0000000002d3d7b7 <+7>: leaveq 
   0x0000000002d3d7b8 <+8>: retq   
End of assembler dump.

Further, when I look at the return address that Foo1::Get() should have returned to:

(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if, during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
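The address arithmetic behind that observation can be checked with a short sketch. The addresses are copied from the gdb output above; the helper name `containing_instruction` is made up for illustration.

```python
# Instruction start addresses copied from the `x/6i $pc-12` listing above.
boundaries = [0x17bd9f1, 0x17bd9f3, 0x17bd9f6, 0x17bd9f9, 0x17bd9fc, 0x17bd9ff]

crash_pc = 0x17bd9fd      # faulting $pc reported by gdb
return_addr = 0x17bd9f9   # value found at $rsp-8


def containing_instruction(pc, starts):
    """Return the start address of the instruction that contains pc."""
    return max(s for s in starts if s <= pc)


start = containing_instruction(crash_pc, boundaries)

print(hex(start), crash_pc - start)   # instruction start, offset into it
print(crash_pc in boundaries)         # is $pc on an instruction boundary?
print(crash_pc - return_addr)         # distance from the expected return address
```

With these inputs the faulting $pc sits one byte into the instruction at 0x17bd9fc and four bytes past the saved return address.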

Any plausible explanations?

Answer 1

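So, unlikely as it may seem, we appear to have hit an actual, bona fide CPU bug.

http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf lists erratum #721:

721 Processor May Incorrectly Update Stack Pointer

Description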

Under a highly specific and detailed set of internal timing conditions,
the processor may incorrectly update the stack pointer after a long series
of push and/or near-call instructions, or a long series of pop 
and/or near-return instructions. The processor must be in 64-bit mode for
this erratum to occur.

Potential Effect on System

The stack pointer value jumps by a value of approximately 1024, either in
the positive or negative direction.
This incorrect stack pointer causes unpredictable program or system behavior,
usually observed as a program exception or crash (for example, a #GP or #UD).

Answer 2

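I once saw an "illegal opcode" crash right in the middle of an instruction. I was working on a Linux port. Long story short, Linux subtracts from the instruction pointer in order to restart a syscall, and in my case this was happening twice (if two signals arrived at the same time).

So that's one possible culprit: the kernel fiddling with your instruction pointer. There may be some other cause in your case.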
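A minimal sketch of that double-restart failure mode, under assumptions that are not from my original case: it uses x86-64, where the `syscall` instruction is two bytes long, and models the restart as backing the saved instruction pointer up by that length. The address and the function name `restart_syscall` are hypothetical.

```python
SYSCALL_LEN = 2  # assumption: x86-64 `syscall` (bytes 0f 05) is two bytes long


def restart_syscall(saved_ip):
    """Back the saved instruction pointer up so the syscall runs again."""
    return saved_ip - SYSCALL_LEN


syscall_addr = 0x1000                   # hypothetical address of the syscall instruction
ip_after = syscall_addr + SYSCALL_LEN   # saved ip points just past the syscall

once = restart_syscall(ip_after)   # correct: back on the syscall instruction
twice = restart_syscall(once)      # bug: two bytes before it

print(hex(once))
print(hex(twice))
```

After the second adjustment the instruction pointer lands inside whatever precedes the syscall, which is how execution can resume in the middle of an instruction.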

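Bear in mind that the processor will sometimes interpret data as instructions, even when it isn't supposed to. So the processor may have executed the "instruction" at 0x17bd9fa, then moved on to 0x17bd9fd, and then generated an illegal opcode exception. (I just made that number up, but experimenting with a disassembler can show you where the processor might have "entered" the instruction stream.)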
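As a rough illustration of decoding from a misaligned address, here is a hand-rolled sketch rather than a real disassembler. The bytes are an assumption: they are reconstructed from the question's listing (`mov %eax,-0x80(%rbp)` followed by `jge 0x17bd97e`), not dumped from the core file. The decoder handles only the single encoding needed here.

```python
# Bytes starting at 0x17bd9fc, reconstructed by hand from the listing
# (assumption): 89 45 80 = mov %eax,-0x80(%rbp); 0f 8d 79 ff ff ff = jge.
BASE = 0x17bd9fc
CODE = bytes.fromhex("894580 0f8d79ffffff")

GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_at(addr):
    """Decode one instruction at addr.

    Only handles an optional REX prefix followed by opcode 0x80
    (8-bit group-1 op with imm8) and a plain (%reg) memory operand.
    """
    i = addr - BASE
    rex = 0
    if 0x40 <= CODE[i] <= 0x4F:      # REX prefix byte
        rex = CODE[i]
        i += 1
    if CODE[i] != 0x80:
        raise ValueError("opcode not handled by this sketch")
    modrm = CODE[i + 1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):     # SIB and displacement forms not handled
        raise ValueError("addressing mode not handled by this sketch")
    base = REGS64[rm | ((rex & 1) << 3)]   # REX.B extends the base register
    imm = CODE[i + 2]
    return f"{GROUP1[reg]}b $0x{imm:x},(%{base})"


print(decode_at(0x17bd9fd))
```

Decoding one byte past the start of the `mov` yields the same `orb $0x8d,(%r15)` that gdb printed at the faulting $pc, which supports the reading that execution entered the stream mid-instruction.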

Happy debugging!

"Unexplainable" core dump
