Question

I'm porting my algorithm to ml64 assembly, half for sport and half to see how much performance I can actually gain.

Anyway, I'm currently trying to understand stack frame setup. As I understand it, it looks like this example:

push rbp        ; inherited, base pointer of caller, pushed on stack for storage
mov rbp, rsp    ; inherited, base pointer of the callee, moved to rbp for use as base pointer
sub rsp, 32     ; intel guide says each frame must reserve 32 bytes for the storage of the
                ; 4 arguments usually passed through registers
and spl, -16    ; 16 byte alignment?


mov rsp, rbp    ; put your base pointer back in the callee register
pop rbp         ; restore callers base pointer

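The two things I don't get are:

How does subtracting 32 from RSP do anything at all? As far as I know, apart from its role in going from one stack frame to another, it's just another register, right? I suspect the space is meant for another stack frame rather than the current one.
What is SPL, and why does masking it make something 16-byte aligned?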

Answer 1

push rbp        ;save non-volatile rbp
mov rbp, rsp    ;save old stack
sub rsp, 32     ;reserve space for 32 bytes of local variables = 8 integers
                ;or 4 pointers.
                ;this is per the MS/Intel guides. You can use this as temp
                ;storage for the parameters or for local variables.
and spl, -16    ;align stack by 16 bytes (for sse code)


mov rsp, rbp    ;restore the old stack
pop rbp         ;restore rbp

How does subtracting 32 from RSP do anything at all?

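RSP is the stack pointer, not just another register. Anything you do to it affects the stack. In this case, subtracting 32 reserves 8 x 4 = 32 bytes of stack space in which local variables can be placed.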
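To make the bookkeeping concrete, here is a minimal C sketch that models only the two registers touched by the prologue and epilogue above. The `Regs` type and the `prologue`/`epilogue` helpers are hypothetical names invented for this illustration, and the saved RBP is passed around in a variable instead of real stack memory.

```c
#include <stdint.h>

/* Hypothetical model of the two registers involved; not a real API. */
typedef struct {
    uint64_t rsp;  /* stack pointer */
    uint64_t rbp;  /* frame (base) pointer */
} Regs;

/* Models: push rbp / mov rbp, rsp / sub rsp, 32.
   *saved_rbp receives the value that push would store at [rsp]. */
static Regs prologue(Regs r, uint64_t *saved_rbp)
{
    r.rsp -= 8;           /* push rbp: make room for 8 bytes ...       */
    *saved_rbp = r.rbp;   /* ... and store the caller's rbp there      */
    r.rbp = r.rsp;        /* mov rbp, rsp                              */
    r.rsp -= 32;          /* sub rsp, 32: reserve 32 bytes below rbp   */
    return r;
}

/* Models: mov rsp, rbp / pop rbp. */
static Regs epilogue(Regs r, uint64_t saved_rbp)
{
    r.rsp = r.rbp;        /* mov rsp, rbp: discard the reserved bytes  */
    r.rbp = saved_rbp;    /* pop rbp: reload the caller's rbp ...      */
    r.rsp += 8;           /* ... and release its 8-byte slot           */
    return r;
}
```

Starting from rsp = 0x1008, the prologue leaves rbp = 0x1000 and rsp = 0xFE0, so the 32 reserved bytes are the addresses 0xFE0 through 0xFFF, and the epilogue returns both registers to their starting values.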

What is SPL, and why does masking it make something 16-byte aligned?

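and rsp,-16 forces the four least significant bits to zero. Because the stack grows downward, this rounds the stack pointer down to a 16-byte boundary. That alignment is needed for SSE code, which x64 uses for floating-point math. Having 16-byte alignment lets the compiler use the faster aligned SSE load and store instructions.

SPL is the low 8 bits of RSP. Why a compiler would choose to use it here makes no sense. Both instructions are 4 bytes long, and and rsp,-16 is strictly better because it does not cause a partial register update.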
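The masking step can be checked with plain integer arithmetic. The sketch below is only an illustration in C; `align_down_16` is a hypothetical helper name, not something from the original code.

```c
#include <stdint.h>

/* Models: and rsp, -16.
   In two's complement, -16 is ...FFF0, so the AND clears bits 0..3
   and leaves every other bit unchanged. The result is the nearest
   multiple of 16 that is less than or equal to the input. */
static uint64_t align_down_16(uint64_t rsp)
{
    return rsp & (uint64_t)-16;
}
```

For example, 0xFE8 becomes 0xFE0, while an already aligned value such as 0xFE0 is left as it is. The value never increases, so the space reserved earlier stays above the new stack pointer.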

Disassembly:

0:  40 80 e4 f0       and    spl,-16   ;bad! partial register update.
4:  48 83 e4 f0       and    rsp,-16   ;good
8:  83 e4 f0          and    esp,-16   ;not possible will zero upper 32 bits of rsp
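As a rough illustration of what each of the three encodings does to the full 64-bit register value, the following C sketch models the architectural result only; it says nothing about the performance cost of the partial register write. The function names are hypothetical.

```c
#include <stdint.h>

/* Models: and spl, -16.
   Only the low 8 bits are written; bits 8..63 keep their old value. */
static uint64_t and_spl(uint64_t rsp)
{
    uint8_t low = (uint8_t)((uint8_t)rsp & 0xF0u);
    return (rsp & ~(uint64_t)0xFF) | low;
}

/* Models: and rsp, -16. The whole 64-bit register is written. */
static uint64_t and_rsp(uint64_t rsp)
{
    return rsp & ~(uint64_t)0xF;
}

/* Models: and esp, -16.
   A 32-bit write zero-extends into the 64-bit register, so the
   upper 32 bits of rsp are lost. */
static uint64_t and_esp(uint64_t rsp)
{
    return (uint64_t)((uint32_t)rsp & 0xFFFFFFF0u);
}
```

With this mask the 8-bit and 64-bit forms produce the same value, since only bits 0..3 change; the 32-bit form corrupts any stack pointer that lies above 4 GiB.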

[RSP is] just another register, right?

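No, RSP is special. It points to the stack, which is where the PUSH and POP instructions operate. All local variables and any parameters that do not fit in registers are stored on the stack.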
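A toy model may help show how PUSH and POP are tied to RSP. This is a simplified sketch, assuming a small byte array in place of real memory and an offset in place of a real address; the `Stack` type and its helpers are hypothetical, and bounds checks are left out for brevity.

```c
#include <stdint.h>
#include <string.h>

enum { STACK_BYTES = 256 };

/* Hypothetical toy stack: rsp is an offset into mem, not a real address. */
typedef struct {
    uint8_t  mem[STACK_BYTES];
    uint64_t rsp;
} Stack;

/* An empty stack starts with rsp just past the highest byte. */
static void stack_init(Stack *s)
{
    memset(s->mem, 0, sizeof s->mem);
    s->rsp = STACK_BYTES;
}

/* Models: push reg. Move rsp down by 8, then store at the new top. */
static void push64(Stack *s, uint64_t value)
{
    s->rsp -= 8;
    memcpy(&s->mem[s->rsp], &value, sizeof value);
}

/* Models: pop reg. Load from the current top, then move rsp up by 8. */
static uint64_t pop64(Stack *s)
{
    uint64_t value;
    memcpy(&value, &s->mem[s->rsp], sizeof value);
    s->rsp += 8;
    return value;
}
```

Pushing two values moves rsp from 256 down to 240, and popping returns them in reverse order while rsp climbs back to 256. This is why changing RSP by hand changes where the next PUSH or POP lands.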

Understanding fastcall

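On x64 there is only one calling convention. To make things more confusing, if you specify a calling convention other than __fastcall, most compilers will remap it to __fastcall on x64.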

Understanding the fastcall stack frame
