Question

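I'm learning assembly to make my C++ more efficient, and I'm trying to write a vector library using SIMD instructions. I need to be able to access individual elements from time to time, and I'd like to know whether there is a simpler way to do it than using VextractF128 and Movlpd / Movhpd: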

.Data
vecta STRUCT 16
    x REAL8 ?
    y REAL8 ?
    z REAL8 ?
    w REAL8 ?
vecta ENDS

vectb UNION     ;If I understand correctly this will force anything in a to be in b as well
    a YMMWORD ? ;since they share the same space
    b vecta {?,?,?,?}
vectb ENDS

.CODE
Somefunc PROC   ;uses __vectorcall convention and has one parameter to be passed in YMM0
    VMOVAPD [vectb.a], YMM0
    MOVSD   XMM2, [vectb.b.x]  ;this gives the error
    ; make other changes to vectb
    VMOVAPD YMM0, [vectb.a]
    RET
Somefunc ENDP

I also set the /arch:SSE2 compiler option, but it doesn't seem to help. Other things I've tried:

Somefunc PROC
    VMOVAPD [vecta.x],YMM0 ; compiler seems to think this is ok
    MOVSD   XMM2, [vecta.x]; as this line is still the only error
    ...
Somefunc ENDP

And:

Somefunc PROC
    VMOVAPD [vectb.a], YMM0
    MOVSD   XMM2, [vectb.b] ;Now gives a different error :[A2009]"syntax error in expression"
    ...
Somefunc ENDP

Answer 1

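"I'm trying to write a vector library using SIMD instructions ... to make my C++ more efficient"

This is a code review on that basis. I hope it helps you improve the efficiency and quality of your code.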

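As Intel explains in a nice article with diagrams, mixing VEX-encoded instructions with non-VEX instructions is a critical performance mistake. Use vmovsd, and the v versions of any other 128b operations you want to do, unless you run vzeroupper after your last use of 256b instructions.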
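If part of the library ends up in C++ rather than MASM, the same rule can be followed with intrinsics. The sketch below is only an illustration: `sum4` and the `USE_AVX` macro are hypothetical names, and the macro exists so that GCC/Clang accept AVX intrinsics without `-mavx` (MSVC needs no attribute). It does 256b work and then issues `vzeroupper` through `_mm256_zeroupper()` before returning to a caller that might run legacy-SSE code. Compilers often insert `vzeroupper` on their own, so treat the explicit call as a way of making the point visible.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang need AVX enabled per function unless built with -mavx.
#if defined(__GNUC__) || defined(__clang__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

// Hypothetical helper: sums four doubles with AVX, then clears the upper
// halves of the YMM registers before handing control back to the caller.
USE_AVX double sum4(const double* p) {
    __m256d v  = _mm256_loadu_pd(p);                // {p0, p1, p2, p3}
    __m128d lo = _mm256_castpd256_pd128(v);         // {p0, p1}, no instruction needed
    __m128d hi = _mm256_extractf128_pd(v, 1);       // {p2, p3}
    __m128d s  = _mm_add_pd(lo, hi);                // {p0+p2, p1+p3}
    __m128d sh = _mm_unpackhi_pd(s, s);             // {p1+p3, p1+p3}
    double r   = _mm_cvtsd_f64(_mm_add_sd(s, sh));  // p0+p1+p2+p3
    _mm256_zeroupper();                             // vzeroupper: low 128 bits are untouched
    return r;
}
```

Inside a function compiled for AVX, the 128b intrinsics are emitted in their VEX form (`vaddpd`, `vunpckhpd`, `vaddsd`), so nothing here mixes encodings.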

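For details on writing efficient x86 asm, see Agner Fog's Optimizing Assembly guide. There's a lot of good stuff in there:

How to decide which instructions to use, based on the performance characteristics of specific microarchitectures.
How to rearrange data in vectors. There's a whole set of tables for this, e.g. "instructions that combine data from two vectors" or "instructions that can broadcast within a vector".
How to think about asm optimization in terms of dependency chains, latency, and throughput.
How to deal with the ABI differences between Windows and everything else.
Instruction tables and detailed microarchitecture information.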

For more links, see also the x86 tag wiki.

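"I need to be able to access individual elements from time to time, and would like to know if there is a simpler way than using VextractF128 and Movlpd / Movhpd"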

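Yes, but it's slower. For best performance, you (or your C++ compiler) generally need to use shuffle instructions rather than storing to memory and reloading. movlpd / movhpd only work as stores/loads, not between registers. But you can use movhlps for the same purpose: merging the high 64 bits of one register into the low element of another.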

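Spilling to memory, then reloading and modifying that memory, has significant latency (about 5 cycles per memory round trip). On top of that, a wide vector load from memory you just wrote with multiple narrow stores suffers a store-forwarding failure, costing roughly another 10 cycles of latency.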

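So even if Somefunc did nothing but store, reload a scalar, store the scalar again, and reload the vector, it would add something like 20 cycles of latency (IIRC, on Intel Haswell) to the dependency chain involving its input/output.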

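Don't store/reload to get the low element (.x): it is already the low element of the whole vector, and you can use it directly with vmulsd or whatever.

e.g. you should use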

Somefunc PROC   ;uses __vectorcall convention and has one parameter to be passed in YMM0


    ;; VMOVAPD    [vectb.a], YMM0    ; don't do this, it was a bad plan

    ; MOVSD   XMM2, [vectb.b.x]  ;this gives the error
    ;; should be:
    vmovapd    xmm2, xmm0    ; the low element of xmm2 now contains the low element of xmm0.   The high128 of ymm2 is zeroed (instead of preserved like movapd would).
    ; or better: don't even copy it at all.  You can use `xmm0` as a source operand for `v...sd` scalar instructions just fine.


    ;;; Or, if you needed the high double zeroed, use
    vxorps     xmm3, xmm3, xmm3        ; zero ymm3 (not a typo: upper 128 zeroed implicitly).
    vmovsd     xmm2, xmm3, xmm0        ; merge low double of xmm0 into the all-zeros, putting the result in xmm2 while keeping our all-zeros around for future use.

    ;; get  .y:
    vmovhlps   xmm1, xmm3, xmm0        ; merge the high 64b of xmm0 with all-zeros, putting the result in xmm1

    vextractf128  xmm4, ymm0, 1        ; .z in the low element of xmm4 (garbage, i.e. .w, in the high element)

    vmovhlps   xmm5, xmm3, xmm4        ; .w in the low element, zero in the high element


    ; make other changes to vectb


    ;; re-combine with vunpcklpd to combine two scalars into the same vector
    ;; and vinsertf128 to rebuild the 256b vector

    ;; Storing and re-loading is not a good plan for re-combining either.
    ;; VMOVAPD    YMM0, [vectb.a]     ; store-forwarding failure here
    RET
Somefunc ENDP
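For the C++ side of the library, the same shuffle-based access can be sketched with AVX intrinsics. This is a hedged illustration, not the author's code: `get_elem`, `double_z`, and the `USE_AVX` macro are hypothetical names. The functions take a pointer only so the example stands alone; in real `__vectorcall` code the vector would already be in a register and the load/store at the edges would disappear.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang need AVX enabled per function unless built with -mavx.
#if defined(__GNUC__) || defined(__clang__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

// Hypothetical helper: returns element i (0 = x ... 3 = w) of the four
// doubles at p, using shuffles instead of a store/reload.
USE_AVX double get_elem(const double* p, int i) {
    __m256d v = _mm256_loadu_pd(p);
    // Pick the 128b half: the low half is free, the high half costs one vextractf128.
    __m128d half = (i & 2) ? _mm256_extractf128_pd(v, 1)
                           : _mm256_castpd256_pd128(v);
    // Odd elements sit in the high double of their half; bring them down.
    if (i & 1) half = _mm_unpackhi_pd(half, half);
    return _mm_cvtsd_f64(half);
}

// Hypothetical helper: doubles .z in place, rebuilding the vector with
// vunpcklpd + vinsertf128 rather than narrow stores and a wide reload.
USE_AVX void double_z(double* p) {
    __m256d v  = _mm256_loadu_pd(p);
    __m128d hi = _mm256_extractf128_pd(v, 1);   // {z, w}
    __m128d w  = _mm_unpackhi_pd(hi, hi);       // {w, w}
    __m128d z2 = _mm_add_sd(hi, hi);            // {z+z, w}: scalar op on the low element
    __m128d zw = _mm_unpacklo_pd(z2, w);        // {2z, w}: two scalars back into one vector
    v = _mm256_insertf128_pd(v, zw, 1);         // {x, y, 2z, w}
    _mm256_storeu_pd(p, v);
}
```

The `vunpcklpd` step in `double_z` is redundant here, because `vaddsd` already preserves .w in the high element. It is kept to show the general pattern for merging two independently computed scalars.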

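Your struct / union declarations:

You probably don't need the union. This is assembly language: just make the operand size explicit, which tells MASM not to complain about an operand-size mismatch based on how you defined the label.

e.g. vmovapd ymmword ptr [whatever you want], ymm0

More importantly, using a static buffer like this makes your function not thread-safe. If you need scratch space, reserve it on the stack. Get it 32B-aligned like this: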

;; Usually compilers will actually align the stack pointer to 32B
;; but if you can spare another integer register, I think you save insns doing this.
lea    rdx, [rsp-32]
sub    rsp, 48           ; assumes RSP was 16B-aligned
and    rdx, -32          ; Same as ~0x1f

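RDX now points to a 32B-aligned block of stack space, which is at either [rsp] or [rsp + 16] if RSP was 16B-aligned beforehand. If you don't know that, the AND could lower RDX to below RSP, which is unsafe if you don't have a red zone. (Windows doesn't; everything else does.) In that case, use sub rsp, 64.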
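The address arithmetic is easy to check by modelling the three instructions on plain integers. The sketch below is only a model: `scratch_block` is a hypothetical name, and no memory is touched.

```cpp
#include <cstdint>

// Models `lea rdx, [rsp-32]` followed by `and rdx, -32`: returns the value
// RDX would hold for a given RSP on entry. Addresses are plain integers
// here; nothing is dereferenced.
std::uint64_t scratch_block(std::uint64_t rsp_on_entry) {
    std::uint64_t rdx = rsp_on_entry - 32;  // lea rdx, [rsp-32]
    rdx &= ~std::uint64_t{31};              // and rdx, -32 (clears the low 5 bits)
    return rdx;
}
```

With RSP = 0x1000 on entry, RDX is 0xFE0 and the new RSP after `sub rsp, 48` is 0xFD0, so the block sits at [rsp + 16]. With RSP = 0x1010, RDX is again 0xFE0 and the new RSP is 0xFE0, so the block sits at [rsp]. With a completely unaligned RSP such as 0x101F, RDX is still 0xFE0, which is below 0x101F - 48 = 0xFEF but not below 0x101F - 64 = 0xFDF. That is why the unknown-alignment case needs `sub rsp, 64`.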

Answer 2

It seems you need to create a variable of type vectb:

.Data
...
...
vectc vectb {?}

.CODE
Somefunc PROC
    VMOVAPD   [vectc.a]  ,   YMM0
    MOVSD     XMM2       ,   [vectc.b.x]
    ...
Somefunc ENDP

Error A2070: "invalid instruction operands" for MOVSD (SSE2)
