Question

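I am trying to build a NASM library for a C program. I would like to round a floating-point number given as a parameter.

The C function prototype looks like this: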

double nearbyint(double x);

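I tried to use the frndint instruction, but I need to push my parameter onto the x87 stack first.

Here is what I came up with (it does not assemble):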

bits 64

section .text

global nearbyint

nearbyint:
    push    rbp
    mov     rbp, rsp

    fld     xmm0
    frndint
    fstp    xmm0

    leave
    ret

Answer 1

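The only way to get data between x87 and XMM registers is to bounce it through memory, e.g. movsd [rsp-8], xmm0 / fld qword [rsp-8], which uses the red zone.

But you don't need x87 at all, and you shouldn't use it if you want this to be efficient.

If you have SSE4.1, use roundsd to round to an integer.

rint: roundsd xmm0,xmm0, 0b0100 - current rounding mode (bit 2 = 1); the inexact exception flag is set if result != input (bit 3 = 0).
nearbyint: roundsd xmm0,xmm0, 0b1100 - current rounding mode (bit 2 = 1); the inexact exception is suppressed (bit 3 = 1).
roundsd xmm0,xmm0, 0b1000: rounding mode overridden (bit 2 = 0) to _MM_FROUND_TO_NEAREST_INT (bits [1:0] = 00). See the roundpd docs for a table. The inexact exception is suppressed (bit 3 = 1).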
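
As a compact reference, the immediate values above can be assembled from their bit fields. This is only a sketch in C: the enumerator names are made up for illustration and are not taken from any header. The meanings of bits [1:0] follow the table in the roundpd documentation.

```c
/* Bit fields of the roundsd/roundpd immediate.
   All names here are hypothetical, chosen for this sketch. */
enum {
    RC_NEAREST_EVEN  = 0x0,  /* bits [1:0] = 00: round to nearest (even) */
    RC_DOWN          = 0x1,  /* bits [1:0] = 01: round toward -Inf */
    RC_UP            = 0x2,  /* bits [1:0] = 10: round toward +Inf */
    RC_TRUNCATE      = 0x3,  /* bits [1:0] = 11: round toward zero */
    USE_MXCSR_MODE   = 0x4,  /* bit 2: ignore bits [1:0], use the current rounding mode */
    SUPPRESS_INEXACT = 0x8   /* bit 3: do not raise the inexact exception */
};

enum {
    IMM_RINT          = USE_MXCSR_MODE,                     /* 0b0100 */
    IMM_NEARBYINT     = USE_MXCSR_MODE | SUPPRESS_INEXACT,  /* 0b1100 */
    IMM_NEAREST_QUIET = RC_NEAREST_EVEN | SUPPRESS_INEXACT  /* 0b1000 */
};
```

Building the constant from named fields makes it harder to mix up rint and nearbyint, which differ only in bit 3.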

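Without SSE4.1 for roundsd, have a look at what glibc's rint does: it adds 2^52 (bit pattern 0x43300000, 0x00000000), producing a number so large that the nearest representable double is an integer. Normal FP rounding to the nearest representable value therefore rounds to an integer. An IEEE binary64 double has 52 explicit mantissa (a.k.a. significand) bits, so the size of this number is not a coincidence.

(For negative inputs, it uses -2^52.)

Subtracting that again gives you a rounded version of your original number.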
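
The add-and-subtract trick can be sketched in portable C. This is a minimal illustration, not glibc's actual code: the function name round_via_magic is hypothetical, the range guard is an addition of this sketch, and it assumes double arithmetic is carried out in IEEE binary64 (as with SSE2 on x86-64) under the current rounding mode.

```c
/* Round x to an integer in the current rounding mode by adding and then
   subtracting +-2^52. Hypothetical helper, not glibc's implementation. */
static double round_via_magic(double x)
{
    const double two52 = 4503599627370496.0;  /* 2^52, bits 0x4330000000000000 */

    /* Values with |x| >= 2^52 are already integers. Inf and NaN also fail
       this test (NaN compares false) and are returned unchanged. */
    if (!(x > -two52 && x < two52))
        return x;

    /* volatile stops the compiler from simplifying (x + m) - m back to x */
    volatile double magic = (x < 0.0) ? -two52 : two52;
    volatile double sum = x + magic;  /* rounding to an integer happens here */
    return sum - magic;
}
```

With the default round-to-nearest-even mode, round_via_magic(2.5) gives 2.0 and round_via_magic(3.5) gives 4.0. Note that round_via_magic(-0.3) gives +0.0, not -0.0, because this sketch does not restore the sign.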

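The glibc implementation checks for some special cases (such as Inf and NaN), and for exponents less than 0 (i.e. inputs with magnitude below 1.0) it copies the sign bit of the input into the result. (I think this is so that -0.499 rounds to -0.0 instead of 0, and to make sure 0.499 rounds to +0, not -0.)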
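
Copying the input's sign bit onto the result is a plain bitwise operation. The following C sketch shows the idea; the helper name with_sign_of is hypothetical, and it assumes double is IEEE binary64 with the sign in the top bit.

```c
#include <stdint.h>
#include <string.h>

/* Return value with its sign bit replaced by the sign bit of sign_source.
   Hypothetical helper for illustration only. */
static double with_sign_of(double value, double sign_source)
{
    const uint64_t sign_mask = UINT64_C(0x8000000000000000);
    uint64_t v, s;

    memcpy(&v, &value, sizeof v);        /* reinterpret the doubles as raw bits */
    memcpy(&s, &sign_source, sizeof s);

    v = (v & ~sign_mask) | (s & sign_mask);  /* clear the old sign, OR in the new one */

    memcpy(&value, &v, sizeof v);
    return value;
}
```

For example, with_sign_of(0.0, -0.3) yields -0.0, which is the result a sign-preserving rint would return for an input of -0.3.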

A simple way to implement this with SSE2 is to isolate the sign bit of the input (on a copy) with pand xmm0, [signbit_mask], then OR in the FP bit pattern 0x43300000 ..., giving you +-2^52.

default rel

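;; UNTESTED.  IDK if the SIGNBIT_FIXUP does anything other than +-0.0
;; Caveat: inputs with |x| >= 2^52 are not special-cased; odd integers in [2^52, 2^53) can come back changed.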
rint_sse2:
    ;movsd  xmm1, [signbit_mask]  ; can be cheaply constructed on the fly, unlike 2^52
    ;pand   xmm1, xmm0

    pcmpeqd  xmm1, xmm1
    psrlq    xmm1, 1             ; 0x7FFF...
%ifdef SIGNBIT_FIXUP
    movaps   xmm2, xmm1          ; copy the mask
%endif

    andnps   xmm1, xmm0          ; isolate sign bit

%ifdef SIGNBIT_FIXUP
    movaps   xmm3, xmm1          ; save another copy of the sign bit
%endif

    orps     xmm1, [big_double]  ; +-2^52
    addsd    xmm0, xmm1
    subsd    xmm0, xmm1

%ifdef SIGNBIT_FIXUP
    andps    xmm0, xmm2          ; clear the sign bit
    orps     xmm0, xmm3          ; apply the original sign
%endif
    ret

section .rodata
align 16
   big_double: dq 0x4330000000000000   ; 2^52
   ; a 16-byte load will grab whatever happens to be next
   ; if you want a packed rint(__m128d), use   times 2 dq ....

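Especially if you leave out the SIGNBIT_FIXUP stuff, this is quite cheap, and in terms of FP latency it is not much more expensive than roundsd's 2 uops. (On most CPUs, roundsd has the same latency as addsd + subsd. That is almost certainly not a coincidence, but roundsd does avoid any separate operations to sort out the sign.)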
