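Atomic test-and-set in x86: inline asm or compiler-generated lock bts?

Asked: 2016-01-22 06:29:40

Tags: assembly x86 icc xeon-phi

The following code, when compiled for Xeon Phi, fails with Error: cmovc is not supported on k1om

But it compiles fine for a regular Xeon processor.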

#include<stdio.h>
int main()
{
    int in=5;
    int bit=1;
    int x=0, y=1;
    int& inRef = in;
    printf("in=%d\n",in);
    asm("lock bts %2,%0\ncmovc %3,%1" : "+m" (inRef), "+r"(y) : "r" (bit), "r"(x));
    printf("in=%d\n",in);
}

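Compiler: icc (ICC) 13.1.0 20130121

Related question: bit test and set (BTS) on a tbb atomic variable

1 answer:

Answer 0: (score: 3)

IIRC, the first-generation Xeon Phi is based on P5 cores (Pentium and Pentium MMX). cmov wasn't introduced until P6 (aka Pentium Pro). So I think this is normal.

Let the compiler do its job: write a normal ternary operator.

Secondly, cmov is a worse choice than setc here, since you want to produce a 0 or 1 based on the carry flag. See my asm code below.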

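Also note that bts with a memory operand is super slow, so you don't want the compiler to generate that code, especially on CPUs that decode x86 instructions into uops (like a modern Xeon). According to Agner Fog's instruction tables (http://agner.org/optimize/), even on P5, bts m, r is much slower than bts m, i, so don't do that.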

Just ask the compiler to put in in a register, or better yet, don't use inline asm at all.

Since the OP apparently wants this to work atomically, the best solution is to use C++11 std::atomic::fetch_or and leave it to the compiler to generate lock bts.
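A minimal sketch of that approach, written against the question's example. The helper names test_and_set_bit and select_if_was_set are hypothetical, and masking the bit index with 31 assumes a 32-bit unsigned. The ternary replaces the hand-written cmovc, leaving instruction selection to the compiler:

```cpp
#include <atomic>

// Hypothetical helper: atomically set bit `bit` of `word` and report
// whether it was already set. Assumes `unsigned` is 32 bits wide.
bool test_and_set_bit(std::atomic<unsigned>& word, unsigned bit) {
    const unsigned mask = 1u << (bit & 31u);
    const unsigned old = word.fetch_or(mask, std::memory_order_acq_rel);
    return (old & mask) != 0;
}

// Mirrors the question's "lock bts; cmovc" pair: the result is x if the
// bit was already set (carry would be 1), otherwise y is returned unchanged.
int select_if_was_set(std::atomic<unsigned>& word, unsigned bit, int x, int y) {
    return test_and_set_bit(word, bit) ? x : y;
}
```

With the question's values (in = 5, bit = 1, x = 0, y = 1), bit 1 is initially clear, so the first call returns y and leaves the word at 7; a second call with the same arguments returns x.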

std::atomic_flag has a test_and_set function, but IDK if there's a way to pack them tightly. Maybe as bitfields in a struct? Unlikely, though. I also don't see atomic operations for std::bitset.
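For reference, a small sketch of std::atomic_flag::test_and_set used as a one-shot claim. The function name try_claim is hypothetical. Each flag is a separate object of at least one byte, so this does not provide the packed per-bit layout discussed above:

```cpp
#include <atomic>

// Hypothetical helper: returns true only for the first caller that
// claims the flag. test_and_set returns the previous value, so a
// previous value of false means this caller won.
bool try_claim(std::atomic_flag& flag) {
    return !flag.test_and_set(std::memory_order_acq_rel);
}

// Every atomic_flag is its own object, so 32 flags cost at least 32 bytes
// rather than 32 bits.
static_assert(sizeof(std::atomic_flag) >= 1, "one object per flag");
```

A caller would declare the flag as `std::atomic_flag f = ATOMIC_FLAG_INIT;` and call `try_claim(f)`.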

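Unfortunately, current versions of gcc and clang don't generate lock bts from fetch_or, even when the much faster immediate-operand form is usable. I came up with the following (godbolt link):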

#include <atomic>
#include <stdint.h>  // uint8_t, used by atomic_bts_asm2
#include <stdio.h>

// wastes instructions when the return value isn't used.
// gcc 6.0 has syntax for using flags as output operands

// IDK if lock BTS is better than lock cmpxchg.
// However, gcc doesn't use lock BTS even with -Os
int atomic_bts_asm(std::atomic<unsigned> *x, int bit) {
  int retval = 0;  // the compiler still provides a zeroed reg as input even if retval isn't used after the asm :/
  // Letting the compiler do the xor means we can use a m constraint, in case this is inlined where we're storing to already zeroed memory
  // It unfortunately doesn't help for overwriting a value that's already known to be 0 or 1.
  asm( // "xor      %[rv], %[rv]\n\t"
       "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"  // hope that the compiler zeroed with xor to avoid a partial-register stall
        : [x] "+m" (*x), [rv] "+rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}

// save an insn when retval isn't used, but still doesn't avoid the setc
// leads to the less-efficient setc/ movzbl sequence when the result is needed :/
int atomic_bts_asm2(std::atomic<unsigned> *x, int bit) {
  uint8_t retval;
  asm( "lock bts %[bit], %[x]\n\t"
       "setc     %b[rv]\n\t"
        : [x] "+m" (*x), [rv] "=rm"(retval)
        : [bit] "ri" (bit));
  return retval;
}


int atomic_bts(std::atomic<unsigned> *x, unsigned int bit) {
  // bit &= 31; // stops gcc from using shlx?
  unsigned bitmask = 1u << bit;  // unsigned literal: 1<<31 would overflow a signed int
  //int oldval = x->fetch_or(bitmask, std::memory_order_relaxed);

  int oldval = x->fetch_or(bitmask, std::memory_order_acq_rel);
  // acquire and release semantics are free on x86
  // Also, any atomic rmw needs a lock prefix, which is a full memory barrier (seq_cst) anyway.

  if (oldval & bitmask)
    return 1;
  else
    return 0;
}

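xor / set-flags / setc is the optimal sequence on all modern CPUs when the result is needed as a 0 or 1 value. I didn't actually consider P5 when working that out, but setcc is fast on P5, so it should be fine.

Of course, if you want to branch on the result rather than store it, the boundary between inline asm and C is an obstacle. Spending two instructions to store a 0 or 1, only to test and branch on it, would be pretty dumb.

If it's an option, gcc6's flag-output operand syntax would certainly be worth a try. (Probably not an option if you need a compiler that targets Intel MIC.)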
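A hedged sketch of what the flag-output form could look like. The function name atomic_bts_flagout is hypothetical, and the asm path is only taken when the compiler defines __GCC_ASM_FLAG_OUTPUTS__ on an x86 target; otherwise it falls back to fetch_or. The "=@ccc" output hands the carry flag to the compiler directly, so it can emit setc or branch on it as it sees fit. The bit index is masked to 0..31 (assuming a 32-bit unsigned), because bts with a register index and a memory operand can otherwise address bits outside the target word:

```cpp
#include <atomic>

// Hypothetical helper: atomically set a bit and return whether it was
// already set, without a hand-written setc.
inline bool atomic_bts_flagout(std::atomic<unsigned>* x, unsigned bit) {
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    bool was_set;
    asm volatile("lock btsl %[bit], %[x]"
                 : [x] "+m"(*x), "=@ccc"(was_set)  // carry flag as an output
                 : [bit] "Ir"(bit & 31u)           // immediate 0..31 or register
                 : "memory");                      // also a compiler barrier
    return was_set;
#else
    // Portable fallback: same result via fetch_or.
    const unsigned mask = 1u << (bit & 31u);
    return (x->fetch_or(mask, std::memory_order_seq_cst) & mask) != 0;
#endif
}
```

Used in a condition such as `if (atomic_bts_flagout(&word, 3)) { ... }`, this gives the compiler the chance to branch on the flag directly instead of materializing a 0 or 1 first.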