Question

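I'm working with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the high 64 bits as fast as possible. Since AVX2 has no such instruction, is it reasonable to use the Karatsuba algorithm for efficiency and speed?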

Answer 1

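No. On modern architectures, the crossover at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (e.g. between 512 and 1536 bits on x86_64). For fixed sizes, the threshold is at the smaller end of that range, and the new ADCX/ADOX instructions likely bring it in somewhat further for scalar code, but 64x64 is still too small to benefit from Karatsuba.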
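For reference, here is a minimal sketch of the schoolbook route for the high half, assuming the operands are split into 32-bit limbs. The function name `mulhi64_schoolbook` is made up for illustration. It uses four 32x32 multiplies; Karatsuba would trade one of them for extra additions and carry handling, which is hard to win back at this size.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b, built from four
 * 32x32->64 partial products (schoolbook multiplication). */
static uint64_t mulhi64_schoolbook(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Middle column: carry out of lo_lo, low half of hi_lo, all of lo_hi.
     * The sum is at most 2^64 - 1, so it cannot overflow. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    /* High column: hi_hi plus the carries from the middle column. */
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}
```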

Answer 2

It's highly unlikely that AVX2 will beat the mulx instruction, which does 64bx64b to 128b in one instruction. There is one exception I'm aware of: large multiplications using floating point FFT.

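However, if you don't need exactly 64bx64b to 128b, you could consider 53bx53b to 106b using double-double arithmetic.

Multiplying four 53-bit numbers a and b to get four 106-bit numbers takes only two instructions: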

__m256d p = _mm256_mul_pd(a,b);
__m256d e = _mm256_fmsub_pd(a,b,p);

That gives four 106-bit numbers in two instructions, compared to one 128-bit number in one instruction using mulx.
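A sketch of why the two instructions are enough, assuming round-to-nearest and no overflow or underflow. The notation RN (round to nearest double) is introduced here for illustration; p and e are the variables from the snippet.

```latex
\begin{aligned}
p &= \mathrm{RN}(a \cdot b)                       && \text{rounded product (mul)} \\
e &= \mathrm{RN}(a \cdot b - p) = a \cdot b - p   && \text{fused multiply-subtract, exact} \\
a \cdot b &= p + e, \qquad |e| \le \tfrac{1}{2}\,\mathrm{ulp}(p)
\end{aligned}
```

The exact product of two 53-bit significands has at most 106 bits. p holds the leading 53 of them, and the residual a*b - p fits in a double, so the fused operation returns it without any rounding error. The unevaluated pair (p, e) is the 106-bit result.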

Answer 3

It's hard to tell without trying, but it might be faster to just use the AMD64 MUL instruction, which supports 64x64=128 with the same throughput as most AVX2 instructions (but is not vectorized). The drawback is that you need to move the operands to general-purpose registers if they were in YMM registers. That would give something like LOAD + MUL + STORE for a single 64x64=128.
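As a rough sketch of the scalar route, assuming the operands are already in memory (for YMM data you would first store the vector to a buffer): the helper name `mulhi64_array` is hypothetical, and `unsigned __int128` is a GCC/Clang extension rather than ISO C. On x86-64 these compilers typically turn the widening multiply into a single MUL, or MULX when BMI2 is enabled.

```c
#include <stddef.h>
#include <stdint.h>

/* For each i, store the high 64 bits of the 128-bit product a[i]*b[i].
 * Each iteration is roughly LOAD + MUL + STORE. */
static void mulhi64_array(const uint64_t *a, const uint64_t *b,
                          uint64_t *hi, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Widen one operand so the multiplication is done in 128 bits. */
        unsigned __int128 wide = (unsigned __int128)a[i] * b[i];
        hi[i] = (uint64_t)(wide >> 64);
    }
}
```

With n = 4 this covers the contents of one YMM register. Whether it beats a vectorized version is something to measure, not assume.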

If you can vectorize Karatsuba in AVX2, try both AVX2 and MUL and see which is faster. If you can't vectorize, a single MUL will probably be faster. If you can avoid the load and store to regular registers, a single MUL will definitely be faster.

Both MUL and AVX2 instructions can have an operand in memory with the same throughput, and it may help to remove one load for MUL.

Is it really efficient to use the Karatsuba algorithm for 64-bit x 64-bit multiplication?

3 Answers: