Question

I am currently trying to implement an AVX2 version (Haswell CPU) of some existing scalar code, which implements a step like this:

struct entry {
  uint32_t low, high;
};

// both filled with "random" data in previous loops
std::vector<entry> table;
std::vector<int>   queue;  // this is strictly increasing but
                           // without a constant delta

for (auto index : queue) {
  auto v = table[index];
  uint32_t rank = v.high + __builtin_popcount(_bzhi_u32(v.low, index % 32));
  use_rank(rank); // contains a lot of integer operations which nicely map to avx2
}

I have implemented this with two gather instructions that each load one int32, like this:

__m256i v_low  = _mm256_i32gather_epi32 (reinterpret_cast<int *>(table.data()) + 0, index, 8);
__m256i v_high = _mm256_i32gather_epi32 (reinterpret_cast<int *>(table.data()) + 1, index, 8);

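Is there a faster way to load these values? I thought about using two 64-bit loads (which issue only half as many reads => less traffic for the execution ports) and then shuffling the resulting vectors to get v_low and v_high, but sadly, as far as I know, the shuffle functions only allow shuffling within each 128-bit half separately.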

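Edit for Paul R: This code is part of a substring enumeration routine using the Burrows Wheeler Transform that I use in my compression algorithm. table contains rank data on a bit vector. The high part contains the number of ones in the previous entries, and the low part gets masked and popcounted, then added, to get the final number of set bits in front of the given index. Afterwards much more computation happens, which fortunately parallelizes well.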
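To make the rank layout described above concrete, here is a minimal scalar sketch. It is not the poster's code: `build_rank_table` and `rank1` are hypothetical names, the sketch assumes one entry per 32-bit word of the bit vector (so it indexes the table with `bit_index / 32`, which the snippet in the question does not show), and it replaces `_bzhi_u32` with a plain mask so it builds without BMI2. It still relies on the GCC/Clang builtin `__builtin_popcount` used in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct entry {
  uint32_t low, high;
};

// Hypothetical helper: build one entry per 32-bit word of the bit vector.
// low  = the raw word, high = number of set bits in all previous words.
std::vector<entry> build_rank_table(const std::vector<uint32_t>& words) {
  std::vector<entry> table;
  table.reserve(words.size());
  uint32_t ones_so_far = 0;
  for (uint32_t w : words) {
    table.push_back({w, ones_so_far});
    ones_so_far += static_cast<uint32_t>(__builtin_popcount(w));
  }
  return table;
}

// Number of set bits strictly in front of bit_index.
uint32_t rank1(const std::vector<entry>& table, uint32_t bit_index) {
  assert(bit_index / 32 < table.size());
  const entry v = table[bit_index / 32];  // assumption: one entry per word
  // Portable stand-in for _bzhi_u32(v.low, bit_index % 32); the shift
  // amount is always below 32, so the expression is well defined.
  const uint32_t mask = (1u << (bit_index % 32)) - 1u;
  return v.high + static_cast<uint32_t>(__builtin_popcount(v.low & mask));
}
```

Each query touches exactly one 8-byte entry, which is why the AoS layout keeps both halves of the lookup in the same cache line.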

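The deltas in the queue are very high at the beginning and at the end (due to the nature of the algorithm). This causes a lot of cache misses and is the reason I switched from SoA to AoS, using shifts to reduce the pressure on the load ports in the scalar code.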

Using SoA would also result in the same independent gather instructions, but would double the number of accessed cache lines.

Edit (partial answer): I tried using two _mm256_i32gather_epi64 to halve the number of memory accesses (and therefore the cycles, see here).

__m256i index; // contains the indices
__m128i low = _mm256_extractf128_si256(index, 0);
__m128i high = _mm256_extractf128_si256(index, 1);
__m256i v_part1 = _mm256_i32gather_epi64(reinterpret_cast<long long int*>(table.data()), low , 8);
__m256i v_part2 = _mm256_i32gather_epi64(reinterpret_cast<long long int*>(table.data()), high, 8);

This loads my data into two ymm registers in this format (no c++):

register v_part1:
[v[0].low][v[0].high][v[1].low][v[1].high][v[2].low][v[2].high][v[3].low][v[3].high]
register v_part2:
[v[4].low][v[4].high][v[5].low][v[5].high][v[6].low][v[6].high][v[7].low][v[7].high]

Is there an efficient way to interleave them to obtain the original format:

register v_low:
[v[0].low][v[1].low][v[2].low][v[3].low][v[4].low][v[5].low][v[6].low][v[7].low]
register v_high:
[v[0].high][v[1].high][v[2].high][v[3].high][v[4].high][v[5].high][v[6].high][v[7].high]

Answer 1

I found a way myself to reorder the values using 5 instructions:

// this results in [01][45][23][67] when gathering
index = _mm256_permute4x64_epi64(index, _MM_SHUFFLE(3,1,2,0));

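// gather the values
// (i is the base pointer; presumably reinterpret_cast<long long int*>(table.data()) as in the question)
__m256i v_part1 = _mm256_i32gather_epi64(i, _mm256_extractf128_si256(index, 0), 8);
__m256i v_part2 = _mm256_i32gather_epi64(i, _mm256_extractf128_si256(index, 1), 8);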

// separates low and high values within each 128-bit half
v_part1 = _mm256_shuffle_epi32(v_part1, _MM_SHUFFLE(3,1,2,0));
v_part2 = _mm256_shuffle_epi32(v_part2, _MM_SHUFFLE(3,1,2,0));

// unpack merges lows and highs: [01][23][45][67]
__m256i o1 = _mm256_unpackhi_epi64(v_part1, v_part2); // v_high
__m256i o2 = _mm256_unpacklo_epi64(v_part1, v_part2); // v_low
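
The reordering above can be checked end to end with a sketch like the following. The function names `gather_scalar` and `gather_avx2` are hypothetical, and the sketch assumes GCC or Clang, since it uses `__attribute__((target("avx2")))` so that it compiles without `-mavx2`; the AVX2 function must only be called on a CPU that supports AVX2. It follows the same sequence as the answer (permute the indices, two 64-bit gathers, in-lane shuffle, unpack) and writes the results to plain arrays so they can be compared with a scalar reference.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

struct entry {
  uint32_t low, high;
};
static_assert(sizeof(entry) == 8, "each entry must fit one 64-bit gather lane");

// Scalar reference: fetch low/high for eight indices.
void gather_scalar(const entry* table, const int32_t* idx,
                   uint32_t* low, uint32_t* high) {
  for (int k = 0; k < 8; ++k) {
    assert(idx[k] >= 0);
    low[k] = table[idx[k]].low;
    high[k] = table[idx[k]].high;
  }
}

// AVX2 version: two 64-bit gathers plus the reordering from the answer.
__attribute__((target("avx2")))
void gather_avx2(const entry* table, const int32_t* idx,
                 uint32_t* low, uint32_t* high) {
  const long long* base = reinterpret_cast<const long long*>(table);
  __m256i index = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));

  // Reorder index pairs to [01][45][23][67] so the final unpack ends in order.
  index = _mm256_permute4x64_epi64(index, _MM_SHUFFLE(3, 1, 2, 0));

  // Each gather loads four whole entries (low and high together).
  __m256i p1 = _mm256_i32gather_epi64(base, _mm256_castsi256_si128(index), 8);
  __m256i p2 = _mm256_i32gather_epi64(base, _mm256_extracti128_si256(index, 1), 8);

  // Within each 128-bit half: [lo hi lo hi] -> [lo lo hi hi].
  p1 = _mm256_shuffle_epi32(p1, _MM_SHUFFLE(3, 1, 2, 0));
  p2 = _mm256_shuffle_epi32(p2, _MM_SHUFFLE(3, 1, 2, 0));

  // unpacklo collects the lows, unpackhi the highs, both in index order.
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(low),
                      _mm256_unpacklo_epi64(p1, p2));
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(high),
                      _mm256_unpackhi_epi64(p1, p2));
}
```

Comparing the two functions on the same indices, after a runtime check such as `__builtin_cpu_supports("avx2")`, is a cheap way to catch an off-by-one in the shuffle constants before benchmarking.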

AVX2 gather load of a struct of two integers
