Question

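I noticed that Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, mention an integer compression method called "group varint encoding". It is said to be much faster than 7-bits-per-byte integer encoding (about 2X faster). I am very interested in this and am looking for an implementation, or any details that could help me implement it myself.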

I'm not a professional, though not a complete newbie either, so any help is welcome!

Answer 1

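That refers to "variable integer encoding", where the number of bytes used to store an integer when serialized is not fixed at 4. There is a good description of varint in the protocol buffer documentation.

It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.

CodedOutputStream contains the exact encoding function, WriteVarint32FallbackToArrayInline: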

inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
    uint32 value, uint8* target) {
  target[0] = static_cast<uint8>(value | 0x80);
  if (value >= (1 << 7)) {
    target[1] = static_cast<uint8>((value >>  7) | 0x80);
    if (value >= (1 << 14)) {
      target[2] = static_cast<uint8>((value >> 14) | 0x80);
      if (value >= (1 << 21)) {
        target[3] = static_cast<uint8>((value >> 21) | 0x80);
        if (value >= (1 << 28)) {
          target[4] = static_cast<uint8>(value >> 28);
          return target + 5;
        } else {
          target[3] &= 0x7F;
          return target + 4;
        }
      } else {
        target[2] &= 0x7F;
        return target + 3;
      }
    } else {
      target[1] &= 0x7F;
      return target + 2;
    }
  } else {
    target[0] &= 0x7F;
    return target + 1;
  }
}

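The cascading ifs append extra bytes to the target array only when the magnitude of value warrants them. Each byte written holds the next 7 bits of value (value is shifted down by 7 for each byte), and OR'ing with 0x80 sets that byte's highest bit. From what I can tell, the 0x7F mask marks the "last byte of the encoding": OR'ing with 0x80 always sets the highest bit to 1, and the last byte then has its highest bit cleared by AND'ing with 0x7F. So when reading varints, you keep reading until you get a byte with a zero in the highest bit.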
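To make the reading side concrete, here is a minimal sketch of a loop-based encoder and the matching decoder. The names EncodeVarint32, DecodeVarint32 and VarintRoundTripExample are my own, not taken from the protocol buffer source, and the real library's code is unrolled and more heavily optimized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends `value` to `out` as a base-128 varint: 7 payload bits per byte,
// with the high bit set on every byte except the last one.
inline void EncodeVarint32(uint32_t value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Reads one varint starting at data[pos] and advances pos past it.
// Returns false if the input ends early or the varint is longer than 5 bytes.
inline bool DecodeVarint32(const std::vector<uint8_t>& data, size_t& pos,
                           uint32_t& value) {
  uint32_t result = 0;
  for (int shift = 0; shift <= 28; shift += 7) {
    if (pos >= data.size()) return false;
    const uint8_t byte = data[pos++];
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {  // High bit clear: this was the last byte.
      value = result;
      return true;
    }
  }
  return false;
}

// Usage sketch: 300 encodes to the two bytes 0xAC 0x02 and round-trips.
inline void VarintRoundTripExample() {
  std::vector<uint8_t> buf;
  EncodeVarint32(300, buf);
  assert(buf.size() == 2 && buf[0] == 0xAC && buf[1] == 0x02);
  size_t pos = 0;
  uint32_t decoded = 0;
  const bool ok = DecodeVarint32(buf, pos, decoded);
  assert(ok && decoded == 300 && pos == 2);
  (void)ok;
}
```

The decoder has to test the high bit of every byte before it knows whether to continue, and that per-byte branch is the cost Group VarInt is designed to avoid.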

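I just realized that you asked specifically about "Group VarInt encoding". Sorry, the code above is the basic VarInt encoding, i.e. the 7-bits-per-byte scheme that the slides use as the baseline, not the faster group variant. The basic idea looks similar, though. Unfortunately, Group VarInt is not what protocol buffers use to store 64-bit numbers. I wouldn't be surprised if that code were open-sourced somewhere, though.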

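Using the ideas from varint and the "Group Varint" diagrams in the slides, it shouldn't be too hard to cook up your own :)

Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately, they allude to publicly available implementations but do not provide references.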

#include <stdint.h>
#include <string.h>

typedef unsigned char byte;

// Loads 4 bytes from a possibly unaligned address. Assumes a little-endian
// host and at least 3 readable padding bytes after the compressed data.
static inline uint32_t LoadU32(const byte* p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}

// Decodes groups of four delta-encoded values; `size` must cover whole groups.
void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
  const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
  const byte* limit = compressed + size;
  uint32_t current_value = 0;
  while (compressed != limit) {
    // One selector byte holds four 2-bit fields, each storing (length - 1).
    const uint32_t selector = *compressed++;
    const uint32_t selector1 = (selector & 3);
    current_value += LoadU32(compressed) & MASK[selector1];
    *uncompressed++ = current_value;
    compressed += selector1 + 1;
    const uint32_t selector2 = ((selector >> 2) & 3);
    current_value += LoadU32(compressed) & MASK[selector2];
    *uncompressed++ = current_value;
    compressed += selector2 + 1;
    const uint32_t selector3 = ((selector >> 4) & 3);
    current_value += LoadU32(compressed) & MASK[selector3];
    *uncompressed++ = current_value;
    compressed += selector3 + 1;
    const uint32_t selector4 = (selector >> 6);
    current_value += LoadU32(compressed) & MASK[selector4];
    *uncompressed++ = current_value;
    compressed += selector4 + 1;
  }
}
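Since that page only shows decoding, here is a rough sketch of an encoder that produces the layout the decoder above expects: one selector byte per group of four values, with the first value's length in the two lowest bits, followed by each delta in 1 to 4 little-endian bytes. The names BytesNeeded, EncodeGroupVarInt and GroupVarIntEncodeExample are hypothetical, and the bit order is taken from the decoder above rather than from the slides, so treat it as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bytes (1..4) needed to store v.
inline int BytesNeeded(uint32_t v) {
  if (v < (1u << 8)) return 1;
  if (v < (1u << 16)) return 2;
  if (v < (1u << 24)) return 3;
  return 4;
}

// Appends `values` to `out` as delta-encoded Group VarInt groups.
// values.size() must be a multiple of 4. Returns the number of bytes written
// excluding padding (the `size` to hand to the decoder); 3 zero bytes are
// appended afterwards because the decoder loads 4 bytes at a time.
inline size_t EncodeGroupVarInt(const std::vector<uint32_t>& values,
                                std::vector<uint8_t>& out) {
  assert(values.size() % 4 == 0);
  const size_t start = out.size();
  uint32_t previous = 0;
  for (size_t i = 0; i < values.size(); i += 4) {
    const size_t selector_pos = out.size();
    out.push_back(0);  // Placeholder, filled in once all four lengths are known.
    uint8_t selector = 0;
    for (int j = 0; j < 4; ++j) {
      const uint32_t delta = values[i + j] - previous;  // Wraps like the decoder's +=.
      previous = values[i + j];
      const int len = BytesNeeded(delta);
      selector |= static_cast<uint8_t>((len - 1) << (2 * j));
      for (int b = 0; b < len; ++b) {
        out.push_back(static_cast<uint8_t>(delta >> (8 * b)));
      }
    }
    out[selector_pos] = selector;
  }
  const size_t written = out.size() - start;
  out.resize(out.size() + 3);  // Zero padding for the decoder's 4-byte loads.
  return written;
}

// Usage sketch: the deltas are 1, 15, 511 and 131071, needing 1, 1, 2 and 3 bytes.
inline void GroupVarIntEncodeExample() {
  std::vector<uint8_t> out;
  const size_t size = EncodeGroupVarInt({1, 16, 527, 131598}, out);
  const std::vector<uint8_t> expected = {0x90, 0x01, 0x0F, 0xFF, 0x01,
                                         0xFF, 0xFF, 0x01, 0, 0, 0};
  assert(size == 8 && out == expected);
  (void)size;
}
```

A real implementation would also need a way to handle sequences whose length is not a multiple of four, for example by storing the count separately and padding the last group with zero deltas.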

Answer 2

I was looking for the same thing and found this GitHub project in Java: https://github.com/stuhood/gvi/ It looks promising!

Answer 3

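In C/C++, instead of decoding with bit masks, you can use predefined structures that correspond to the value of the first byte. A complete example that uses this approach: http://www.oschina.net/code/snippet_12_5083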
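I have not reproduced the linked snippet here, but the general idea can be sketched as a 256-entry table indexed by the selector byte, where each entry records where the four values start and how to mask them. Everything below (GroupLayout, BuildLayoutTable, DecodeGroup, DecodeGroupExample) is a hypothetical illustration, assuming the first value's length sits in the two lowest selector bits, a little-endian host, and at least 3 readable bytes after the group.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Per-selector layout: byte offset of each value (relative to the byte after
// the selector), the mask that keeps only its bytes, and the payload length.
struct GroupLayout {
  uint8_t offset[4];
  uint32_t mask[4];
  uint8_t total;
};

// Builds the 256-entry table; entry s describes selector byte s.
inline std::array<GroupLayout, 256> BuildLayoutTable() {
  static const uint32_t kMask[4] = {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF};
  std::array<GroupLayout, 256> table{};
  for (int s = 0; s < 256; ++s) {
    uint8_t pos = 0;
    for (int j = 0; j < 4; ++j) {
      const int len = ((s >> (2 * j)) & 3) + 1;
      table[s].offset[j] = pos;
      table[s].mask[j] = kMask[len - 1];
      pos = static_cast<uint8_t>(pos + len);
    }
    table[s].total = pos;
  }
  return table;
}

// Decodes one group starting at `in` into out[0..3] with no per-value
// branching. Returns the number of bytes consumed, selector included.
inline size_t DecodeGroup(const uint8_t* in, uint32_t out[4]) {
  static const std::array<GroupLayout, 256> kTable = BuildLayoutTable();
  const GroupLayout& layout = kTable[in[0]];
  for (int j = 0; j < 4; ++j) {
    uint32_t word;
    std::memcpy(&word, in + 1 + layout.offset[j], sizeof(word));  // Unaligned-safe load.
    out[j] = word & layout.mask[j];
  }
  return 1 + static_cast<size_t>(layout.total);
}

// Usage sketch: selector 0x90 means lengths 1, 1, 2 and 3 bytes.
inline void DecodeGroupExample() {
  const uint8_t data[] = {0x90, 0x01, 0x0F, 0xFF, 0x01, 0xFF, 0xFF, 0x01, 0, 0, 0};
  uint32_t values[4];
  const size_t used = DecodeGroup(data, values);
  assert(used == 8);
  assert(values[0] == 1 && values[1] == 15 && values[2] == 511 && values[3] == 131071);
  (void)used;
}
```

This sketch returns the raw stored values; if the stream holds deltas, as in the decoder from Answer 1, a running sum over the output recovers the original sequence.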

Answer 4

Another Java implementation of groupvarint: https://github.com/catenamatteo/groupvarint However, I suspect that the very large switch statement has some drawbacks in Java.

See Jeff's slides for more details on "group varint encoding/decoding".
