Question

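tl;dr: Why does key lookup in sparse_hash_map become about 50x slower for specific data?

I am testing the speed of key lookups in sparse_hash_map from Google's sparsehash library, using a very simple Cython wrapper I wrote. The hash table has uint32_t keys and uint16_t values. For random keys, values, and queries I get more than 1M lookups/sec. However, for the specific data I need, performance barely exceeds 20k lookups/sec.

The wrapper is here. The table that runs slowly is here. The benchmark code is:

benchmark.pyx: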

from sparsehash cimport SparseHashMap
from libc.stdint cimport uint32_t
from libcpp.vector cimport vector
import time
import numpy as np

def fill_randomly(m, size):
    keys = np.random.random_integers(0, 0xFFFFFFFF, size)
    # 0 is a special domain-specific value
    values = np.random.random_integers(1, 0xFFFF, size)
    for j in range(size):
        m[keys[j]] = values[j]

def benchmark_get():
    cdef int dummy
    cdef uint32_t i, j, table_key
    cdef SparseHashMap m
    cdef vector[uint32_t] q_keys
    cdef int NUM_QUERIES = 1000000
    cdef uint32_t MAX_REQUEST = 7448 * 2**19 - 1  # this is domain-specific

    time_start = time.time()

    ### OPTION 1 ###
    m = SparseHashMap('17.shash')

    ### OPTION 2 ###
    # m = SparseHashMap(16130443)
    # fill_randomly(m, 16130443)

    q_keys = np.random.random_integers(0, MAX_REQUEST, NUM_QUERIES)

    print("Initialization: %.3f" % (time.time() - time_start))

    dummy = 0

    time_start = time.time()

    for i in range(NUM_QUERIES):
        table_key = q_keys[i]
        dummy += m.get(table_key)
        dummy %= 0xFFFFFF  # to prevent overflow error

    time_elapsed = time.time() - time_start

    if dummy == 42:
        # So that the unused variable is not optimized away
        print("Wow, lucky!")

    print("Table size: %d" % len(m))
    print("Total time: %.3f" % time_elapsed)
    print("Seconds per query: %.8f" % (time_elapsed / NUM_QUERIES))
    print("Queries per second: %.1f" % (NUM_QUERIES / time_elapsed))

def main():
    benchmark_get()

benchmark.pyxbld (because pyximport should compile in C++ mode):

def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    return Extension(
        name=modname,
        sources=[pyxfilename],
        language='c++'
    )

run.py:

import pyximport
pyximport.install()

import benchmark
benchmark.main()

The results for 17.shash are:

Initialization: 2.612
Table size: 16130443
Total time: 48.568
Seconds per query: 0.00004857
Queries per second: 20589.8

And for random data:

Initialization: 25.853
Table size: 16100260
Total time: 0.891
Seconds per query: 0.00000089
Queries per second: 1122356.3

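The distribution of keys in 17.shash, plotted with plt.hist(np.fromiter(m.keys(), dtype=np.uint32, count=len(m)), bins=50):

From the sparsehash and gcc documentation, it appears that a trivial hash is used here (that is, x hashes to x).

Apart from hash collisions, is there anything obvious that could cause this behavior? From what I have found, integrating a custom hash function (that is, overloading std::hash<uint32_t>) into a Cython wrapper is non-trivial.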
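To make the collision hypothesis concrete, here is a minimal sketch of how an identity hash behaves in a toy open-addressing table. It assumes a power-of-two bucket count and triangular (quadratic) probing; that matches my understanding of how sparsehash probes, but treat it as an assumption. The function name average_probes and the key sets are hypothetical and are not taken from 17.shash.

```python
import random

def average_probes(keys, num_buckets, hash_fn=lambda x: x):
    """Insert distinct keys into a toy open-addressing table.

    Returns the mean number of buckets inspected per insert.
    num_buckets must be a power of two and larger than len(keys).
    """
    mask = num_buckets - 1
    occupied = [False] * num_buckets
    total_probes = 0
    for key in keys:
        bucket = hash_fn(key) & mask      # home bucket: low bits of the hash
        step = 0
        total_probes += 1                 # inspecting the home bucket
        while occupied[bucket]:
            step += 1
            bucket = (bucket + step) & mask   # triangular probing
            total_probes += 1
        occupied[bucket] = True
    return total_probes / len(keys)

# Hypothetical key sets: one where all keys share their low 12 bits,
# one drawn uniformly at random from the uint32 range.
clustered_keys = [i * 4096 for i in range(500)]
random_keys = random.Random(0).sample(range(2**32), 500)

clustered_cost = average_probes(clustered_keys, 4096)
random_cost = average_probes(random_keys, 4096)
print("clustered: %.1f probes/insert" % clustered_cost)
print("random:    %.1f probes/insert" % random_cost)
```

With an identity hash, keys that agree in their low bits all start from the same home bucket, so every insert (and every lookup) has to walk past the earlier ones. Whether the real keys have this kind of structure is something only the histogram or a similar check on the actual data can confirm.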

Answer 1

I found a solution that works, but it isn't pretty.

sparsehash_wrapper.cpp:

#include "sparsehash/sparse_hash_map"
#include "stdint.h"

// syntax borrowed from
// https://stackoverflow.com/questions/14094104/google-sparse-hash-with-murmur-hash-function

struct UInt32Hasher {
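    size_t operator()(const uint32_t& x) const {
        // Note: '+' binds tighter than '^', so the constant is added to
        // (x >> 13) before the XORs; the parentheses make that explicit.
        return x ^ (x << 17) ^ ((x >> 13) + 3238229671);
    }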
};

template<class Key, class T>
class sparse_hash_map : public google::sparse_hash_map<Key, T, UInt32Hasher> {};

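This is a custom hash function that I could integrate into the existing wrapper with minimal code changes: I only had to replace sparsehash/sparse_hash_map with the path to sparsehash_wrapper.cpp in the Cython .pxd file. So far, the only problem is that pyximport cannot find sparsehash_wrapper.cpp unless I specify the full absolute path in the .pxd.

The problem was indeed collisions: after recreating a hash map with the same content as 17.shash from scratch (that is, creating an empty map and inserting every (key, value) pair from 17.shash into it), performance went up to 1M+ req/sec.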
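As a rough illustration of why the custom hasher helps, the sketch below ports UInt32Hasher to Python and compares how many keys land in the same home bucket under the identity hash and under the mixing hash. The port masks the left shift to 32 bits to mimic uint32_t arithmetic. The key set (multiples of 1024) and the bucket count are hypothetical, chosen only to show the effect; they are not the contents of 17.shash, and max_bucket_load is a helper invented for this example.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF

def identity_hash(x):
    # The trivial hash the question describes: x hashes to x.
    return x

def uint32_hasher(x):
    # Python port of UInt32Hasher from the answer. As in the C++ code,
    # the constant is added to (x >> 13) before the XORs are applied.
    return x ^ ((x << 17) & MASK32) ^ ((x >> 13) + 3238229671)

def max_bucket_load(keys, num_buckets, hash_fn):
    """Largest number of keys sharing one home bucket (power-of-two table)."""
    mask = num_buckets - 1
    counts = Counter(hash_fn(k) & mask for k in keys)
    return max(counts.values())

# Hypothetical keys that all share their low 10 bits.
keys = [i * 1024 for i in range(1000)]

identity_load = max_bucket_load(keys, 1024, identity_hash)
mixed_load = max_bucket_load(keys, 1024, uint32_hasher)
print("identity hash, worst bucket:", identity_load)
print("mixing hash, worst bucket:  ", mixed_load)
```

The mixing hash folds higher bits of the key into the low bits that select the bucket, so keys that differ only in their upper bits no longer pile into a single bucket. It is a simple mixer rather than a strong hash, so some structured inputs can still collide more than random data would.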

sparse_hash_map is very slow for specific data