Question

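I am using this hash function, but I am getting a lot of collisions. The aim is to add up the ASCII values of the characters and output that sum. Is there any way to optimize this function, or another function I could use, to reduce the number of collisions?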

int hash(char* s)
{
    int hash = 0;
    while(*s)
    {
        hash = hash + *s;
        s++;
    }
    return hash;
}

Answer 1

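A 32-bit int can take over 4 billion distinct values. (If your int is 64 bits, the range is far larger.) But your code merely adds up the value of each character in the string, so the result never gets anywhere near the upper end of that range. A 10-character ASCII string, for example, can sum to at most 1270. All your hash codes will be small numbers crowded into the low end of the possible values, which increases the chance of collisions.

That is why a good algorithm is more complicated than this.

Here's one article that turned up in a quick Google search.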
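As one illustration of what "more complicated" can look like, here is a minimal sketch of the 32-bit FNV-1a hash. It is not necessarily the algorithm from the article mentioned above; the function name fnv1a_hash is chosen here for illustration. Each byte is XORed into the hash and the result is multiplied by a prime, so the values spread over the whole 32-bit range instead of staying near zero.

```c
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
   The name fnv1a_hash is illustrative and does not come from the question. */
uint32_t fnv1a_hash(const char *s)
{
    uint32_t hash = 2166136261u;        /* FNV offset basis */
    while (*s)
    {
        hash ^= (unsigned char)*s;      /* mix in the next byte */
        hash *= 16777619u;              /* FNV prime; wraps modulo 2^32 */
        s++;
    }
    return hash;
}
```

For a hash table, the result would typically be reduced with hash % table_size.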

Answer 2

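Do "foo bar" and "bar foo" hash to the same value? Try implementing it so that both the ASCII value of each character and its position in the string are used to calculate the hash. I naively imagine this will significantly reduce collisions.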

int hash(char* s)
{
    int hash = 0;
    int pos = 0;
    while(*s)
    {
        pos++;
        hash += (*s * pos);
        s++;
    }
    return hash;
}

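Give it a try and see if it helps. I don't have much theoretical knowledge to back up this answer.

EDIT: As mentioned below, you probably want hash to be an unsigned int. I tested this on codechef.com; here is the source code and the results: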

#include <stdio.h>

unsigned int hash(char* s);
unsigned int hash2(char* s);

int main(void) {
    unsigned int temp1 = hash("foo bar");
    unsigned int temp2 = hash("bar foo");

    printf("temp1 is %u and temp2 is %u\n",temp1, temp2);

    temp1 = hash2("foo bar");
    temp2 = hash2("bar foo");

    printf("temp1 is %u and temp2 is %u\n",temp1, temp2);

    return 0;
}

unsigned int hash(char* s)
{
    unsigned int hash = 0;
    while(*s)
    {
        hash = hash + *s;
        s++;
    }
    return hash;
}

unsigned int hash2(char* s)
{
    unsigned int hash = 0;
    int pos = 0;
    while(*s)
    {
        pos++;
        hash += (*s * pos);
        s++;
    }
    return hash;
}

Output:

temp1 is 665 and temp2 is 665

temp1 is 2655 and temp2 is 2715
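A caveat worth checking, sketched under the assumption of ASCII input: weighting by position separates anagrams, but the results are still small numbers, so unrelated strings can still collide. The pair "ad" and "cc" below was picked by hand for illustration: 97*1 + 100*2 and 99*1 + 99*2 both equal 297. The function is the position-weighted hash from above, repeated (with a const parameter) so the block compiles on its own.

```c
/* Position-weighted hash, reproduced from the answer above. */
unsigned int hash2(const char *s)
{
    unsigned int hash = 0;
    unsigned int pos = 0;
    while (*s)
    {
        pos++;
        hash += (unsigned char)*s * pos;   /* character value times 1-based position */
        s++;
    }
    return hash;
}

/* Returns 1 when two strings collide under hash2, 0 otherwise. */
int collides(const char *a, const char *b)
{
    return hash2(a) == hash2(b);
}
```

Here collides("foo bar", "bar foo") is 0, while collides("ad", "cc") is 1.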

Answer 3

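Yes, your "hash" function will produce collisions for strings that consist of the same letters, such as "rail safety" and "fairy tales". This is because you use only addition, which is commutative.

You could use something like this, which involves a prime number as a factor: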

unsigned long int hashBetter(const char* s)
{
    unsigned long int hash = 1234567890ul;
    while(*s)
    {
        hash = (*s + hash) * 4294967291ul;
        s++;
    }
    return hash;
}
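One portability note on the function above: unsigned long is 32 bits on some platforms and 64 bits on others, so the same string can hash to different values depending on where the code is compiled. A minimal fixed-width sketch of the same idea, assuming <stdint.h> is available (the name hash_better32 is hypothetical):

```c
#include <stdint.h>

/* Same scheme as hashBetter, pinned to 32 bits so the result does not depend
   on the platform's unsigned long width. hash_better32 is an illustrative name. */
uint32_t hash_better32(const char *s)
{
    uint32_t hash = 1234567890u;
    while (*s)
    {
        /* 4294967291 is the prime factor used above; the product wraps modulo 2^32 */
        hash = ((unsigned char)*s + hash) * 4294967291u;
        s++;
    }
    return hash;
}
```

Because each character is multiplied in at a different step, the order of the characters matters, and the anagram pair "rail safety" and "fairy tales" no longer hashes to the same value.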

Or you could involve a CRC, which spreads the input data widely over the available range of possible hash values:

unsigned long int hashGood(const char* s)
{
    unsigned long int hash = 1234567890ul;
    while(*s)
    {
        hash = crc(hash, *s);
        s++;
    }
    return hash;
}
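The crc function is not defined in the snippet above. As a sketch of what it might look like, the step function below implements the common bitwise CRC-32 (reflected polynomial 0xEDB88320). The choice of this particular CRC and the names crc32_step and hash_good32 are assumptions made for illustration; a table-driven version would be faster.

```c
#include <stdint.h>

/* Hypothetical stand-in for crc(): feeds one byte into a running CRC-32 value
   using the reflected polynomial 0xEDB88320, one bit at a time. */
uint32_t crc32_step(uint32_t crc, unsigned char byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
    {
        if (crc & 1u)
            crc = (crc >> 1) ^ 0xEDB88320u;   /* low bit set: shift and apply polynomial */
        else
            crc >>= 1;                        /* low bit clear: shift only */
    }
    return crc;
}

/* Fixed-width variant of hashGood built on crc32_step, with the same seed. */
uint32_t hash_good32(const char *s)
{
    uint32_t hash = 1234567890u;
    while (*s)
    {
        hash = crc32_step(hash, (unsigned char)*s);
        s++;
    }
    return hash;
}
```

Starting from 0xFFFFFFFF instead of the seed and inverting the final value gives the standard CRC-32, which returns 0xCBF43926 for the string "123456789".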
