Hash Algorithm Study Notes

License: CC 4.0 BY-SA

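A hash table is a data structure for fast lookup and is widely used wherever performance matters. Ideally a lookup costs O(1), but hash collisions slow it down: when many keys land in the same slot, a lookup degrades toward a linear scan of the colliding entries. Common ways to resolve collisions include open addressing, a shared overflow area, and separate chaining (linked lists). The performance of a hash table therefore depends heavily on the quality of the hash function applied to its keys: a good hash function makes the values of hash(key) as distinct as possible, so that lookups stay close to O(1).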
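To make the chaining idea concrete, here is a minimal sketch of a hash table that resolves collisions with linked lists. The class name ChainedTable and its methods are hypothetical, the bucket count is fixed, and there is no resizing, so treat it as an illustration rather than a usable container.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical minimal hash table using separate chaining (fixed bucket count, no resizing). */
final class ChainedTable {
    private static final class Entry {
        final String key;
        int value;
        Entry(String key, int value) { this.key = key; this.value = value; }
    }

    private final List<LinkedList<Entry>> buckets = new ArrayList<>();

    ChainedTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // floorMod keeps the index non-negative even when hashCode() is negative.
    private LinkedList<Entry> bucketFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    void put(String key, int value) {
        LinkedList<Entry> chain = bucketFor(key);
        for (Entry e : chain) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        chain.add(new Entry(key, value)); // colliding keys share one chain
    }

    Integer get(String key) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null; // key not present
    }

    // Number of entries that a lookup for this key may have to scan.
    int chainLength(String key) {
        return bucketFor(key).size();
    }
}
```

The strings "Aa" and "BB" have the same Java hash code (2112), so they always end up in the same chain. Both can still be found, but each lookup has to walk a chain of length 2 instead of 1.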

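I. String Hash Algorithms

1.1 The string hash algorithm in Java (BKDRHash)

Every Java object has a hashCode() method. Its main purpose is to let the object work with hash-based containers by supplying an int value that the container can hash on. Objects that are not equal should, as far as possible, return different hash codes (this is desirable rather than mandatory), so that the container does not run into a large number of collisions. Below is the core of the hashCode() implementation of Java's String class: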

for (int i = 0; i < len; i++) {
    h = 31 * h + val[i];
}

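For a string s[0..n-1] of length n, the resulting hash code is (^ denotes exponentiation, and the arithmetic wraps around at 32 bits):

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Note: because int arithmetic overflows and wraps around, the computed hash code may be negative.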
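The recurrence and the negative-value caveat can be checked with a short sketch. The class and method names are illustrative only. The bucketIndex helper is an assumption about how a caller might map a hash code to a bucket; it is not taken from the JDK.

```java
/** Sketch of the 31-based polynomial hash; names are illustrative. */
final class PolyHash31 {
    // Same recurrence as the loop above: h = 31*h + c, with int wraparound.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Maps a possibly negative hash code to a bucket in [0, buckets).
    // Plain % would return a negative index for a negative hash code.
    static int bucketIndex(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        String s = "hello, hash table";
        System.out.println(hash(s) == s.hashCode()); // true
        System.out.println(hash("abc"));             // 96354
        System.out.println(bucketIndex(-1, 16));     // 15
    }
}
```

Because a hash code can be negative, code that derives an array index from it should use something like Math.floorMod or a bit mask rather than the % operator alone.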

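1.2 The time33 hash algorithm

The time33 algorithm repeatedly multiplies the running hash value by 33 and adds the next character (Apache, PHP, and Perl all use this algorithm). It has the following form: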

unsigned long hash = 0;
for (int i = 0; i < len; i++) {
    hash = 33 * hash + str[i];
}

PHP uses 5381 as the initial value of hash.
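A minimal Java sketch of time33 with a configurable initial value follows. It assumes a 64-bit accumulator (the width of unsigned long in C depends on the platform) and relies on the fact that Java's long wraps on overflow with the same bit pattern as unsigned arithmetic. The names Times33, hash, and hashShift are made up for this example.

```java
/** Sketch of the time33 hash; class and method names are illustrative. */
final class Times33 {
    static final long PHP_SEED = 5381L; // initial value mentioned above

    // hash = 33*hash + c for each character, wrapping at 64 bits.
    static long hash(String s, long seed) {
        long h = seed;
        for (int i = 0; i < s.length(); i++) {
            h = 33 * h + s.charAt(i);
        }
        return h;
    }

    // Equivalent shift-and-add form: 33*h == (h << 5) + h.
    static long hashShift(String s, long seed) {
        long h = seed;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 5) + h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("ab", 0));       // 3299 = (97*33) + 98
        System.out.println(hash("a", PHP_SEED)); // 177670 = 5381*33 + 97
    }
}
```

The shift-and-add form is the reason 33 is a popular multiplier: the multiplication reduces to one shift and one addition.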

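1.3 The string hash algorithm from Google (CityHash)

Download: http://code.google.com/p/cityhash/

This family of functions hashes a string (a byte buffer) to an unsigned 32-bit, 64-bit, or 128-bit value.

Part of its interface is shown below:

uint64 CityHash64(const char *buf, size_t len);
uint128 CityHash128(const char *s, size_t len);
uint32 CityHash32(const char *buf, size_t len);

Reportedly, the algorithm takes advantage of certain features of modern CPUs, which makes computing the hash value fairly fast.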

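1.4 The string hash algorithm in the STL

hash_map in the STL (a non-standard extension from the SGI STL) is built on a hashtable rather than a red-black tree. Its default hash for string keys is: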

template <class _CharT, class _Traits, class _Alloc>
struct hash<basic_string<_CharT,_Traits,_Alloc> > {
    size_t operator()(const basic_string<_CharT,_Traits,_Alloc>& __s) const
        { return __stl_string_hash(__s); }
};

The function __stl_string_hash is defined as follows:

template <class _CharT, class _Traits, class _Alloc>
size_t __stl_string_hash(const basic_string<_CharT,_Traits,_Alloc>& __s) {
    unsigned long __h = 0;
    // typename is required because const_iterator is a dependent type
    for (typename basic_string<_CharT,_Traits,_Alloc>::const_iterator __i = __s.begin();
         __i != __s.end();
         ++__i)
        __h = 5*__h + *__i;
    return size_t(__h);
}

As the code above shows, the STL's default string hash starts from 0 and is computed as:

hash[i] = 5*hash[i-1] + str[i]
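The same recurrence can be sketched in Java to see how it behaves. This is an illustration under two assumptions: the input is plain ASCII (in C++, char may be signed, so bytes of 128 and above could contribute differently), and a 64-bit accumulator stands in for unsigned long. The class name StlStyleHash is hypothetical.

```java
/** Sketch of the multiplier-5 string hash; the class name is illustrative. */
final class StlStyleHash {
    // hash[i] = 5*hash[i-1] + str[i], starting from 0, wrapping at 64 bits.
    static long hash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 5 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("ab")); // 583 = 5*97 + 98
        System.out.println(hash("b]")); // 583 = 5*98 + 93
    }
}
```

For a two-character string the hash is 5*c1 + c2, so raising the first character by 1 and lowering the second by 5 yields the same value, as "ab" and "b]" show. Short strings that are close to each other can therefore collide easily with such a small multiplier.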

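1.5 The FNV hash algorithm

This algorithm works well on strings that are very similar to one another (such as URLs and IP addresses) and keeps the collision rate low. It comes in two variants, FNV1 and FNV1a; the hashing procedure of each is shown below.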

FNV1:

hash = offset_basis
for each octet_of_data to be hashed
    hash = hash * FNV_prime
    hash = hash xor octet_of_data
return hash

FNV1a:

hash = offset_basis
for each octet_of_data to be hashed
    hash = hash xor octet_of_data
    hash = hash * FNV_prime
return hash

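The two parameters offset_basis and FNV_prime take different values depending on the bit width of the hash being generated. Their values for each width are listed below.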

Values of FNV_prime:

32 bit FNV_prime = 2^24 + 2^8 + 0x93 = 16777619
64 bit FNV_prime = 2^40 + 2^8 + 0xb3 = 1099511628211
128 bit FNV_prime = 2^88 + 2^8 + 0x3b = 309485009821345068724781371

Values of offset_basis:

32 bit offset_basis = 2166136261

64 bit offset_basis = 14695981039346656037

128 bit offset_basis = 144066263297769815596495629667062367629

A more detailed description of the algorithm is available at: http://www.isthe.com/chongo/tech/comp/fnv/
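The pseudocode and the 32-bit and 64-bit constants above translate into Java roughly as follows. This is a sketch: the class and method names are my own, and Java's signed int and long are used because multiplication and XOR produce the same bit patterns as unsigned arithmetic of the same width.

```java
/** Sketch of 32-bit FNV1/FNV1a and 64-bit FNV1a; names are illustrative. */
final class Fnv {
    static final int FNV32_OFFSET_BASIS = 0x811c9dc5;           // 2166136261
    static final int FNV32_PRIME = 0x01000193;                  // 16777619
    static final long FNV64_OFFSET_BASIS = 0xcbf29ce484222325L; // 14695981039346656037
    static final long FNV64_PRIME = 0x100000001b3L;             // 1099511628211

    // FNV1: multiply first, then xor in the octet.
    static int fnv1_32(byte[] data) {
        int hash = FNV32_OFFSET_BASIS;
        for (byte b : data) {
            hash *= FNV32_PRIME;
            hash ^= (b & 0xff); // mask so the byte is treated as unsigned
        }
        return hash;
    }

    // FNV1a: xor in the octet first, then multiply.
    static int fnv1a32(byte[] data) {
        int hash = FNV32_OFFSET_BASIS;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= FNV32_PRIME;
        }
        return hash;
    }

    // 64-bit FNV1a, same structure with the 64-bit constants.
    static long fnv1a64(byte[] data) {
        long hash = FNV64_OFFSET_BASIS;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= FNV64_PRIME;
        }
        return hash;
    }

    public static void main(String[] args) {
        byte[] a = {'a'};
        System.out.println(Integer.toHexString(fnv1a32(a))); // e40c292c
        System.out.println(Integer.toHexString(fnv1_32(a))); // 50c5d7e
        System.out.println(Long.toHexString(fnv1a64(a)));    // af63dc4c8601ec8c
    }
}
```

Use Integer.toUnsignedString or Long.toUnsignedString when the result should be printed as an unsigned decimal number.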

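II. Integer Hash Algorithms

2.1 The hash algorithm for int values in Java

Because some user-defined classes implement hashCode() poorly, HashMap does not use an object's hash code directly. After obtaining the hash code, it applies a further hash step and only then works out which bucket the object goes into (HashMap resolves collisions by chaining). Below is the core of HashMap's hash function as found in older JDKs (Java 8 and later use the simpler h ^ (h >>> 16)):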

static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
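The sketch below suggests why this extra step helps. It assumes that the bucket is chosen as h & (length - 1) with a power-of-two table length, which is my understanding of the JDK versions that contain this function; the class name SupplementalHash is hypothetical.

```java
/** Sketch showing the effect of the supplemental hash; names are illustrative. */
final class SupplementalHash {
    // The supplemental hash function shown above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Assumed bucket selection: only the low bits of h are used.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        // These hash codes differ only in bits above the low 4 bits.
        int[] codes = {0x10000, 0x20000, 0x30000};
        for (int c : codes) {
            System.out.println(indexFor(c, 16) + " -> " + indexFor(hash(c), 16));
        }
        // prints: 0 -> 1, 0 -> 2, 0 -> 3 (one pair per line)
    }
}
```

Without the extra step all three hash codes fall into bucket 0 of a 16-bucket table, because their low 4 bits are identical. The shifts fold the higher bits into the low bits, so the three values end up in different buckets.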

2.2 The Wang/Jenkins hash algorithm

1. Unsigned 64-bit integer version

uint64_t hash(uint64_t key) {
    key = (~key) + (key << 21);            // key = (key << 21) - key - 1;
    key = key ^ (key >> 24);
    key = (key + (key << 3)) + (key << 8); // key * 265
    key = key ^ (key >> 14);
    key = (key + (key << 2)) + (key << 4); // key * 21
    key = key ^ (key >> 28);
    key = key + (key << 31);
    return key;
}
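A Java port of the 64-bit version can be sketched as follows. Java has no unsigned 64-bit primitive, so the sketch assumes that long combined with the logical shift >>> reproduces the same bit patterns. The second method spells out the multiplications named in the comments of the C code; both method names are hypothetical.

```java
/** Sketch of the 64-bit Wang/Jenkins integer hash in Java; names are illustrative. */
final class WangHash64 {
    // Line-by-line port; >>> is the logical (unsigned) right shift.
    static long hash(long key) {
        key = (~key) + (key << 21);
        key = key ^ (key >>> 24);
        key = (key + (key << 3)) + (key << 8);
        key = key ^ (key >>> 14);
        key = (key + (key << 2)) + (key << 4);
        key = key ^ (key >>> 28);
        key = key + (key << 31);
        return key;
    }

    // Same steps with the shift-and-add groups written as multiplications.
    static long hashMul(long key) {
        key = (key << 21) - key - 1;
        key ^= key >>> 24;
        key *= 265; // 1 + 8 + 256
        key ^= key >>> 14;
        key *= 21;  // 1 + 4 + 16
        key ^= key >>> 28;
        key += key << 31;
        return key;
    }

    public static void main(String[] args) {
        for (long k = 0; k < 4; k++) {
            System.out.println(Long.toHexString(hash(k)));
        }
    }
}
```

Each step is either a multiplication by an odd constant (plus a constant) or an xor with a right-shifted copy of the value, and both kinds of step are reversible modulo 2^64. As far as I can tell, this means two different keys never produce the same 64-bit result.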

2. The version used in Java's ConcurrentHashMap

private static int hash(int h) {
    // Spread bits to regularize both segment and index locations,
    // using variant of single-word Wang/Jenkins hash.
    h += (h << 15) ^ 0xffffcd7d;
    h ^= (h >>> 10);
    h += (h << 3);
    h ^= (h >>> 6);
    h += (h << 2) + (h << 14);
    return h ^ (h >>> 16);
}
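The comment in the code mentions both segment and index locations. The sketch below shows how one spread hash value could serve both purposes. The constants are assumptions on my part: they correspond to a map with 16 segments, where the top 4 bits select the segment and the low bits select the bucket inside that segment's table. The class and helper names are hypothetical.

```java
/** Sketch of segment and bucket selection from one hash value; names are illustrative. */
final class SegmentSpread {
    // The hash function shown above.
    static int hash(int h) {
        h += (h << 15) ^ 0xffffcd7d;
        h ^= (h >>> 10);
        h += (h << 3);
        h ^= (h >>> 6);
        h += (h << 2) + (h << 14);
        return h ^ (h >>> 16);
    }

    // Assumed layout for 16 segments: 32 - 4 = 28, mask = 16 - 1.
    static final int SEGMENT_SHIFT = 28;
    static final int SEGMENT_MASK = 15;

    // The high bits of the hash choose the segment.
    static int segmentIndex(int hash) {
        return (hash >>> SEGMENT_SHIFT) & SEGMENT_MASK;
    }

    // The low bits choose the bucket; tableLength must be a power of two.
    static int bucketIndex(int hash, int tableLength) {
        return hash & (tableLength - 1);
    }

    public static void main(String[] args) {
        for (int key = 0; key < 8; key++) {
            int h = hash(key);
            System.out.println(key + ": segment " + segmentIndex(h)
                    + ", bucket " + bucketIndex(h, 16));
        }
    }
}
```

Since the segment comes from the top bits and the bucket from the bottom bits, the hash has to mix the input well at both ends of the word, which is what the comment about regularizing segment and index locations refers to.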