Bloom Filter: Time + Space

Latest recommended article published on 2025-01-20 16:40:30

_从未止步

License: CC 4.0 BY-SA

Category: Some of My Small Studies. Tags: hash, bitmap, Bloom filter, efficient lookup + efficient space

Article link: https://blog.youkuaiyun.com/zr1076311296/article/details/51395385

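A Bloom filter combines the fast lookup of a hash table with the space efficiency of a bitmap. It uses several hash functions to reduce the impact of collisions and answers whether an element may be present. It saves a large amount of storage when handling big data; Google, for example, uses one to block spam email addresses. However, the error rate rises as more bits are set to 1, so the trade-off between space and false positives has to be weighed.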

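I have been studying hash tables recently and wrote both a closed-hashing (open addressing) and an open-hashing (separate chaining) implementation. Hash table lookups are indeed fast, but I found their space efficiency rather low: my guess is that a decent hash table reaches about 50% space utilization.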

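The STL has a data structure called bitset; personally I think bitmap is the more descriptive name (I won't dwell on that here). A bitmap is a very space-efficient data structure and can be extremely useful when processing certain kinds of large data sets.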
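To make the space argument concrete, here is a minimal sketch built on `std::bitset`. The names `kMaxValue`, `SeenSet`, `mark_seen`, and `was_seen` are made up for this illustration and are not part of the project described in this post.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical upper bound on the values we track (an assumption for this sketch).
constexpr std::size_t kMaxValue = 1000000;

// One bit per possible value: about 125 KB for one million values,
// versus about 4 MB for an array of one million 32-bit integers.
using SeenSet = std::bitset<kMaxValue>;

// Record that value was seen. Returns false if value is out of range.
bool mark_seen(SeenSet& seen, std::size_t value)
{
    if (value >= kMaxValue) return false;
    seen.set(value);
    return true;
}

// Report whether value was recorded earlier.
bool was_seen(const SeenSet& seen, std::size_t value)
{
    return value < kMaxValue && seen.test(value);
}
```

A plain bitmap like this only works when the keys are small integers. Once the keys are strings, something has to map them to bit positions, which is where hashing comes in.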

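You can probably see where this is going: one structure is fast, the other saves space, so let's put them together. That is the Bloom filter, a data structure that is efficient in both time and space. The design works like this: k hash functions each map the same key to an index, and every one of those indices is set to 1 in the bitmap. This greatly reduces the effect of hash collisions (which can never be avoided entirely). How so? To check whether a key exists, we run the same series of hash functions to get the indices and check whether the corresponding bits are 1. The conclusions are: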

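1. If any hash function yields an index whose bit is not 1, the key has definitely not appeared.

2. If every hash function yields an index whose bit is 1, we can infer that the key has very probably appeared. It is not certain, because other keys may hash to the same indices and set those bits.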
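The two rules can be sketched in a few lines. This is an illustration under assumptions, not the implementation presented later in this post: the class `TinyBloom` is hypothetical, and its k index functions are simulated by salting `std::hash` with the function number rather than using separate string hash functions.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal illustration only: k index functions are simulated by salting
// std::hash with the function number (an assumption of this sketch).
class TinyBloom
{
public:
    TinyBloom(std::size_t bits, std::size_t k) : bits_(bits, false), k_(k) {}

    // Set the k bits selected for this key.
    void add(const std::string& key)
    {
        for (std::size_t i = 0; i < k_; ++i)
            bits_[index(key, i)] = true;
    }

    // false means "definitely never added"; true means "probably added".
    bool maybe_contains(const std::string& key) const
    {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[index(key, i)])
                return false;   // rule 1: a single zero bit rules the key out
        return true;            // rule 2: all bits set, possibly by other keys
    }

private:
    // i-th index for key: hash the key together with the function number.
    std::size_t index(const std::string& key, std::size_t i) const
    {
        return std::hash<std::string>{}(key + '#' + std::to_string(i)) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

A key that was added always answers true, since `add` set exactly the bits that `maybe_contains` checks. A key that was never added usually answers false, but may answer true when other keys happen to cover all of its bits.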

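Bloom filters are quite widely used. Google, for example, uses a Bloom filter to block spam email addresses, because a hash table alone cannot meet the need. To illustrate: storing one billion email addresses takes roughly 16 GB, more memory than an ordinary computer has, whereas a Bloom filter handles the problem well. Bloom filters do have a drawback, though: when the number of bits set to 1 approaches the total number of bits, false positives become likely, so estimate the space you need before using one.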
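That space estimate can be done with the standard Bloom filter approximations. The sketch below assumes independent, uniformly distributed hash functions; the function names are my own, with m the number of bits, n the number of inserted keys, k the number of hash functions, and p the target false-positive rate.

```cpp
#include <cmath>

// Estimated false-positive rate for m bits, n inserted keys, and k hash
// functions, using the standard approximation (1 - e^(-k*n/m))^k.
double false_positive_rate(double m, double n, double k)
{
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Bits needed to hold n keys at target false-positive rate p,
// assuming the optimal number of hash functions is used.
double required_bits(double n, double p)
{
    const double ln2 = std::log(2.0);
    return -n * std::log(p) / (ln2 * ln2);
}

// Number of hash functions that minimizes the false-positive rate
// for m bits and n keys.
double optimal_hash_count(double m, double n)
{
    return (m / n) * std::log(2.0);
}
```

Under these approximations, one billion keys at a 1% target rate need about 9.6 bits per key, roughly 1.2 GB in total, with about 7 hash functions. With 5 hash functions and 10 bits per key, the estimated rate is a little under 1%.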

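I implemented a simple Bloom filter that checks whether a string has been added to it. I use 5 hash functions to generate the indices; all of them are tested, reasonably efficient string hash functions. My project is organized as follows:

It contains three header files, BitMap, BloomFilter, and commom: BitMap implements a bitmap, commom holds reusable string hash functions, and BloomFilter implements the Bloom filter: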

BitMap.h

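#pragma once
#include <cstddef>
#include <cstdint>
#include <vector>

// A fixed-size bitmap stored as an array of 32-bit words.
class BitMap
{
public:
	// size is the number of bits; allocate enough 32-bit words, all zero.
	explicit BitMap(size_t size)
		: _array((size >> 5) + 1)
	{}

	// Set bit num to 1.
	void Set(size_t num)
	{
		size_t index = num >> 5;   // which word
		size_t pos = num % 32;     // which bit inside that word
		_array[index] |= (std::uint32_t(1) << pos);
	}

	// Clear bit num to 0.
	void Reset(size_t num)
	{
		size_t index = num >> 5;
		size_t pos = num % 32;
		_array[index] &= ~(std::uint32_t(1) << pos);
	}

	// Return true if bit num is 1.
	bool Test(size_t num) const
	{
		size_t index = num >> 5;
		size_t pos = num % 32;
		return (_array[index] & (std::uint32_t(1) << pos)) != 0;
	}

private:
	// 32-bit words, so that every bit of each element is actually used.
	std::vector<std::uint32_t> _array;
};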
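The index arithmetic behind Set, Reset, and Test can be checked in isolation. The helper names below (`word_index`, `bit_offset`, `bit_mask`) are hypothetical and exist only for this sketch; they assume 32 bits per word, as in BitMap.h.

```cpp
#include <cstddef>
#include <cstdint>

// Word that holds bit number num when each word stores 32 bits.
std::size_t word_index(std::size_t num) { return num >> 5; }   // same as num / 32

// Position of bit number num inside its word.
std::size_t bit_offset(std::size_t num) { return num % 32; }   // same as num & 31

// Mask with only that bit set. Shifting an unsigned 32-bit 1 keeps
// position 31 well defined, unlike shifting a signed int literal.
std::uint32_t bit_mask(std::size_t num)
{
    return std::uint32_t(1) << bit_offset(num);
}
```

For example, bit 70 lives in word 2 (70 / 32) at position 6 (70 % 32), so its mask is 64.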

commom.h

/*	
 *	The commom header holds shared helper functions to improve code reuse, for example common string hash functions
 *
 *created by admin-zou on 2015/5/13
 * 
*/

#pragma once
#include <cstddef>

// Prime table
const size_t HashSize = 28;
static const unsigned long PrimeList[HashSize] = {
	53ul, 97ul, 193ul, 389ul, 769ul, 1543ul, 3079ul, 6151ul, 12289ul,
	24593ul, 49157ul, 98317ul, 196613ul, 393241ul, 786433ul, 1572869ul,
	3145739ul, 6291469ul, 12582917ul, 25165843ul, 50331653ul, 100663319ul,
	201326611ul, 402653189ul, 805306457ul, 1610612741ul, 3221225473ul, 4294967291ul
};

// Return the first prime in the table that is greater than size,
// or the largest prime in the table if size exceeds all of them.
// Declared inline so the header can be included from several source files.
inline size_t GetNextPrimeNum(size_t size)
{
	for (size_t i = 0; i < HashSize; ++i)
	{
		if (size < PrimeList[i])
		{
			return PrimeList[i];
		}
	}

	return PrimeList[HashSize - 1];
}


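// String hash functions
// (the register keyword is dropped throughout: it was removed in C++17)
template<class K>
size_t BKDRHash(const K *str)
{
	size_t hash = 0;
	while (size_t ch = (size_t)*str++)
	{
		hash = hash * 131 + ch;   // 31, 131, 1313, 13131, 131313... also work as the multiplier
	}
	return hash;
}
// Functor wrapper so the hash can be passed as a template argument
template<class K>
struct hashFunc1
{
	size_t operator() (const char* key) const
	{
		return BKDRHash(key);
	}
};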

template<class K>
size_t SDBMHash(const K *str)
{
	size_t hash = 0;
	while (size_t ch = (size_t)*str++)
	{
		hash = 65599 * hash + ch;
		//hash = (size_t)ch + (hash << 6) + (hash << 16) - hash;
	}
	return hash;
}
template<class K>
struct hashFunc2
{
	size_t operator() (const char* key) const
	{
		return SDBMHash(key);
	}
};

template<class K>
size_t RSHash(const K *str)
{
	size_t hash = 0;
	size_t magic = 63689;
	while (size_t ch = (size_t)*str++)
	{
		hash = hash * magic + ch;
		magic *= 378551;
	}
	return hash;
}
template<class K>
struct hashFunc3
{
	size_t operator() (const char* key) const
	{
		return RSHash(key);
	}
};

template<class K>
size_t APHash(const K *str)
{
	size_t hash = 0;
	size_t ch;
	// Alternate between two mixing steps for even and odd character positions
	for (long i = 0; (ch = (size_t)*str++) != 0; i++)
	{
		if ((i & 1) == 0)
		{
			hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
		}
		else
		{
			hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
		}
	}
	return hash;
}
template<class K>
struct hashFunc4
{
	size_t operator() (const char* key) const
	{
		return APHash(key);
	}
};

template<class K>
size_t JSHash(const K *str)
{
	if (!*str)        // added by the author so that an empty string hashes to 0
		return 0;
	size_t hash = 1315423911;
	while (size_t ch = (size_t)*str++)
	{
		hash ^= ((hash << 5) + ch + (hash >> 2));
	}
	return hash;
}
template<class K>
struct hashFunc5
{
	size_t operator() (const char* key) const
	{
		return JSHash(key);
	}
};

Finally:

BloomFilter.h

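/*                            Bloom filter
 * 
 * A Bloom filter is a highly space-efficient probabilistic data structure that can be seen as an extension of the bitmap. It works as follows:
 * when an element is added to the set, K hash functions map it to K positions in a bit array, and those positions are set to 1. To look an element up, we only need to check whether those positions are all 1 to know (approximately) whether the set contains it:
 *    if any of those positions is 0, the element is definitely not in the set;
 *    if all of them are 1, the element is very likely in the set.
 * 
 * created by admin-zou on 2016/5/13
*/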

#pragma once
#include <cstddef>
#include <string>
#include "BitMap.h"
#include "commom.h"

// K must provide c_str() (for example std::string),
// because the hash functors take a const char*.
template <class K, 
		 class HashFuncer1 = hashFunc1<K>,
		 class HashFuncer2 = hashFunc2<K>,
		 class HashFuncer3 = hashFunc3<K>,
		 class HashFuncer4 = hashFunc4<K>,
		 class HashFuncer5 = hashFunc5<K> 
		 >
class BloomFiter
{
public:
	// Round the requested size up to a prime and allocate that many bits.
	BloomFiter(size_t size)
			:_capacity(GetNextPrimeNum(size))
			 ,_bt(_capacity)
		{}

	// Set the five bits selected by the five hash functions.
	void Add(const K& key)
	{
		_bt.Set(HashFuncer1()(key.c_str()) % _capacity);
		_bt.Set(HashFuncer2()(key.c_str()) % _capacity);
		_bt.Set(HashFuncer3()(key.c_str()) % _capacity);
		_bt.Set(HashFuncer4()(key.c_str()) % _capacity);
		_bt.Set(HashFuncer5()(key.c_str()) % _capacity);
	}

	// false: the key was definitely never added.
	// true: the key was probably added (false positives are possible).
	bool Check(const K& key)
	{
		if (!_bt.Test(HashFuncer1()(key.c_str()) % _capacity))
		{
			return false;
		}
		if (!_bt.Test(HashFuncer2()(key.c_str()) % _capacity))
		{
			return false;
		}
		if (!_bt.Test(HashFuncer3()(key.c_str()) % _capacity))
		{
			return false;
		}
		if (!_bt.Test(HashFuncer4()(key.c_str()) % _capacity))
		{
			return false;
		}
		if (!_bt.Test(HashFuncer5()(key.c_str()) % _capacity))
		{
			return false;
		}

		return true;
	}

private:
	size_t  _capacity;   // number of bits; declared before _bt so it is initialized first
	BitMap  _bt;
};

That is the Bloom filter implementation. I think we should put what we learn into practice and know how to use it in our future IT careers.

--Just do IT