Bloom Filter Brief Intro

本文深入探讨了Bloom过滤器的工作原理及其应用。通过快速、空间高效的成员资格测试,Bloom过滤器能够在不存储实际数据的情况下减少昂贵的磁盘或网络查找。本文还介绍了JavaScript实现的Bloom过滤器,并通过交互式演示帮助理解其内部机制。

Bloom Filters

Everyone is always raving about bloom filters. But what exactly are they, and what are they useful for?

Operations

The basic bloom filter supports two operations: test and add.

Test is used to check whether a given element is in the set or not. If it returns:

  • false then the element is definitely not in the set.
  • true then the element is probably in the set. The false positive rate is a function of the bloom filter's size and the number and independence of the hash functions used.

Add simply adds an element to the set. Removal is impossible without introducing false negatives, but extensions to the bloom filter are possible that allow removal e.g. counting filters.

Applications

The classic example is using bloom filters to reduce expensive disk (or network) lookups for non-existent keys.

If the element is not in the bloom filter, then we know for sure we don't need to perform the expensive lookup. On the other hand, if itis in the bloom filter, we perform the lookup, and we can expect it to fail some proportion of the time (the false positive rate).

Bloomfilter.js

I wrote a very fast bloom filter implementation in JavaScript called bloomfilter.js. It uses the non-cryptographic Fowler–Noll–Vo hash function for speed. We can get away with using a non-cryptographic hash function as we only care about having a uniform distribution of hashes.

The implementation also uses JavaScript typed arrays if possible, as these are faster when performing low-level bitwise operations.

Interactive Demonstration

Below you should see an interactive visualisation of a bloom filter, powered by bloomfilter.js.

You can add any number of elements (keys) to the filter by typing in the textbox and clicking "Add". Then use the second textbox to see if an element is probably there or definitely not!

Definitely not there.

Explanation

The bloom filter essentially consists of a bit vector of length m, represented by the central column.

To add an item to the bloom filter, we feed it to k different hash functions and set the bits at the resulting positions. In this example, I've setm to50 andk to 3. Note that sometimes the hash functions produce overlapping positions, so less thank positions may be set.

To test if an item is in the filter, again we feed it to the k hash functions. This time, we check to see if any of the bits at these positions are not set. If any are not set, it means the item is definitely not in the set. Otherwise, it is probably in the set.

Implementation Notes

Astute readers will note that I only mentioned one hash function, Fowler–Noll–Vo, but bloom filters require several. It turns out thatyou can produce k hash functions using only two to start off with. In fact, I just perform an additional round of FNV as the second hash function, and this works great. Unfortunately I can't use the 64-bit trick in the linked post as JavaScript only supports bitwise operations on 32 bits.


A Bloom filter is a data structure optimized for fast, space-efficient set membership tests. Bloom filters have the unusual property of requiring constant time to add an element to the set or test for membership, regardless of the size of the elements or the number of elements already in the set. No other constant-space set data structure has this property.

It works by storing a bit vector representing the set S' = {h[i](x) | x in S, i = 1, …, k}, where h[1], …, h[k] := {0, 1} -> [n lg(1/ε) lg e] are hash functions. Additions are simply setting k bits to 1, specifically those at h[1](x), …, h[k](x). Checks are implemented by performing those same hash functions and returning true if all of the resulting positions are 1.

Because the set stored is a proper superset of the set of items added, false positives may occur, though false negatives cannot. The false positive rate can be specified.

Bloom filters offer the following advantages:
  • Space: Approximately n * lg(1/ε), where ε is the false positive rate and n is the number of elements in the set.
    • Example: There are approximately 170k words in the English language. If we consider that to be our set (therefore n = 1.7E5), and we wish to search a corpus for them with a 1% false positive rate, the filter would require about (1.7E5 * lg(1 / 0.01)) ≈ 162 KB. Contrast this with a hashtable, which would require (1.7E5 elements * 32 bits per element) ≈ 664 KB. Obviously explicit string storage would be significantly more.
  • Precision: Arbitrary precision, where increasing precision requires more space (following the above size equation) but not more time.
    • Example: If we wanted to reduce our false positive rate in the above example from one percent to one permille the space requirement would go from about 162 KB to about 207 KB.
  • Time: O(k) where k is the number of hash functions. The optimal number of hash functions (though a different number can be supplied by the user if desired) is ceiling(lg(1/ε))
    • Example: In keeping with our above example, if the accepted false positive rate is 0.001, k = 10.

This implementation uses Dillinger & Manolios double hashing to provide all but the first two hash functions. By default the first hash function is the type's GetHashCode() method. This implementation also includes default secondary hash functions for strings ( Jenkin's "One at a time" method) and integers ( Wang's method).

Bloom filters are due to Burton H. Bloom, as described in the Communications of the ACM in July 1970. The full paper is available here.

Last edited Nov 11, 2014 at 9:38 PM by JustinRussell, version 6


See Also

【轴承故障诊断】基于融合鱼鹰和柯西变异的麻雀优化算法OCSSA-VMD-CNN-BILSTM轴承诊断研究【西储大学数据】(Matlab代码实现)内容概要:本文提出了一种基于融合鱼鹰和柯西变异的麻雀优化算法(OCSSA)优化变分模态分解(VMD)参数,并结合卷积神经网络(CNN)与双向长短期记忆网络(BiLSTM)的轴承故障诊断模型。该方法利用西储大学公开的轴承数据集进行验证,通过OCSSA算法优化VMD的分解层数K和惩罚因子α,有效提升信号分解精度,抑制模态混叠;随后利用CNN提取故障特征的空间信息,BiLSTM捕捉时间序列的动态特征,最终实现高精度的轴承故障分类。整个诊断流程充分结合了信号预处理、智能优化与深度学习的优势,显著提升了复杂工况下轴承故障诊断的准确性与鲁棒性。; 适合人群:具备一定信号处理、机器学习及MATLAB编程基础的研究生、科研人员及从事工业设备故障诊断的工程技术人员。; 使用场景及目标:①应用于旋转机械设备的智能运维与故障预警系统;②为轴承等关键部件的早期故障识别提供高精度诊断方案;③推动智能优化算法与深度学习在工业信号处理领域的融合研究。; 阅读建议:建议读者结合MATLAB代码实现,深入理解OCSSA优化机制、VMD参数选择策略以及CNN-BiLSTM网络结构的设计逻辑,通过复现实验掌握完整诊断流程,并可进一步尝试迁移至其他设备的故障诊断任务中进行验证与优化。
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值