upload the file instantly.
Naive Comparison
Upload new file
Go through all stored files
Compare each stored file with new file byte-by-byte
If an identical file is found, store a link to it instead of the new file (a sketch of this loop follows)
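A minimal sketch of this loop in Python (find_duplicate and stored_paths are illustrative names, not part of any real storage API):

    def find_duplicate(new_path, stored_paths):
        # Naive deduplication: byte-by-byte comparison with every stored file.
        with open(new_path, "rb") as f:
            new_data = f.read()
        for path in stored_paths:
            with open(path, "rb") as f:
                if f.read() == new_data:
                    return path      # identical file already stored
        return None                  # no duplicate: the new file must be stored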
Drawbacks of Naive Comparison
Have to upload the file first anyway
O(NS) to compare a file of size S with N stored files
N grows with each upload, so the total running time of all uploads grows as O(N²): the k-th upload compares against k − 1 files, and 1 + 2 + … + N sums to about N²/2
Idea: Compare Hashes
As in the Rabin-Karp algorithm, compare hashes of files first
If hashes are different, files are different
If there's a file with the same hash, upload and compare directly (a sketch of this check follows)
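A sketch of the hash-first check, assuming stored_hashes maps each stored file's path to its precomputed hash, and file_hash is any fixed hash function on bytes (both names are illustrative):

    def find_duplicate_with_hash(new_path, stored_hashes, file_hash):
        with open(new_path, "rb") as f:
            new_data = f.read()
        h = file_hash(new_data)
        for path, stored in stored_hashes.items():
            if stored != h:
                continue             # different hashes => different files
            with open(path, "rb") as f:
                if f.read() == new_data:
                    return path      # same hash and same bytes: a duplicate
        return None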
Drawbacks of Hash Comparison
There can be collisions
Still have to upload the file to compare directly
Still have to compare with all N stored files
Idea: Several Hashes
Choose several different hash functions
Polynomial hashing with different p or x
Compute all hashes for each file
If there's a file with all the same hashes, files are probably equal
Don't upload the new file in this case!
Compute hashes locally before upload
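A sketch of computing several polynomial hashes locally; the particular primes p and multipliers x below are arbitrary illustrative choices:

    PARAMS = [(1_000_000_007, 263), (1_000_000_009, 269), (998_244_353, 271)]

    def poly_hashes(data: bytes) -> tuple:
        hashes = []
        for p, x in PARAMS:
            h = 0
            for byte in data:
                h = (h * x + byte) % p   # Horner's rule, as in Rabin-Karp
            hashes.append(h)
        return tuple(hashes)             # the combined fingerprint of the file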
Problem: Collisions
Collisions can happen even for several hashes simultaneously
There are algorithms to find collisions for known hash functions
However, even for one hash function collisions are extremely rare
Using 3 or 5 hashes, you probably won't see a collision in a lifetime
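A rough back-of-envelope estimate, assuming each hash takes about 10⁹ distinct values and the hashes behave independently:

    P(two different files share one hash)  ≈ 10⁻⁹
    P(they share all 3 hashes)             ≈ (10⁻⁹)³ = 10⁻²⁷
    P(they share all 5 hashes)             ≈ (10⁻⁹)⁵ = 10⁻⁴⁵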
Problem: O(N) Comparisons
Still have to compare with N already stored files
Idea: Precompute Hashes
When a file is submitted for upload, hashes are computed anyway
Store file addresses in a hash table
Also store all the hashes there
Only need the hashes to search in the table
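A sketch of such a table, keyed by the tuple of all hashes (table, lookup and register are illustrative names; a plain dict stands in for the real index):

    table = {}                       # {(h1, h2, ..., hk): "address in storage"}

    def lookup(hashes):
        return table.get(hashes)     # address of a matching file, or None

    def register(hashes, address):
        table[hashes] = address      # remember where this file is stored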
Final Solution
Choose 3-5 hash functions
Store file addresses and hashes in a hash table
Compute the hashes of the new file locally before upload
Search for the new file in the hash table
Search is successful if all the hashes coincide
If found, don't upload the file; store a link to the existing one (the sketch below puts the steps together)
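Putting these steps together on the client side (poly_hashes, lookup and register as in the sketches above; upload_file stands for the actual storage call and is stubbed out here):

    def upload_file(path):
        return "address-of-" + path  # hypothetical upload, returns an address

    def submit(path):
        with open(path, "rb") as f:
            hashes = poly_hashes(f.read())   # computed locally, before upload
        existing = lookup(hashes)
        if existing is not None:
            return existing          # all hashes coincide: link, don't upload
        address = upload_file(path)  # genuinely new file: upload and record it
        register(hashes, address)
        return address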
More Problems
Billions of files are uploaded daily
Trillions stored already
Too big for a simple hash table
Millions of users upload simultaneously
Too many requests for a single table
Big Data
Need to store trillions or more objects
File addresses, user profiles, e-mails
Need fast search/access
Hash tables provide O(1) search/access on average, but for n = 10¹² the O(n + m) memory becomes too big
Solution: distributed hash tables
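A back-of-envelope estimate of why one table does not fit, assuming a modest 16 bytes per entry just for a hash and an address:

    10¹² entries × 16 bytes ≈ 16 TB of raw index data,
    far beyond a single machine's memory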
Distributed Hash Table
Get 1000 computers
Create a hash table on each of them
Determine which computer "owns" object O: number h(O) mod 1000
Send request to that computer, search/modify in the local hash table
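A sketch of the owner computation (h stands for any hash function on object keys):

    NUM_COMPUTERS = 1000

    def owner(obj, h):
        # Index of the computer whose local hash table stores obj.
        return h(obj) % NUM_COMPUTERS

The request for obj then goes to computer owner(obj, h), and everything else is an ordinary local hash-table operation.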
Problems
Computers sometimes break
If a computer breaks on average once in 2 years, then among 1000 computers more than one breaks every day!
Store several copies of the data
Need to relocate the data from the broken computer
Service grows, and new computers are added to the cluster
h(O) mod 1000 no longer works: switching to h(O) mod 1001 moves almost every object to a different computer
Consistent Hashing
Choose a hash function h with cardinality m and put the numbers from 0 to m − 1 on a circle, clockwise
Each object O is then mapped to the point on the circle with number h(O)
Map computer IDs to the same circle: compID → point number h(compID)
Each object is stored on the closest computer in the clockwise direction (see the sketch below)
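A minimal sketch of the ring lookup, assuming (per the rule above) that each object belongs to the first computer point clockwise from h(O); bisect finds that point:

    import bisect

    def build_ring(computer_ids, h, m):
        # One point on the circle [0, m) per computer, sorted clockwise.
        return sorted((h(cid) % m, cid) for cid in computer_ids)

    def find_owner(obj, ring, h, m):
        points = [p for p, _ in ring]
        i = bisect.bisect_left(points, h(obj) % m)
        return ring[i % len(ring)][1]    # wrap around from m - 1 back to 0

Adding or removing a computer only moves the keys in its arc to the next computer clockwise, unlike h(O) mod 1000.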
Conclusion
Distributed Hash Tables (DHT) store Big Data on many computers
Consistent Hashing (CH) is one way to determine which computer stores which data
CH uses mapping of keys and computer IDs on a circle
Each computer stores a range of keys
Overlay Network is used to route the data to/from the right computer