Hash Tables: Distributed Hash Tables--Data Structure

This article walks through ways to deduplicate files during upload: naive byte-by-byte comparison, comparing hashes, and using several hash functions, and then introduces distributed hash tables with consistent hashing to handle storage and lookup at very large scale.


Goal: upload files instantly, i.e., detect that an identical copy is already stored and avoid uploading it again.


Naive Comparison

  Upload new file
  Go through all stored files

  Compare each stored file with new file byte-by-byte

  If there's the same file, store a link to it instead of the new file 
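
A minimal sketch of this naive approach in Python; the file paths and the stored_files list are illustrative, not part of any real upload service:

    def same_bytes(path_a, path_b, chunk_size=1 << 16):
        """Compare two files byte-by-byte, reading them in chunks."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                a = fa.read(chunk_size)
                b = fb.read(chunk_size)
                if a != b:
                    return False
                if not a:           # both files ended at the same point
                    return True

    def store_or_link(new_path, stored_files):
        """Naive dedup: O(N * S) scan over all N stored files."""
        for stored in stored_files:
            if same_bytes(new_path, stored):
                return ("link", stored)    # identical file already stored
        stored_files.append(new_path)
        return ("stored", new_path)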


Drawbacks of Naive Comparison

  Have to upload the file first anyway 

  O(NS) to compare a file of size S with the N other files

  N grows, so the total running time of uploads grows as O(N²)


Idea: Compare Hashes

  As in Rabin-Karp's algorithm, compare hashes of files first

  If hashes are different, files are different

  If there's a file with the same hash, upload and compare directly 
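
A sketch of hash-first comparison, assuming a simple polynomial hash; the modulus P and base X are illustrative choices, and stored is a hypothetical in-memory list of (hash, contents) pairs:

    P = 1_000_000_007   # prime modulus (illustrative choice)
    X = 263             # base of the polynomial (illustrative choice)

    def poly_hash(data: bytes, p: int = P, x: int = X) -> int:
        """Polynomial hash of a byte string, as in Rabin-Karp."""
        h = 0
        for byte in data:
            h = (h * x + byte) % p
        return h

    def find_duplicate(new_data: bytes, stored):
        """stored: list of (hash, contents) pairs of already uploaded files."""
        new_hash = poly_hash(new_data)
        for h, contents in stored:
            if h != new_hash:
                continue                 # different hashes => different files
            if contents == new_data:     # same hash => still compare directly
                return contents
        return None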


Drawbacks of Hash Comparison

  There can be collisions

  Still have to upload the file to compare directly

  Still have to compare with all N stored files 


Idea: Several Hashes

  Choose several different hash functions

  Polynomial hashing with different p or x

  Compute all hashes for each file

  If there's a file with all the same hashes, files are probably equal

  Don't upload the new file in this case! 

  Compute hashes locally before upload 
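
A sketch of computing several polynomial hashes locally, before any upload; the (p, x) pairs in HASH_PARAMS are arbitrary illustrative choices:

    HASH_PARAMS = [          # illustrative (p, x) pairs, one per hash function
        (1_000_000_007, 263),
        (1_000_000_009, 31),
        (998_244_353, 131),
    ]

    def poly_hash(data: bytes, p: int, x: int) -> int:
        h = 0
        for byte in data:
            h = (h * x + byte) % p
        return h

    def file_signature(path: str) -> tuple:
        """All hashes of a file; equal signatures => files are almost surely equal."""
        with open(path, "rb") as f:
            data = f.read()
        return tuple(poly_hash(data, p, x) for p, x in HASH_PARAMS)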


Problem: Collisions

  Collisions can happen even for several hashes simultaneously

  There are algorithms to find collisions for known hash functions

  However, even for one hash function collisions are extremely rare

  Using 3 or 5 hashes, you probably won't see a collision in a lifetime 
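
A rough, hedged estimate of why this works: if each individual hash function collides on a fixed pair of distinct files with probability at most q, and the k hash functions are chosen independently, then the probability that all k collide simultaneously is at most q^k; with q around 10^-9 and k = 3 that is on the order of 10^-27.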


Problem: O(N) Comparisons

  Still have to compare with N already stored files


Idea: Precompute Hashes

  When a file is submitted for upload, hashes are computed anyway

  Store file addresses in a hash table 

  Also store all the hashes there

  Only need the hashes to search in the table 


Final Solution

  Choose 3-5 hash functions
  Store file addresses and hashes in a hash table
  Compute the hashes of new file locally before upload
  Search new file in the hash table
  Search is successful if all the hashes coincide
  Don't upload the file, store a link to the existing one 
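
A sketch of this flow, assuming a multi-hash signature (as in the earlier snippet) is computed locally and passed in; the table dict and the really_upload stub stand in for the server-side hash table and the actual network upload:

    table = {}   # signature (tuple of hashes) -> address of the stored file

    def really_upload(path: str) -> str:
        """Placeholder for the actual network upload; returns a storage address."""
        return "storage://" + path

    def upload(path: str, signature: tuple) -> str:
        """Return an address for the file, uploading it only if it is new."""
        if signature in table:
            return table[signature]       # all hashes coincide: store a link only
        address = really_upload(path)     # no match: the file really is new
        table[signature] = address
        return address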


More Problems

  Billions of files are uploaded daily

  Trillions stored already
  Too big for a simple hash table

  Millions of users upload simultaneously

  Too many requests for a single table


Big Data

Need to store trillions or more objects

File addresses, user profiles, e-mails

Need fast search/access

Hash tables provide O(1) search/access on average, but for n = 10^12, the O(n + m) memory becomes too big

Solution: distributed hash tables 


Distributed Hash Table

Get 1000 computers

Create a hash table on each of them

Determine which computer "owns" object O: the computer with number h(O) mod 1000

Send the request to that computer and search/modify in its local hash table
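
A toy sketch of this scheme, with each "computer" replaced by a local Python dict and the built-in hash() standing in for h (both are simplifications, not a real networked system):

    NUM_NODES = 1000
    nodes = [dict() for _ in range(NUM_NODES)]    # one hash table per "computer"

    def owner(key: str) -> int:
        """Computer that owns object O: h(O) mod 1000."""
        return hash(key) % NUM_NODES

    def put(key: str, value) -> None:
        nodes[owner(key)][key] = value            # in reality, a network request

    def get(key: str):
        return nodes[owner(key)].get(key)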


Problems

Computers sometimes break

If a computer breaks once in 2 years on average, then one of 1000 computers breaks roughly every day!

Store several copies of the data

Need to relocate the data from the broken computer

Service grows, and new computers are added to the cluster

h(O) mod 1000 no longer works
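
A quick illustration, with made-up keys and Python's built-in hash(), of why h(O) mod 1000 breaks down when a 1001st computer is added: almost every key changes owner and would have to be moved.

    keys = [f"file-{i}" for i in range(100_000)]
    moved = sum(1 for k in keys if hash(k) % 1000 != hash(k) % 1001)
    print(f"{moved / len(keys):.0%} of keys change owner")   # typically about 99.9%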


Consistent Hashing

Choose a hash function h with cardinality m and put the numbers from 0 to m-1 on a circle clockwise

Each object O is then mapped to a point on the circle with number h(O)

Map computer IDs to the same circle: compID → point number h(compID)

Each object O is then stored on the nearest computer on the circle (for example, the first computer clockwise from its point)
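
A minimal consistent-hashing sketch, assuming the convention that each key is stored on the first computer clockwise from its point; the cardinality M and the use of Python's built-in hash() are illustrative simplifications:

    import bisect

    M = 2 ** 32                                   # cardinality m of the hash function

    def h(s: str) -> int:
        return hash(s) % M                        # point on the circle for s

    class Ring:
        def __init__(self, computer_ids):
            pairs = sorted((h(c), c) for c in computer_ids)
            self.points = [p for p, _ in pairs]   # computers' points, in circle order
            self.ids = [c for _, c in pairs]

        def owner(self, key: str) -> str:
            """First computer clockwise from the key's point (wrapping around)."""
            i = bisect.bisect_left(self.points, h(key)) % len(self.points)
            return self.ids[i]

    ring = Ring(["node-0", "node-1", "node-2"])
    print(ring.owner("some-file"))                # prints one of the three node IDs

With this layout, adding or removing one computer only moves the keys in its arc of the circle to a neighbor, which is what makes the hashing "consistent".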

 

Conclusion

Distributed Hash Tables (DHT) store Big Data on many computers

Consistent Hashing (CH) is one way to determine which computer stores which data

CH uses mapping of keys and computer IDs on a circle
Each computer stores a range of keys

Overlay Network is used to route the data to/from the right computer 



