upload the file instantly.
Naive Comparison
Upload new file
Go through all stored files
Compare each stored file with new file byte-by-byte
If an identical file is found, store a link to it instead of the new file (a sketch of this loop follows)
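A minimal sketch of this loop in Python (find_duplicate and stored_paths are illustrative names, not part of any real storage API):

    def find_duplicate(new_path, stored_paths):
        # Naive deduplication: byte-by-byte comparison with every stored file.
        with open(new_path, "rb") as f:
            new_data = f.read()
        for path in stored_paths:
            with open(path, "rb") as f:
                if f.read() == new_data:
                    return path      # identical file already stored
        return None                  # no duplicate: the new file must be stored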
Drawbacks of Naive Comparison
Have to upload the file first anyway
O(NS) to compare a file of size S with N stored files
N grows with each upload, so the total running time of all uploads grows as O(N²): the k-th upload compares against k − 1 files, and 1 + 2 + … + N sums to about N²/2
Idea: Compare Hashes
As in the Rabin-Karp algorithm, compare hashes of files first
If hashes are different, files are different
If there's a file with the same hash, upload and compare directly (a sketch of this check follows)
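A sketch of the hash-first check, assuming stored_hashes maps each stored file's path to its precomputed hash, and file_hash is any fixed hash function on bytes (both names are illustrative):

    def find_duplicate_with_hash(new_path, stored_hashes, file_hash):
        with open(new_path, "rb") as f:
            new_data = f.read()
        h = file_hash(new_data)
        for path, stored in stored_hashes.items():
            if stored != h:
                continue             # different hashes => different files
            with open(path, "rb") as f:
                if f.read() == new_data:
                    return path      # same hash and same bytes: a duplicate
        return None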
Drawbacks of Hash Comparison
There can be collisions
Still have to upload the file to compare directly
Still have to compare with all N stored files
Idea: Several Hashes
Choose several different hash functions
Polynomial hashing with different p or x
Compute all hashes for each file
If there's a file with all the same hashes, files are probably equal
Don't upload the new file in this case!
Compute hashes locally before upload
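A sketch of computing several polynomial hashes locally; the particular primes p and multipliers x below are arbitrary illustrative choices:

    PARAMS = [(1_000_000_007, 263), (1_000_000_009, 269), (998_244_353, 271)]

    def poly_hashes(data: bytes) -> tuple:
        hashes = []
        for p, x in PARAMS:
            h = 0
            for byte in data:
                h = (h * x + byte) % p   # Horner's rule, as in Rabin-Karp
            hashes.append(h)
        return tuple(hashes)             # the combined fingerprint of the file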
Problem: Collisions
Collisions can happen even for several hashes simultaneously
There are algorithms to find collisions for known hash functions
However, even for one hash function collisions are extremely rare
Using 3 or 5 hashes, you probably won't see a collision in a lifetime
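A rough back-of-envelope estimate, assuming each hash takes about 10⁹ distinct values and the hashes behave independently:

    P(two different files share one hash)  ≈ 10⁻⁹
    P(they share all 3 hashes)             ≈ (10⁻⁹)³ = 10⁻²⁷
    P(they share all 5 hashes)             ≈ (10⁻⁹)⁵ = 10⁻⁴⁵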
Problem: O(N) Comparisons
Still have to compare with N already stored files
Idea: Precompute Hashes
When a file is submitted for upload, hashes are computed anyway
Store file addresses in a hash table
Also store all the hashes there
Only need the hashes to search in the table
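A sketch of such a table, keyed by the tuple of all hashes (table, lookup and register are illustrative names; a plain dict stands in for the real index):

    table = {}                       # {(h1, h2, ..., hk): "address in storage"}

    def lookup(hashes):
        return table.get(hashes)     # address of a matching file, or None

    def register(hashes, address):
        table[hashes] = address      # remember where this file is stored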
Final Solution
Choose 3-5 hash functions
Store file addresses and hashes in a hash table
Compute the hashes of the new file locally before upload
Search for the new file in the hash table
Search is successful if all the hashes coincide
If found, don't upload the file; store a link to the existing one (the sketch below puts the steps together)
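Putting these steps together on the client side (poly_hashes, lookup and register as in the sketches above; upload_file stands for the actual storage call and is stubbed out here):

    def upload_file(path):
        return "address-of-" + path  # hypothetical upload, returns an address

    def submit(path):
        with open(path, "rb") as f:
            hashes = poly_hashes(f.read())   # computed locally, before upload
        existing = lookup(hashes)
        if existing is not None:
            return existing          # all hashes coincide: link, don't upload
        address = upload_file(path)  # genuinely new file: upload and record it
        register(hashes, address)
        return address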
More Problems
Billions of files are uploaded daily
Trillions stored already
Too big for a simple hash table
Millions of users upload simultaneously
Too many requests for a single table
Big Data
Need to store trillions or more objects
File addresses, user profiles, e-mails
Need fast search/access
Hash tables provide O(1) search/access on average, but for n = 10¹² the O(n + m) memory becomes too big
Solution: distributed hash tables
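A back-of-envelope estimate of why one table does not fit, assuming a modest 16 bytes per entry just for a hash and an address:

    10¹² entries × 16 bytes ≈ 16 TB of raw index data,
    far beyond a single machine's memory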
Distributed Hash Table
Get 1000 computers
Create a hash table on each of them
Determine which computer "owns" object O: number h(O) mod 1000
Send request to that computer, search/modify in the local hash table
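A sketch of the owner computation (h stands for any hash function on object keys):

    NUM_COMPUTERS = 1000

    def owner(obj, h):
        # Index of the computer whose local hash table stores obj.
        return h(obj) % NUM_COMPUTERS

The request for obj then goes to computer owner(obj, h), and everything else is an ordinary local hash-table operation.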
Problems
Computers sometimes break
If a computer breaks on average once in 2 years, then among 1000 computers more than one breaks every day!
Store several copies of the data
Need to relocate the data from the broken computer
Service grows, and new computers are added to the cluster
h(O) mod 1000 no longer works: switching to h(O) mod 1001 moves almost every object to a different computer
Consistent Hashing
Choose a hash function h with cardinality m and put the numbers from 0 to m − 1 on a circle, clockwise
Each object O is then mapped to the point on the circle with number h(O)
Map computer IDs to the same circle: compID → point number h(compID)
Each object is stored on the closest computer in the clockwise direction (see the sketch below)
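A minimal sketch of the ring lookup, assuming (per the rule above) that each object belongs to the first computer point clockwise from h(O); bisect finds that point:

    import bisect

    def build_ring(computer_ids, h, m):
        # One point on the circle [0, m) per computer, sorted clockwise.
        return sorted((h(cid) % m, cid) for cid in computer_ids)

    def find_owner(obj, ring, h, m):
        points = [p for p, _ in ring]
        i = bisect.bisect_left(points, h(obj) % m)
        return ring[i % len(ring)][1]    # wrap around from m - 1 back to 0

Adding or removing a computer only moves the keys in its arc to the next computer clockwise, unlike h(O) mod 1000.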
Conclusion
Distributed Hash Tables (DHT) store Big Data on many computers
Consistent Hashing (CH) is one way to determine which computer stores which data
CH uses mapping of keys and computer IDs on a circle
Each computer stores a range of keys
Overlay Network is used to route the data to/from the right computer