Distributed Hash Tables

Distributed hash tables are an essential component of robust peer-to-peer networks. Learn to write applications that let everyone's copy share the same data.

In the world of decentralization, distributed hash tables (DHTs) recently have had a revolutionary effect. The chaotic, ad hoc topologies of the first-generation peer-to-peer architectures have been superseded by a set of topologies with emergent order, provable properties and excellent performance. Knowledge of DHT algorithms is going to be a key ingredient in future developments of distributed applications.

A number of research DHTs have been developed by universities, picked up by the Open Source community and implemented. A few proprietary implementations exist as well, but currently none are available as SDKs; rather, they are embedded in commercially available products. Each DHT scheme generally is pitched as being an entity unto itself, different from all other schemes. In actuality, the various available schemes all fit into a multidimensional matrix. Take one, make a few tweaks and you end up with one of the other ones. Existing research DHTs, such as Chord, Kademlia and Pastry, therefore are starting points for the development of your own custom schemes. Each has properties that can be combined in a multitude of ways. In order to fully express the spectrum of options, let's start with a basic design and then add complexity in order to gain useful properties.

Basically, a DHT performs the functions of a hash table. You can store a key and value pair, and you can look up a value if you have the key. Values are not necessarily persisted on disk, although you certainly could base a DHT on top of a persistent hash table, such as Berkeley DB; and in fact, this has been done. The interesting thing about DHTs is that storage and lookups are distributed among multiple machines. Unlike existing master/slave database replication architectures, all nodes are peers that can join and leave the network freely. Despite the apparent chaos of periodic random changes to the membership of the network, DHTs make provable guarantees about performance.

To begin our exploration of DHT designs, we start with a circular, double-linked list. Each node in the list is a machine on the network. Each node keeps a reference to the next and previous nodes in the list, the addresses of other machines. We must define an ordering so we can determine what the “next” node is for each node in the list. The method used by the Chord DHT to determine the next node is as follows: assign a unique random ID of k bits to each node. Arrange the nodes in a ring so the IDs are in increasing order clockwise around the ring. For each node, the next node is the one that is the smallest distance clockwise away. For most nodes, this is the node whose ID is closest to but still greater than the current node's ID. The one exception is the node with the greatest ID, whose successor is the node with the smallest ID. This distance metric is defined more concretely in the distance method (Listing 1).
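The article's Listing 1 is not reproduced here, but a minimal sketch of this clockwise distance metric might look like the following (the names and the k = 8 ID size are illustrative choices, not the article's actual code):

```python
K = 8            # ID space of 2**8 = 256 positions, kept small for illustration
RING = 2 ** K

def distance(a: int, b: int) -> int:
    """Clockwise distance from ID a to ID b around the ring."""
    return (b - a) % RING
```

Note that this metric is not symmetric: going clockwise from 250 to 4 covers 10 positions, while going from 4 to 250 covers 246. That asymmetry becomes important later, when we consider XOR as an alternative metric.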

Each node is itself a standard hash table. All you need to do to store or retrieve a value from the hash table is find the appropriate node in the network, then do a normal hash table store or lookup there. A simple way to determine which node is appropriate for a particular key (the one Chord uses) is the same as the method for determining the successor of a particular node ID. First, take the key and hash it to generate a key of exactly k bits. Treat this number as a node ID, and determine which node is its successor by starting at any point in the ring and working clockwise until a node is found whose ID is closest to but still greater than the key. The node you find is the node responsible for storage and lookup for that particular key (Listing 2). Using a hash to generate the key is beneficial because hashes generally are distributed evenly, and different keys are distributed evenly across all of the nodes in the network.
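A simulated version of this lookup, in the spirit of Listing 2 (which is not shown here), could be sketched as follows. The class and function names are assumptions, and hashing a key down to K bits with truncated SHA-1 is my own illustrative choice:

```python
import hashlib

K = 8
RING = 2 ** K

def distance(a: int, b: int) -> int:
    """Clockwise distance from ID a to ID b around the ring."""
    return (b - a) % RING

class Node:
    """One simulated machine: a ring position plus an ordinary hash table."""
    def __init__(self, node_id: int):
        self.id = node_id
        self.next = self          # successor pointer; a real node holds an address
        self.data = {}

def hash_key(key: str) -> int:
    """Hash an arbitrary key down to a K-bit ring ID."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING

def find_node(start: Node, key_id: int) -> Node:
    """Walk clockwise until the key falls between a node and its successor."""
    current = start
    while distance(current.id, key_id) > distance(current.next.id, key_id):
        current = current.next
    return current.next

def store(start: Node, key: str, value) -> None:
    find_node(start, hash_key(key)).data[key] = value

def lookup(start: Node, key: str):
    return find_node(start, hash_key(key)).data.get(key)
```

Because the walk is deterministic, a store started at one node and a lookup started at any other node converge on the same responsible node.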

This DHT design is simple but entirely sufficient to serve the purpose of a distributed hash table. Given a static network of nodes with perfect uptime, you can start with any node and key and find the node responsible for that key. An important thing to keep in mind is that although the example code so far looks like a fairly normal, doubly linked list, this is only a simulation of a DHT. In a real DHT, each node would be on a different machine, and all calls to them would need to be communicated over some kind of socket protocol.

In order to make the current design more useful, it would be nice to account for nodes joining and leaving the network, whether intentionally or through failure. To enable this feature, we must establish a join/leave protocol for our network. The first step in the Chord join protocol is to look up the successor of the new node's ID using the normal lookup protocol. The new node is inserted between the found successor node and that node's predecessor. The new node becomes responsible for a portion of the keys the successor node previously was responsible for. In order to ensure that all lookups work without fail, that portion of the keys should be copied to the new node before the predecessor changes its next node pointer to point to the new node.

Leaves are very simple; the leaving node copies all of its stored information to its successor, and the predecessor then changes its next node pointer to point to that successor. The join and leave code is similar to the code for inserting and removing elements from a normal linked list, with the added requirement of migrating data between the joining or leaving node and its neighbors. In a normal linked list, you remove a node to delete the data it's holding. In a DHT, the insertion and removal of nodes is independent of the insertion and removal of data. It might be useful to think of DHT nodes as similar to the periodically readjusting buckets used in persistent hash table implementations, such as Berkeley DB.

Allowing the network to have dynamic members while ensuring that storage and lookups still function properly certainly is an improvement to our design. However, the performance is terrible—O(n) with an expected performance of n/2. Each node traversed requires communication with a different machine on the network, which might require the establishment of a TCP/IP connection, depending on the chosen transport. Therefore, n/2 traversed nodes can be quite slow.

In order to achieve better performance, the Chord design adds a layer that provides O(log n) lookups. Instead of storing a pointer only to the next node, each node stores a “finger table” containing the addresses of k nodes. The distances between the current node's ID and the IDs of the nodes in the finger table increase exponentially. Each node traversed on the path to a particular key cuts the remaining distance roughly in half, so O(log n) nodes are traversed overall.

In order for logarithmic lookups to work, the finger table needs to be kept up to date. An out-of-date finger table doesn't break lookups as long as each node has an up-to-date next pointer, but lookups are logarithmic only if the finger table is correct. Updating the finger table requires that a node address be found for each of the k slots in the table. For any slot x, where x ranges from 1 to k, finger[x] is determined by taking the current node's ID and looking up the node responsible for the key (id + 2^(x-1)) mod 2^k (Listing 3). When doing lookups, you now have k nodes to choose from at each hop instead of only one. From each node you visit, you follow the finger table entry with the shortest distance to the key (Listing 4).
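Listings 3 and 4 are not reproduced here; a compact simulation of the finger-table layer might look like the following. It uses 0-based slots, so finger[x] is the node responsible for (id + 2^x) mod 2^k, and it falls back on the plain O(n) walk only to build the table:

```python
K = 8
RING = 2 ** K

def distance(a: int, b: int) -> int:
    return (b - a) % RING

class Node:
    def __init__(self, node_id: int):
        self.id = node_id
        self.next = self
        self.finger = []

def find_node(start: Node, key_id: int) -> Node:
    """The plain O(n) walk from the basic design."""
    current = start
    while distance(current.id, key_id) > distance(current.next.id, key_id):
        current = current.next
    return current.next

def update_fingers(node: Node) -> None:
    """finger[x] is the node responsible for (id + 2**x) mod 2**K."""
    node.finger = [find_node(node, (node.id + 2 ** x) % RING)
                   for x in range(K)]

def finger_lookup(start: Node, key_id: int) -> Node:
    """O(log n) lookup: jump by the farthest finger that doesn't pass the key."""
    current = start
    while distance(current.id, key_id) > distance(current.next.id, key_id):
        reachable = [f for f in current.finger
                     if distance(current.id, f.id) <= distance(current.id, key_id)]
        current = max(reachable, key=lambda f: distance(current.id, f.id))
    return current.next
```

Each jump at least halves the clockwise distance left to cover, which is where the logarithmic hop count comes from.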

So far we have more or less defined the original version of the Chord DHT design as it was described by the MIT team that invented it. This is only the tip of the iceberg in the world of DHTs, though. Many modifications can be made to establish different properties from the ones described in the original Chord paper, without losing the logarithmic performance and guaranteed lookups that Chord provides.

One property that might be useful for a DHT is the ability to update the finger table passively, without requiring periodic lookups to be done in order to refresh the table. With MIT Chord, you must do a lookup, hitting O(log n) nodes, for each of the k items in the finger table, which can result in a considerable amount of traffic. It would be advantageous if a node could add other nodes to its finger table when they contacted it for lookups. As a conversation already has been established in order to do the lookup, there is little added overhead in checking to see if the node doing the lookup is a good candidate for the local finger table. Unfortunately, finger table links in Chord are unidirectional because the distance metric is not symmetric. A node generally is not going to be in the finger tables of the nodes in its own finger table.

A solution to this problem is to replace Chord's modular addition distance metric with one based on XOR. The distance between two nodes, A and B, is defined as the XOR of the node IDs interpreted as the binary representation of an unsigned integer (Listing 5). XOR makes a delightful distance metric because it is symmetric. Because distance(A, B) == distance(B, A) for any two nodes, if A is in B's finger table, then B is in A's finger table. This means nodes can update their finger tables by recording the addresses of nodes that query them, significantly reducing the amount of node update traffic. It also simplifies coding a DHT application, because you don't need a separate thread to call the update method periodically. Instead, you simply update whenever the lookup method is called.
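Listing 5 is not shown here, but the XOR metric, plus the passive finger-table update it enables, can be sketched as follows. The slot layout (one slot per leading bit of the distance) is my own illustrative choice:

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: XOR the IDs, read the result as an integer."""
    return a ^ b

class Node:
    def __init__(self, node_id: int, k: int = 8):
        self.id = node_id
        # slot x holds a contact whose distance has its highest set bit at x
        self.finger = [None] * k

    def observe(self, other_id: int) -> None:
        """Passively record a node that contacted us for a lookup."""
        d = xor_distance(self.id, other_id)
        if d == 0:
            return                      # that's us
        slot = d.bit_length() - 1
        if self.finger[slot] is None:   # keep existing entries
            self.finger[slot] = other_id
```

Because the metric is symmetric, two nodes that differ in a given high bit land in each other's same-numbered slot, so a single lookup conversation can populate a finger on both ends at once.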

An issue with the design presented so far is that the paths to a given node are fragile. If any node in the path refuses to cooperate, the lookup is stuck. Between any two nodes there is exactly one path, so routing around broken nodes is impossible. The Kademlia DHT solves this by widening the finger table to contain a bucket of j references in each finger table slot instead of only one, where j is defined globally for the whole network. Now j different choices are available at each hop, so there are somewhere around j*log(n) possible paths. There are fewer than that in practice, because paths converge as they get closer to the target, but the number of possible paths is still greater than one, so this is an improvement.

Kademlia goes further and orders the nodes in the bucket in terms ofrecorded uptime. Older nodes are given preference for queries, and newreferences are added only if there are not enough old nodes. Besides theincreased reliability of queries, this approach offers the added benefitthat an attack on the network in which new nodes are created rapidly inorder to push out good nodes will fail—it won't even be noticeable.
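A sketch of one such widened, uptime-ordered finger slot follows. The Bucket structure and j = 4 are assumptions for illustration, not Kademlia's actual code:

```python
from collections import OrderedDict

J = 4   # references per finger-table slot, fixed network-wide

class Bucket:
    """One widened finger slot: up to J contacts, longest-lived first."""
    def __init__(self):
        self.contacts = OrderedDict()   # node_id -> address, oldest first

    def add(self, node_id, address):
        if node_id in self.contacts:
            return                      # already known; keep its seniority
        if len(self.contacts) < J:
            self.contacts[node_id] = address
        # else: the bucket is full of older, preferred nodes, so the
        # newcomer is dropped; this is what defeats the attack in which
        # nodes are created rapidly to push out good ones

    def preferred(self):
        """Contacts in query order, oldest first."""
        return list(self.contacts)
```

The design choice is that seniority can only be earned by staying in the bucket over time, never bought by showing up often.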

It's important to understand that these different properties are not tied to a particular DHT implementation. We gradually have built up a DHT design from scratch, developed it into something that resembles Chord, then modified it to be more like Kademlia. The different approaches can be mixed and matched: your finger table buckets can have one slot or j slots, and you can use modular addition or XOR for your distance metric. You can always follow the closest node, or you can rank candidates according to uptime or some other criterion. You can draw from several other DHT designs, such as Pastry, OceanStore and Coral. You also can use your own ideas to devise the perfect design for your needs. Myself, I have concocted several modifications to a base Chord design to add properties such as anonymity, Byzantine fault-tolerant lookups, geographic routing and the efficient broadcasting of messages to the entire network. It's fun to do and easier than you think.

Now that you know how to create your own DHT implementations, you're probably wondering what kind of crazy things you can do with this code. Although there probably are many applications for DHTs that I haven't thought of yet, I know people already are working on such projects as file sharing, creating a shared hard drive for backing up data, replacing DNS with a peer-to-peer name resolution system, scalable chat and serverless gaming.

For this article, I've tied the code together into a fun little example application that might be of interest to readers who caught my interview on the Linux Journal Web site about peer-to-peer Superworms (see Resources). The application is a distributed port scanner that stores results in the simulated DHT (Listing 6). Given a fully functional DHT implementation, this script would have some handy properties. First, it allows multiple machines to contribute results to a massive scanning of the Internet. This way, all of the scanning activity can't be linked with a single machine. Additionally, it avoids redundant scanning: if a host already has been scanned, the results are fetched from the DHT, avoiding multiple scans. No central server is required to hold all of the results or to coordinate the activities of the participants. This application may seem somewhat insidious, but the point is that it was trivial to write given the DHT library. The same approach can be used in other sorts of distributed projects.
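Listing 6 is not reproduced here, but its core amounts to a check-then-store loop around the DHT interface. In this sketch, FakeDHT and scan_ports are stand-ins I've invented so the control flow is runnable; a real version would use the DHT library and an actual scanner:

```python
class FakeDHT:
    """An in-memory stand-in for the real DHT's store/lookup interface."""
    def __init__(self):
        self._table = {}
    def lookup(self, key):
        return self._table.get(key)
    def store(self, key, value):
        self._table[key] = value

def scan_ports(host):
    """Stand-in for a real port scan; pretend only port 80 answered."""
    return {80: "open"}

def scan_host(dht, host):
    cached = dht.lookup(host)
    if cached is not None:
        return cached              # another participant already scanned it
    results = scan_ports(host)
    dht.store(host, results)       # share the results with everyone else
    return results
```

Every participant running this loop gets the de-duplication and the shared result set for free; the coordination lives entirely in the DHT.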

In this installment of our two-part series, we discussed the theory behind building DHTs. Next time, we'll talk about practical issues in using DHTs in real-world applications.

Brandon Wiley is a peer-to-peer hacker and current president of the Foundation for Decentralization Research, a nonprofit corporation empowering people through technology. He can be contacted at blanu@decentralize.org.

