The Rings
The rings determine where data should reside in the cluster. There is a separate ring for account databases, container databases, and individual objects, but each ring works in the same way. The rings are externally managed: the server processes never modify them, and instead receive new rings that other tools have built.
The ring uses a configurable number of bits from a path's MD5 hash as a partition index that designates a device. The number of bits kept from the hash is known as the partition power, and 2 to the partition power is the partition count. Partitioning the full MD5 hash ring allows other parts of the cluster to work on batches of items at once, which ends up either more efficient or at least less complex than working with each item separately or with the entire cluster all at once.
Another configurable value is the replica count, which indicates how many of the partition->device assignments comprise a single ring. For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical location, power separation, network separation, or any other attribute that would lessen the chance of multiple replicas being unavailable at the same time.
Ring Builder
The rings are built and managed manually by a utility called the ring-builder. The ring-builder assigns partitions to devices and writes an optimized Python structure to a gzipped, serialized (pickled) file on disk for shipping out to the servers. The server processes just check the modification time of the file occasionally and reload their in-memory copies of the ring structure as needed. Because of how the ring-builder manages changes to the ring, using a slightly older ring usually just means that one of the three replicas for a subset of the partitions will be incorrect, which can be easily worked around.
The ring-builder also keeps its own builder file with the ring information and the additional data required to build future rings. It is very important to keep multiple backup copies of these builder files. One option is to copy the builder files out to every server while copying the ring files themselves. Another is to upload the builder files into the cluster itself. Complete loss of a builder file means creating a new ring from scratch; nearly all partitions will end up assigned to different devices, and therefore nearly all stored data will have to be replicated to new locations. So recovery from a builder file loss is possible, but data will definitely be unreachable for an extended time.
Ring Data Structure
The ring data structure consists of three top-level fields: a list of devices in the cluster, a list of lists of device ids indicating partition-to-device assignments, and an integer indicating the number of bits to shift an MD5 hash to calculate the partition for the hash.
List of Devices
The list of devices is known internally to the Ring class as devs. Each item in the list of devices is a dictionary with the following keys:
id | integer | The index into the list of devices.
zone | integer | The zone the device resides in.
weight | float | The relative weight of the device in comparison to other devices. This usually corresponds directly to the amount of disk space the device has compared to other devices. For instance, a device with 1 terabyte of space might have a weight of 100.0 and another device with 2 terabytes of space might have a weight of 200.0. The weight can also be used to bring back into balance a device that has ended up with more or less data than desired over time. A good average weight of 100.0 leaves room to lower the weight later if necessary.
ip | string | The IP address or hostname of the server containing the device.
port | int | The TCP port on which the server process that serves requests for the device listens.
device | string | The on-disk name of the device on the server. For example: sdb1
meta | string | A general-use field for storing additional information about the device. This information isn't used directly by the server processes, but can be useful in debugging. For example, the date and time of installation and the hardware manufacturer could be stored here.
Note: The list of devices may contain holes, or indexes set to None, for devices that have been removed from the cluster. Generally, device ids are not reused. Also, some devices may be temporarily disabled by setting their weight to 0.0. To obtain a list of active devices (for uptime polling, for example) the Python code would look like:
devices = [device for device in self.devs if device and device['weight']]
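As a minimal, self-contained sketch of the filter above, the following uses a made-up device list with one hole and one disabled device. All of the values, and the helper name active_devices, are assumptions for illustration only.

```python
# Illustrative device list; every value here is invented for the example.
devs = [
    {"id": 0, "zone": 1, "weight": 100.0, "ip": "10.0.0.1",
     "port": 6000, "device": "sdb1", "meta": ""},
    None,  # hole left by a device that was removed from the cluster
    {"id": 2, "zone": 2, "weight": 0.0, "ip": "10.0.0.2",
     "port": 6000, "device": "sdb1", "meta": "disabled for maintenance"},
    {"id": 3, "zone": 3, "weight": 200.0, "ip": "10.0.0.3",
     "port": 6000, "device": "sdb1", "meta": ""},
]


def active_devices(devs):
    """Return devices that are present (not None) and have a non-zero weight."""
    return [dev for dev in devs if dev and dev["weight"]]


active_ids = [dev["id"] for dev in active_devices(devs)]
print(active_ids)  # [0, 3]
```

The hole at index 1 is skipped because None is falsy, and device 2 is skipped because its weight of 0.0 is falsy.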
Partition Assignment List
This is a list of array('H') of device ids. The outermost list contains an array('H') for each replica. Each array('H') has a length equal to the partition count for the ring. Each integer in the array('H') is an index into the above list of devices. The partition list is known internally to the Ring class as _replica2part2dev_id.
So, to create a list of device dictionaries assigned to a partition, the Python code would look like:
devices = [self.devs[part2dev_id[partition]] for part2dev_id in self._replica2part2dev_id]
That code is a little simplistic, as it does not remove duplicate devices. If a ring has more replicas than devices, then a partition will have more than one replica on one device; that's simply the pigeonhole principle at work.
array('H') is used for memory conservation, as there may be millions of partitions.
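The lookup and the duplicate-device caveat described above can be sketched as follows. This is a toy ring with three replicas, four partitions, and only two devices; the data and the function name devices_for_partition are assumptions for illustration, not the Ring class's own code.

```python
from array import array

# Toy data: 2 devices, 3 replicas, 4 partitions (all values invented).
devs = [
    {"id": 0, "zone": 1, "device": "sdb1"},
    {"id": 1, "zone": 2, "device": "sdb1"},
]
replica2part2dev_id = [
    array("H", [0, 1, 0, 1]),  # replica 0: device id for partitions 0..3
    array("H", [1, 0, 1, 0]),  # replica 1
    array("H", [0, 0, 1, 1]),  # replica 2
]


def devices_for_partition(devs, replica2part2dev_id, partition):
    """Return the distinct devices holding a partition, in replica order."""
    seen = set()
    result = []
    for part2dev_id in replica2part2dev_id:
        dev_id = part2dev_id[partition]
        if dev_id not in seen:  # skip a device that already holds a replica
            seen.add(dev_id)
            result.append(devs[dev_id])
    return result


ids = [dev["id"] for dev in devices_for_partition(devs, replica2part2dev_id, 0)]
print(ids)  # [0, 1]
```

Partition 0 has replicas on devices 0, 1, and 0 again, so the de-duplicated result holds two devices even though the ring has three replicas.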
Fractional Replicas
A ring is not restricted to having an integer number of replicas. In order to support the gradual changing of replica counts, the ring is able to have a real number of replicas.
When the number of replicas is not an integer, the last element of _replica2part2dev_id will have a length that is less than the partition count for the ring. This means that some partitions will have more replicas than others. For example, if a ring has 3.25 replicas, then 25% of its partitions will have four replicas, while the remaining 75% will have just three.
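The 3.25-replica example can be checked with a small sketch. It assumes the shorter last array simply covers the lowest-numbered partitions; the function names are hypothetical and the partition power of 4 is chosen only to keep the numbers small.

```python
def replica_array_lengths(part_power, replicas):
    """Length of each per-replica array for a possibly fractional replica count."""
    part_count = 2 ** part_power
    whole = int(replicas)
    fraction = replicas - whole
    lengths = [part_count] * whole
    if fraction:
        # The last array is shorter than the partition count.
        lengths.append(int(part_count * fraction))
    return lengths


def replicas_of(partition, lengths):
    """Number of replica arrays that have an entry for this partition."""
    return sum(1 for length in lengths if partition < length)


lengths = replica_array_lengths(part_power=4, replicas=3.25)
print(lengths)  # [16, 16, 16, 4]

with_four = sum(1 for p in range(16) if replicas_of(p, lengths) == 4)
print(with_four / 16)  # 0.25
```

With 16 partitions, four of them (25%) end up with four replicas and the other twelve with three.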
Overload
The ring builder tries to keep replicas as far apart as possible while still respecting device weights. When it can't do both, the overload factor determines what happens. Each device will take some extra fraction of its desired partitions to allow for replica dispersion; once that extra fraction is exhausted, replicas will be placed closer together than optimal.
Essentially, the overload factor lets the operator trade off replica dispersion (durability) against data dispersion (uniform disk usage).
The default overload factor is 0, so device weights will be strictly followed.
With an overload factor of 0.1, each device will accept 10% more partitions than it otherwise would, but only if needed to maintain partition dispersion.
Example: Consider a 3-node cluster of machines with equal-size disks; let node A have 12 disks, node B have 12 disks, and node C have only 11 disks. Let the ring have an overload factor of 0.1 (10%).
Without the overload, some partitions would end up with replicas only on nodes A and B. However, with the overload, every device is willing to accept up to 10% more partitions for the sake of dispersion. The missing disk in C means there is one disk's worth of partitions that would like to spread across the remaining 11 disks, which gives each disk in C an extra 9.09% load. Since this is less than the 10% overload, there is one replica of each partition on each node.
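The arithmetic behind the 9.09% figure can be reproduced with a few lines. This is only a back-of-the-envelope check of the example above, not how the ring builder itself evaluates overload; the variable names are invented.

```python
# Disks per node, as in the example above.
disks = {"A": 12, "B": 12, "C": 11}
overload = 0.1

# If every node holds one replica of each partition, node C must fit a
# 12-disk share of data onto 11 disks.
full_node_disks = max(disks.values())
extra_load_c = full_node_disks / disks["C"] - 1  # 12/11 - 1

fits_within_overload = extra_load_c <= overload

print(f"extra load per disk in C: {extra_load_c:.2%}")
print("dispersion achievable:", fits_within_overload)
```

Because 1/11 is below the 10% overload, the disks in C can absorb the extra partitions and each node keeps one replica of every partition.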
However, this does mean that the disks in node C will have more data on them than the disks in nodes A and B. Each disk in C carries 12/11 as much data as a disk in A or B, so if 80% full is the warning threshold for the cluster, node C's disks will reach 80% full while A and B's disks are only about 73.3% full (80% × 11/12).
Partition Shift Value
The partition shift value is known internally to the Ring class as _part_shift. This value is used to shift an MD5 hash to calculate the partition on which the data for that hash should reside. Only the top four bytes of the hash are used in this process. For example, to compute the partition for the path /account/container/object the Python code might look like:
from hashlib import md5
from struct import unpack_from
partition = unpack_from('>I', md5(b'/account/container/object').digest())[0] >> self._part_shift
For a ring generated with part_power P, the partition shift value is 32 - P.
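Putting the shift formula and the hash lookup together, a standalone sketch might look like the following. The function name get_partition and the partition power of 20 are assumptions for the example; the sketch hashes the bare path only and makes no claim about any extra salting a real deployment may apply.

```python
from hashlib import md5
from struct import unpack_from


def get_partition(path, part_power):
    """Map a path to a partition using the top part_power bits of its MD5 hash."""
    part_shift = 32 - part_power
    digest = md5(path.encode("utf-8")).digest()
    # '>I' reads the first four bytes as a big-endian unsigned 32-bit integer.
    top_four_bytes = unpack_from(">I", digest)[0]
    return top_four_bytes >> part_shift


part_power = 20
partition = get_partition("/account/container/object", part_power)
print(0 <= partition < 2 ** part_power)  # True
```

Shifting the top 32 bits right by 32 - P keeps exactly the top P bits of the hash, so the result always falls in the range 0 to 2^P - 1.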
Building the Ring
The initial building of the ring first calculates the number of partitions that should ideally be assigned to each device based on the device's weight. For example, given a partition power of 20, the ring will have 1,048,576 partitions. If there are 1,000 devices of equal weight, they will each desire 1,048.576 partitions. The devices are then sorted by the number of partitions they desire and kept in order throughout the initialization process.
Note: each device is also assigned a random tiebreaker value that is used when two devices desire the same number of partitions. This tiebreaker is not stored on disk anywhere, and so two different rings created with the same parameters will have different partition assignments. For repeatable partition assignments, RingBuilder.rebalance() takes an optional seed value that will be used to seed Python's pseudo-random number generator.
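The weight-proportional calculation and the seeded tiebreaker can be illustrated with a short sketch. The helpers desired_partitions and order_by_desire are hypothetical names, the replicas parameter defaults to 1 to match the per-ring figure quoted above, and the tiebreaker shown here only imitates the idea; it is not RingBuilder's implementation.

```python
import random


def desired_partitions(weights, part_power, replicas=1):
    """Partitions each device would ideally hold, in proportion to its weight."""
    part_count = 2 ** part_power
    total_weight = sum(weights.values())
    return {
        dev_id: part_count * replicas * weight / total_weight
        for dev_id, weight in weights.items()
    }


def order_by_desire(wanted, seed=None):
    """Sort device ids by desired partitions, breaking ties pseudo-randomly."""
    rng = random.Random(seed)
    tiebreaker = {dev_id: rng.random() for dev_id in wanted}
    return sorted(
        wanted,
        key=lambda dev_id: (wanted[dev_id], tiebreaker[dev_id]),
        reverse=True,
    )


weights = {dev_id: 100.0 for dev_id in range(1000)}
wanted = desired_partitions(weights, part_power=20)
print(wanted[0])  # 1048.576

# The same seed reproduces the same ordering of equally weighted devices.
same_order = order_by_desire(wanted, seed=42) == order_by_desire(wanted, seed=42)
print(same_order)  # True
```

Without a seed, two runs would generally order the 1,000 tied devices differently, which is why identical parameters alone do not give identical assignments.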
Then, the ring builder assigns each replica of each partition to the device that desires the most partitions at that point, while keeping it as far away as possible from other replicas. The ring builder prefers to assign a replica to a device in a region that has no replicas already; should there be no such region available, the ring builder will try to find a device in a different zone; if that is not possible, it will look on a different server; failing that, it will just look for a device that has no replicas; finally, if all other options are exhausted, the ring builder will assign the replica to the device that has the fewest replicas already assigned. Note that assignment of multiple replicas to one device will only happen if the ring has fewer devices than it has replicas. After each assignment, the device's desired partition count is decremented by one, the device is moved to its new sorted position in the device list, and the process continues.
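The preference order described above (new region, then new zone, then new server, then a device with the fewest replicas, then the most desired partitions) can be expressed as a sort key. This is a simplified sketch and not the ring builder's actual algorithm; the region and parts_wanted keys and the pick_device function are assumptions made for the example.

```python
from collections import Counter


def pick_device(candidates, placed):
    """Choose the next device for a partition, given devices already holding replicas."""
    used_regions = {dev["region"] for dev in placed}
    used_zones = {(dev["region"], dev["zone"]) for dev in placed}
    used_servers = {dev["ip"] for dev in placed}
    replicas_on = Counter(dev["id"] for dev in placed)

    def rank(dev):
        # Tuples compare left to right, so earlier entries take priority.
        return (
            dev["region"] not in used_regions,
            (dev["region"], dev["zone"]) not in used_zones,
            dev["ip"] not in used_servers,
            -replicas_on[dev["id"]],  # fewer replicas already here is better
            dev["parts_wanted"],      # otherwise, the most desirous device
        )

    return max(candidates, key=rank)


devs = [
    {"id": 0, "region": 1, "zone": 1, "ip": "10.0.0.1", "parts_wanted": 5},
    {"id": 1, "region": 1, "zone": 1, "ip": "10.0.0.1", "parts_wanted": 9},
    {"id": 2, "region": 1, "zone": 2, "ip": "10.0.0.2", "parts_wanted": 3},
    {"id": 3, "region": 2, "zone": 1, "ip": "10.0.0.3", "parts_wanted": 1},
]

placed = []
for _ in range(4):
    chosen = pick_device(devs, placed)
    chosen["parts_wanted"] -= 1  # the device now desires one partition fewer
    placed.append(chosen)

order = [dev["id"] for dev in placed]
print(order)  # [1, 3, 2, 0]
```

The first replica goes to the most desirous device (1), the second to the only other region (3), the third to an unused zone (2), and the fourth falls back to the only device that holds no replica yet (0).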
When building a new ring based on an old ring, the desired number of partitions each device wants is recalculated. Next, the partitions to be reassigned are gathered up. Any removed devices have all their assigned partitions unassigned and added to the gathered list. Any partition replicas that (due to the addition of new devices) can be spread out for better durability are unassigned and added to the gathered list. Any devices that have more partitions than they now desire have random partitions unassigned from them and added to the gathered list. Lastly, the gathered partitions are reassigned to devices using a method similar to the initial assignment described above.
Whenever a partition has a replica reassigned, the time of the reassignment is recorded. This is taken into account when gathering partitions to reassign so that no partition is moved twice in a configurable amount of time. This configurable amount of time is known internally to the RingBuilder class as min_part_hours. This restriction is ignored for replicas of partitions on devices that have been removed, as removing a device only happens on device failure and there's no choice but to make a reassignment.
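The min_part_hours rule amounts to a simple time check. The sketch below assumes reassignment times are kept as Unix timestamps in a dictionary keyed by partition number; that storage layout and the name can_reassign are illustrative assumptions, not RingBuilder internals.

```python
import time


def can_reassign(last_moved, partition, min_part_hours,
                 now=None, device_removed=False):
    """Return True if a replica of this partition may be moved right now."""
    if device_removed:
        # Replicas on removed devices must move regardless of the time limit.
        return True
    now = time.time() if now is None else now
    moved_at = last_moved.get(partition)
    if moved_at is None:
        return True  # never moved before
    return now - moved_at >= min_part_hours * 3600


# Partition 7 was last moved at t=1000 seconds (made-up timestamp).
last_moved = {7: 1000.0}

too_soon = can_reassign(last_moved, 7, min_part_hours=1, now=1000.0 + 1800)
after_wait = can_reassign(last_moved, 7, min_part_hours=1, now=1000.0 + 3600)
forced = can_reassign(last_moved, 7, min_part_hours=1, now=1000.0 + 1800,
                      device_removed=True)
print(too_soon, after_wait, forced)  # False True True
```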
The above processes don't always perfectly rebalance a ring, due to the random nature of gathering partitions for reassignment. To help reach a more balanced ring, the rebalance process is repeated until the ring is near perfect (less than 1% off) or until the balance doesn't improve by at least 1% (indicating we probably can't get perfect balance due to wildly imbalanced zones or too many partitions recently moved).
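The stopping rule can be sketched as a loop around a single rebalance pass. Here rebalance_once is a hypothetical callable that performs one pass and returns the resulting balance as a percentage off perfect, and max_rounds is a safety cap added for the example; neither is taken from the ring builder's code.

```python
def rebalance_until_settled(rebalance_once, max_rounds=50):
    """Repeat rebalance passes until near perfect or no longer improving."""
    previous = None
    balance = None
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        balance = rebalance_once()
        if balance < 1.0:
            break  # near perfect: less than 1% off
        if previous is not None and previous - balance < 1.0:
            break  # improved by less than 1%, so stop trying
        previous = balance
    return balance, rounds


# Simulated balance values from successive passes (invented numbers).
stalling = iter([12.0, 6.0, 5.5, 5.4])
stalled_result = rebalance_until_settled(lambda: next(stalling))
print(stalled_result)  # (5.5, 3)

converging = iter([8.0, 3.0, 0.4])
converged_result = rebalance_until_settled(lambda: next(converging))
print(converged_result)  # (0.4, 3)
```

The first run stops because the third pass improved the balance by only 0.5%, and the second stops because the balance dropped below 1%.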
Ring Builder Analyzer
This is a tool for analyzing how well the ring builder performs its job in a particular scenario. It is intended to help developers quantify any improvements or regressions in the ring builder; it is probably not useful to others.
The ring builder analyzer takes a scenario file containing some initial parameters for a ring builder plus a certain number of rounds. In each round, some modifications are made to the builder, e.g. add a device, remove a device, change a device's weight. Then, the builder is repeatedly rebalanced until it settles down. Data about that round is printed, and the next round begins.
Scenarios are specified in JSON. Example scenario for a gradual device addition:
{
    "part_power": 12,
    "replicas": 3,
    "overload": 0.1,
    "random_seed": 203488,
    "rounds": [
        [
            ["add", "r1z2-10.20.30.40:6000/sda", 8000],
            ["add", "r1z2-10.20.30.40:6000/sdb", 8000],
            ["add", "r1z2-10.20.30.40:6000/sdc", 8000],
            ["add", "r1z2-10.20.30.40:6000/sdd", 8000],
            ["add", "r1z2-10.20.30.41:6000/sda", 8000],
            ["add", "r1z2-10.20.30.41:6000/sdb", 8000],
            ["add", "r1z2-10.20.30.41:6000/sdc", 8000],
            ["add", "r1z2-10.20.30.41:6000/sdd", 8000],
            ["add", "r1z2-10.20.30.43:6000/sda", 8000],
            ["add", "r1z2-10.20.30.43:6000/sdb", 8000],
            ["add", "r1z2-10.20.30.43:6000/sdc", 8000],
            ["add", "r1z2-10.20.30.43:6000/sdd", 8000],
            ["add", "r1z2-10.20.30.44:6000/sda", 8000],
            ["add", "r1z2-10.20.30.44:6000/sdb", 8000],
            ["add", "r1z2-10.20.30.44:6000/sdc", 8000]
        ],
        [
            ["add", "r1z2-10.20.30.44:6000/sdd", 1000]
        ],
        [
            ["set_weight", 15, 2000]
        ],
        [
            ["remove", 3],
            ["set_weight", 15, 3000]
        ],
        [
            ["set_weight", 15, 4000]
        ],
        [
            ["set_weight", 15, 5000]
        ],
        [
            ["set_weight", 15, 6000]
        ],
        [
            ["set_weight", 15, 7000]
        ],
        [
            ["set_weight", 15, 8000]
        ]
    ]
}
History
The ring code went through many iterations before arriving at what it is now, and while it has been stable for a while, the algorithm may be tweaked or perhaps even fundamentally changed if new ideas emerge. This section describes the ideas that were tried previously and explains why they were discarded.
A "live ring" option was considered where each server could maintain its own copy of the ring and the servers would use a gossip protocol to communicate the changes they made. This was discarded as too complex and error prone to code correctly in the project time span available. One bug could easily gossip bad data out to the entire cluster and be difficult to recover from. Having an externally managed ring simplifies the process, allows full validation of data before it's shipped out to the servers, and guarantees each server is using a ring from the same timeline. It also means that the servers themselves aren't spending a lot of resources maintaining rings.
A couple of "ring server" options were considered. One was where all ring lookups would be done by calling a service on a separate server or set of servers, but this was discarded due to the latency involved. Another was much like the current process, but where servers could submit change requests to the ring server to have a new ring built and shipped back out to the servers. This was discarded due to project time constraints and because ring changes are currently infrequent enough that manual control was sufficient. However, the lack of quick automatic ring changes did mean that other parts of the system had to be coded to handle devices being unavailable for a period of hours until someone could manually update the ring.
The current ring process has each replica of a partition independently assigned to a device. A version of the ring that used a third of the memory was tried, where the first replica of a partition was directly assigned and the other two were determined by "walking" the ring until finding additional devices in other zones. This was discarded because control was lost over how many replicas of a given partition moved at once. Keeping each replica independent allows for moving only one partition replica within a given time window (except due to device failures). Using the additional memory was deemed a good trade-off for moving data around the cluster much less often.
Another ring design was tried where the partition-to-device assignments weren't stored in a big list in memory; instead, each device was assigned a set of hashes, or anchors. The partition would be determined from the data item's hash, and the nearest device anchors would determine where the replicas should be stored. However, to get a reasonable distribution of data each device had to have a lot of anchors, and the cost of walking through those anchors to find replicas started to add up. In the end, the memory savings weren't that great and more processing power was used, so the idea was discarded.
A completely non-partitioned ring was also tried but discarded, as the partitioning helps many other parts of the system, especially replication. Replication can be attempted and retried in a partition batch with the other replicas, rather than each data item being independently attempted and retried. Hashes of directory structures can be calculated and compared with other replicas to reduce directory walking and network traffic.
Partitioning and independently assigning partition replicas also allowed for the best balanced cluster. The best of the other strategies tended to give ±10% variance on device balance with devices of equal weight and ±15% with devices of varying weights. The current strategy allows us to get ±3% and ±8% respectively.
Various hashing algorithms were tried. SHA offers better security, but the ring doesn't need to be cryptographically secure and SHA is slower. Murmur was much faster, but MD5 was built-in and hash computation is a small percentage of the overall request handling time. In all, once it was decided the servers wouldn't be maintaining the rings themselves anyway and would only be doing hash lookups, MD5 was chosen for its general availability, good distribution, and adequate speed.
The placement algorithm has seen a number of behavioral changes for unbalanceable rings. The ring builder wants to keep replicas as far apart as possible while still respecting device weights. In most cases, the ring builder can achieve both, but sometimes they conflict. At first, the behavior was to keep the replicas far apart and ignore device weight, but that made it impossible to gradually go from one region to two, or from two to three. Then it was changed to favor device weight over dispersion, but that wasn't so good for rings that were close to balanceable, like 3 machines with 60TB, 60TB, and 57TB of disk space; operators were expecting one replica per machine, but didn't always get it. After that, overload was added to the ring builder so that operators could choose a balance between dispersion and device weights.