构建一致性哈希ring Part2

本文探讨了一致性哈希ring的构建过程,从ring3_0.py到ring4_1.py的改进,涉及节点扩容、分区分配、副本(replica)策略以及故障隔离的zone概念。通过随机化分区顺序解决了数据分布不均和故障隔离问题。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

******************转载请注明出处!**********

最后更新:2011年8月22日22:47:38


Part1中,已经构建好了一致性哈希ring的原型。

但存在一个问题。100个结点对应着1000个虚结点。结点变动时,虚结点和结点的对应关系会发生变化。当100个结点扩张到1001个时,会发生什么情况?

新增的结点数目会挤兑掉原先的结点数目!原因就在于1000个虚结点是固定的,不变化的。如果再扩容1000个虚结点->更改虚结点和实结点之间的对应关系->调整数据,这似乎又回到ring2_0.py的老套路。

这里做一些改动以更接近真实情况。

首先是vnode,以后改称为partition。因为partition数量很少会变动,所以需要充分估计到系统预期的规模。假如不会超过6000个结点,那么虚结点可以设置为实结点的100倍。这样,当虚结点需要调整的时候,最多只会影响到1%的数据。

在计算机中,数字取2的幂阶会有一些好处。比如除法只需要移相应的位就可以了。所以ring3_0.py中,结点数为65536(2^16),partition数为8388608(2^23)个。


ring3_0.py

from array import array
from hashlib import md5
from struct import unpack_from
from time import time

PARTITION_POWER = 23
PARTITION_SHIFT = 32 - PARTITION_POWER
NODE_COUNT = 65536
DATA_ID_COUNT = 100000000

begin = time()
part2node = array('H')
for part in xrange(2 ** PARTITION_POWER):
    part2node.append(part % NODE_COUNT)
node_counts = [0] * NODE_COUNT
for data_id in xrange(DATA_ID_COUNT):
    data_id = str(data_id)
    part = unpack_from('>I',
                     md5(str(data_id)).digest())[0] >> PARTITION_SHIFT
    node_id = part2node[part]
    node_counts[node_id] += 1
desired_count = DATA_ID_COUNT / NODE_COUNT

print '%d: Desier data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d Least data ids on one node, %.02f%% under' % (min_count, under)
print '%d seconds pass ...' % (time() - begin)
结果

1525: Desier data ids per node
1683 Most data ids on one node, 10.36% over
1360 Least data ids on one node, 10.82% under
234 seconds pass ...

说明:

代码比较简单,part2node是node和partition的对应关系,node_counts记录着每个node中有多少个数据映射进来。

gholt特意提到系统开销:2Byte存储16位对应结点id, 600多万个partition只对应占用12MB的内存。gholt还提到ring3_0.py结果出现10%波动的问题。主要是因为数据空间(10^8相对于partition数(约8*10^6)太小了。他尝试过更大的数据空间空间,比如10^9个数据,对应于8*10^6的partition,耗时6个小时。-_-!


ring3_0.py已经有点接近现实中一致性哈希ring了,ring4_0.py中将加入replica的特性。


ring4_0.py

from array import array
from struct import unpack_from
from hashlib import md5
from time import time

REPLICAS = 3
PARTITION_POWER = 16
PARTITION_SHIFT = 32 - PARTITION_POWER
PARTITION_MAX = 2 ** PARTITION_POWER - 1
NODE_COUNT = 256
DATA_ID_COUNT = 10000000

begin = time()
part2node = array('H')
for part in xrange(2 ** PARTITION_POWER):
    part2node.append(part % NODE_COUNT) #(3)
node_counts = [0] * NODE_COUNT
for data_id in xrange(DATA_ID_COUNT):
    data_id = str(data_id)
    part = unpack_from('>I',
                     md5(str(data_id)).digest())[0] >> PARTITION_SHIFT
    node_ids = [part2node[part]] #(1)
    node_counts[node_ids[0]] += 1
    for replica in xrange(1, REPLICAS):
        while part2node[part] in node_ids: #(2)
            part += 1
            if part > PARTITION_MAX:
                part = 0
        node_ids.append(part2node[part])
        node_counts[node_ids[-1]] += 1
desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS
print'%d: Desired data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print'%d: Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print'%d: Least data ids on one node,%.02f%% under' % (min_count, under)
print'%d seconds pass ...' % (time() - begin)
结果:

117186: Desired data ids per node
118133: Most data ids on one node, 0.81% over
116093: Least data ids on one node,0.93% under
52 seconds pass ...

说明:

#(1) node_ids记录的是3个replica存放的node id。part2node[part]是根据对应的partition id 找到对应的node id。

#(2)处循环3次,依次为数据的replica安排连续的partition。之后改变相应的记录,node_ids.和node_counts。

ring4_0.py看起来还不错,1%的波动。可是仍然会有两个问题:

1. #(2)处是分配连续的partition。#(3)处初始化时是有规律的。这会在一些情况下使部分数据表现很糟糕。比如这部分数据被映射到node x的partition N上,node x频繁宕机,而总是partition N上的数据需要进行复制。解决也相对容易: part2node初始化时进行随机打乱,partition N和node x不在存在某种联系。

2 数据安全的问题。假如replica所在的partiton都分布在同一个机架上。机架掉电,这会导致所有replica都不可用。因此需要一种机制对故障进行隔离。因此也就引入了zone的概念。


ring4_1.py

from array import array
from struct import unpack_from
from hashlib import md5
from time import time
from random import shuffle

REPLICAS = 3
PARTITION_POWER = 16
PARTITION_SHIFT = 32 - 16
PARTITION_MAX = 2 ** PARTITION_POWER - 1
NODE_COUNT = 256
ZONE_COUNT = 16
DATA_ID_COUNT = 10000000

begin = time()
node2zone = []
while len(node2zone) < NODE_COUNT: #(1)
    zone = 0
    while zone < ZONE_COUNT and len(node2zone) < NODE_COUNT:
        node2zone.append(zone)
        zone += 1
part2node = array('H')
for part in xrange(2 ** PARTITION_POWER):
    part2node.append(part % NODE_COUNT)
shuffle(part2node)  #(2)
node_counts = [0] * NODE_COUNT
zone_counts = [0] * ZONE_COUNT
for data_id in xrange(DATA_ID_COUNT):
    data_id = str(data_id)
    part = unpack_from('>I',
                     md5(str(data_id)).digest())[0] >> PARTITION_SHIFT
    node_ids = [part2node[part]]
    zones = [node2zone[node_ids[0]]]
    node_counts[node_ids[0]] += 1
    zone_counts[zones[0]] += 1
    for replica in xrange(1, REPLICAS):  #(3)
        while part2node[part] in node_ids and \
        node2zone[part2node[part]] in zones:
            part += 1
            if part > PARTITION_MAX:
                part = 0
        node_ids.append(part2node[part])
        zones.append(node2zone[node_ids[-1]])
        node_counts[node_ids[-1]] += 1
        zone_counts[zones[-1]] += 1
        
desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS
print '%d Desired data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one node,%.02f%% under' % (min_count, under)

desired_count = DATA_ID_COUNT / ZONE_COUNT * REPLICAS
print'%d: Desired data ids per zone' % desired_count
max_count = max(zone_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print'%d: Most data ids in one zone, %.02f %% over' % (max_count, over)
min_count = min(zone_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print'%d: Least data ids in one zone, %.02f %%under' % (min_count, under)
print '%d seconds pass ...' % (time() - begin)
结果:

117186 Desired data ids per node
118619: Most data ids on one node, 1.22% over
115810: Least data ids on one node,1.17% under
1875000: Desired data ids per zone
1879143: Most data ids in one zone, 0.22 % over
1871851: Least data ids in one zone, 0.17 %under
69 seconds pass ...

说明:

#(1)zonelist初始化。引入zone,以解决ring4_0.py问题2。

#(2) part2node洗牌,打乱顺序。解决ring4_0.py问题1。花的时间比较多。

#(3) 逐次探查partition位置是否合适,不能在同一个node上,也不能在同一个zone上。

 


到此,ring的基本功能都已经有了:一致性哈希ring,replica,zone。ring4_2.py给出类似图2.1的实现。当然,只是一种实现模型。swift曾经采用过这种模型,但现已废弃。这里权当扩展了解,不感兴趣的请跳到下一Part。Part3将把上述的特性封装成类,加入weight,并简单测试。

 

ring4_2.py体现的是一种anchor思想。即:维护一个node的anchor环。每次data都会查找anchor环找到和匹配自己的存储node位置。

anchor环就是ring4_2.py中的hash2index和index2node。每个数据都在#(2)处查找index,之后到index2node找对应的node。#(1)中的二分查找、hash计算都是比较耗时的。系统开销暂不提,这种实现并不能够均匀分布数据。为使数据分布的更为均匀,每个node都要维护anchor,不断遍历anchor环以查找合适位置。而且,这种情况下replica的管理会非常麻烦。

空间和时间,计算机中的博弈。ring4_1.py采用的空间换时间,简化了对应关系;而ring4_2.py采用了时间换空间,思路简单,但效果不尽人意。


ring4_2.py

from bisect import bisect_left
from hashlib import md5
from struct import unpack_from
from time import time

REPLICAS = 3
NODE_COUNT = 256
ZONE_COUNT = 16
DATA_ID_COUNT = 10000000
VNODE_COUNT = 100

begin = time()
node2zone = []
while len(node2zone) < NODE_COUNT:
    zone = 0
    while zone < ZONE_COUNT and len(node2zone) < NODE_COUNT:
        node2zone.append(zone)
        zone += 1
hash2index = []
index2node = []
for node in xrange(NODE_COUNT):
    for vnode in xrange(VNODE_COUNT):
        hsh = unpack_from('>I', md5(str(node)).digest())[0]
        index = bisect_left(hash2index, hsh)
        if index > len(hash2index):
            index = 0
        hash2index.insert(index, hsh)
        index2node.insert(index, node)
node_counts = [0] * NODE_COUNT
zone_counts = [0] * ZONE_COUNT
for data_id in xrange(DATA_ID_COUNT):  #(1)
    data_id = str(data_id)
    hsh = unpack_from('>I', md5(str(data_id)).digest())[0]
    index = bisect_left(hash2index, hsh)  #(2)
    if index >= len(hash2index):
        index = 0
    node_ids = [index2node[index]]
    zones = [node2zone[node_ids[0]]]
    node_counts[node_ids[0]] += 1
    zone_counts[zones[0]] += 1
    for replica in xrange(1, REPLICAS):
        while index2node[index] in node_ids and node2zone[index2node[index]] in zones:
            index += 1
            if index >= len(hash2index):
                index = 0
        node_ids.append(index2node[index])
        zones.append(node2zone[node_ids[-1]])
        node_counts[node_ids[-1]] += 1
        zone_counts[zones[-1]] += 1
desired_count = DATA_ID_COUNT / NODE_COUNT * REPLICAS
print '%d: Desired data ids per node' % desired_count
max_count = max(node_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one node, %.02f%% over' % (max_count, over)
min_count = min(node_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one node, %.02f%% under' % (min_count, under)

desired_count = DATA_ID_COUNT / ZONE_COUNT * REPLICAS
print '%d: Desired data ids per zone' % desired_count
max_count = max(zone_counts)
over = 100.0 * (max_count - desired_count) / desired_count
print '%d: Most data ids on one zone, %.02f%% over' % (max_count, over)
min_count = min(zone_counts)
under = 100.0 * (desired_count - min_count) / desired_count
print '%d: Least data ids on one zone, %.02f%% under' % (min_count, under)

print '%d seconds pass ...' % (time() - begin)
结果:

117186: Desired data ids per node
351282: Most data ids on one node, 199.76% over
15965: Least data ids on one node, 86.38% under
1875000: Desired data ids per zone
2248496: Most data ids on one zone, 19.92% over
1378013: Least data ids on one zone, 26.51% under
990 seconds pass ...



评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值