Continuing the source analysis from the previous article, this time we focus on the swift-ring-builder script. Its most complex, and most important, operation is the rebalance method. It regenerates the ring file after you modify the builder file (for example, by adding or removing devices), so that partitions are distributed evenly across the system (after a rebalance, the system's services need to be restarted). The consistent hashing algorithm and the concepts of replicas, zones, and weights are all realized through it.
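Before diving into rebalance, it helps to recall what the ring's partition table is for: mapping an object to a partition. The sketch below is a simplified illustration of that idea, not Swift's exact code (real Swift also mixes a cluster-wide hash path suffix into the key before hashing):

```python
import hashlib
import struct

def get_part(path, part_power):
    """Simplified sketch: md5 the object path and keep the top
    part_power bits of the first four bytes as the partition index."""
    digest = hashlib.md5(path.encode('utf-8')).digest()
    part_shift = 32 - part_power
    # interpret the first 4 bytes as a big-endian unsigned int, then shift
    return struct.unpack('>I', digest[:4])[0] >> part_shift

part = get_part('/account/container/object', 18)
```

The partition number is then an index into the `_replica2part2dev` table that rebalance maintains, which is why only that table (and not every object) has to be updated when devices change.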
Source snippet: the rebalance method in swift-ring-builder.
def rebalance():
    """
swift-ring-builder <builder_file> rebalance
    Attempts to rebalance the ring by reassigning partitions that haven't been
    recently reassigned.
    """
    devs_changed = builder.devs_changed  # whether the builder's devs changed; False by default, set to True by add_dev, set_dev_weight, and remove_dev
    try:
        last_balance = builder.get_balance()  # the ring's current balance, e.g. 0.83
        parts, balance = builder.rebalance()  # the main rebalance call; returns the number of reassigned partitions and the new balance
    except exceptions.RingBuilderError, e:
        print '-' * 79
        print ("An error has occurred during ring validation. Common\n"
               "causes of failure are rings that are empty or do not\n"
               "have enough devices to accommodate the replica count.\n"
               "Original exception message:\n %s" % e.message
               )
        print '-' * 79
        exit(EXIT_ERROR)
    if not parts:
        print 'No partitions could be reassigned.'
        print 'Either none need to be or none can be due to ' \
              'min_part_hours [%s].' % builder.min_part_hours
        exit(EXIT_WARNING)
    if not devs_changed and abs(last_balance - balance) < 1:
        print 'Cowardly refusing to save rebalance as it did not change ' \
              'at least 1%.'
        exit(EXIT_WARNING)
    try:
        builder.validate()  # safety check to catch bugs: ensures every partition is assigned to a real device, none is assigned twice, and so on
    except exceptions.RingValidationError, e:
        print '-' * 79
        print ("An error has occurred during ring validation. Common\n"
               "causes of failure are rings that are empty or do not\n"
               "have enough devices to accommodate the replica count.\n"
               "Original exception message:\n %s" % e.message
               )
        print '-' * 79
        exit(EXIT_ERROR)
    print 'Reassigned %d (%.02f%%) partitions. Balance is now %.02f.' % \
          (parts, 100.0 * parts / builder.parts, balance)  # print the rebalance result
    status = EXIT_SUCCESS
    if balance > 5:  # a balance above 5 prompts a note to wait at least min_part_hours before rebalancing again
        print '-' * 79
        print 'NOTE: Balance of %.02f indicates you should push this ' % \
              balance
        print '    ring, wait at least %d hours, and rebalance/repush.' \
              % builder.min_part_hours
        print '-' * 79
        status = EXIT_WARNING
    ts = time()  # timestamp used to name the backup files
    builder.get_ring().save(  # back up the newly generated ring file
        pathjoin(backup_dir, '%d.' % ts + basename(ring_file)))
    pickle.dump(builder.to_dict(), open(pathjoin(backup_dir,
                '%d.' % ts + basename(argv[1])), 'wb'), protocol=2)
    builder.get_ring().save(ring_file)
    pickle.dump(builder.to_dict(), open(argv[1], 'wb'), protocol=2)
    exit(status)
I have added some comments of my own to aid understanding. The CLI function actually delegates to the rebalance method in builder.py.
The rebalance method in builder.py:
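The balance figure that the CLI compares and prints comes from builder.get_balance(). The following is a hedged sketch of the idea, not Swift's exact implementation (the real method works from the builder's own parts_wanted bookkeeping): the balance is the worst percentage deviation of any device's partition count from the share its weight entitles it to.

```python
def get_balance(devs, parts, replicas):
    """Worst-case percentage deviation of any device from its fair share.

    devs: list of dicts with 'weight' and 'parts' (assigned partition count);
    parts: total partitions in the ring; replicas: replica count.
    """
    total_weight = sum(d['weight'] for d in devs)
    balance = 0.0
    for d in devs:
        # the number of partition-replicas this device's weight entitles it to
        desired = parts * replicas * d['weight'] / total_weight
        if desired:
            balance = max(balance, abs(100.0 * d['parts'] / desired - 100.0))
    return balance
```

With this definition, a perfectly even ring has balance 0, and the CLI's "did not change at least 1%" check compares two such figures.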
def rebalance(self):
    """
    Rebalance the ring.

    This is the main work function of the builder, as it will assign and
    reassign partitions to devices in the ring based on weights, distinct
    zones, recent reassignments, etc.

    The process doesn't always perfectly assign partitions (that'd take a
    lot more analysis and therefore a lot more time -- I had code that did
    that before). Because of this, it keeps rebalancing until the device
    skew (number of partitions a device wants compared to what it has) gets
    below 1% or doesn't change by more than 1% (only happens with ring that
    can't be balanced no matter what -- like with 3 zones of differing
    weights with replicas set to 3).

    :returns: (number_of_partitions_altered, resulting_balance)
    """
    self._ring = None  # drop the instance's cached ring
    if self._last_part_moves_epoch is None:
        self._initial_balance()  # first rebalance: does the initial assignment, plus some initialization
        self.devs_changed = False
        return self.parts, self.get_balance()
    retval = 0
    self._update_last_part_moves()  # update the part-moved timestamps
    last_balance = 0
    while True:
        reassign_parts = self._gather_reassign_parts()  # returns a list of (part, replicas) pairs that need reassignment
        self._reassign_parts(reassign_parts)  # the actual reassignment work
        retval += len(reassign_parts)
        while self._remove_devs:
            self.devs[self._remove_devs.pop()['id']] = None  # drop the corresponding device
        balance = self.get_balance()  # fetch the new balance
        if balance < 1 or abs(last_balance - balance) < 1 or \
                retval == self.parts:
            break
        last_balance = balance
    self.devs_changed = False
    self.version += 1
    return retval, balance
The code chooses its path based on whether _last_part_moves_epoch is None. If it is None (meaning this is the first rebalance), _initial_balance() is called and the result returned; its work is largely the same as the non-None path, except that _initial_balance also performs some initialization. The method that actually carries out the reassignment is _reassign_parts.
The _reassign_parts method in builder.py, which does the actual partition reassignment:
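The min_part_hours gate that _update_last_part_moves and _gather_reassign_parts implement can be sketched on its own. This is a hedged re-implementation of the idea, not Swift's code (the class and method names here are made up): each partition remembers how many hours have passed since it last moved, and only partitions at or past min_part_hours may be gathered for reassignment again.

```python
from time import time

class PartMoves:
    """Made-up sketch of per-partition move-age bookkeeping."""

    def __init__(self, num_parts, min_part_hours):
        self.min_part_hours = min_part_hours
        # hours since each partition last moved; 255 means "effectively never"
        self.last_part_moves = [255] * num_parts
        self.epoch = time()

    def update(self, now=None):
        # age every partition by the whole hours elapsed since the last update
        now = time() if now is None else now
        elapsed_hours = int((now - self.epoch) / 3600)
        if elapsed_hours:
            for part in range(len(self.last_part_moves)):
                self.last_part_moves[part] = min(
                    255, self.last_part_moves[part] + elapsed_hours)
            self.epoch = now

    def movable(self, part):
        return self.last_part_moves[part] >= self.min_part_hours

    def record_move(self, part):
        self.last_part_moves[part] = 0
```

This is why the CLI can print "none can be due to min_part_hours": a partition moved by the previous rebalance is simply not eligible until it has aged.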
def _reassign_parts(self, reassign_parts):
    """
    For an existing ring data set, partitions are reassigned similarly to
    the initial assignment. The devices are ordered by how many partitions
    they still want and kept in that order throughout the process. The
    gathered partitions are iterated through, assigning them to devices
    according to the "most wanted" while keeping the replicas as "far
    apart" as possible. Two different zones are considered the
    farthest-apart things, followed by different ip/port pairs within a
    zone; the least-far-apart things are different devices with the same
    ip/port pair in the same zone.

    If you want more replicas than devices, you won't get all your
    replicas.

    :param reassign_parts: An iterable of (part, replicas_to_replace)
                           pairs. replicas_to_replace is an iterable of the
                           replica (an int) to replace for that partition.
                           replicas_to_replace may be shared for multiple
                           partitions, so be sure you do not modify it.
    """
    for dev in self._iter_devs():
        dev['sort_key'] = self._sort_key_for(dev)  # compute a sort_key for every device
    available_devs = \
        sorted((d for d in self._iter_devs() if d['weight']),
               key=lambda x: x['sort_key'])  # the usable devices, ordered by sort_key

    tier2children = build_tier_tree(available_devs)  # build the tier tree over the devices

    tier2devs = defaultdict(list)  # devices per tier
    tier2sort_key = defaultdict(list)  # sort keys per tier
    tiers_by_depth = defaultdict(set)  # tiers grouped by depth
    for dev in available_devs:  # index each device under all of its tiers
        for tier in tiers_for_dev(dev):
            tier2devs[tier].append(dev)  # <-- starts out sorted!
            tier2sort_key[tier].append(dev['sort_key'])
            tiers_by_depth[len(tier)].add(tier)

    for part, replace_replicas in reassign_parts:
        # Gather up what other tiers (zones, ip_ports, and devices) the
        # replicas not-to-be-moved are in for this part.
        other_replicas = defaultdict(lambda: 0)  # replica counts keyed by tier (zone, ip_port, device id)
        for replica in xrange(self.replicas):
            if replica not in replace_replicas:
                dev = self.devs[self._replica2part2dev[replica][part]]
                for tier in tiers_for_dev(dev):
                    other_replicas[tier] += 1  # tiers keeping a replica of this part get +1

        def find_home_for_replica(tier=(), depth=1):
            # Order the tiers by how many replicas of this
            # partition they already have. Then, of the ones
            # with the smallest number of replicas, pick the
            # tier with the hungriest drive and then continue
            # searching in that subtree.
            #
            # There are other strategies we could use here,
            # such as hungriest-tier (i.e. biggest
            # sum-of-parts-wanted) or picking one at random.
            # However, hungriest-drive is what was used here
            # before, and it worked pretty well in practice.
            #
            # Note that this allocator will balance things as
            # evenly as possible at each level of the device
            # layout. If your layout is extremely unbalanced,
            # this may produce poor results.
            candidate_tiers = tier2children[tier]  # level by level, find the tiers with the fewest replicas of this part
            min_count = min(other_replicas[t] for t in candidate_tiers)
            candidate_tiers = [t for t in candidate_tiers
                               if other_replicas[t] == min_count]
            candidate_tiers.sort(
                key=lambda t: tier2sort_key[t][-1])

            if depth == max(tiers_by_depth.keys()):
                return tier2devs[candidate_tiers[-1]][-1]

            return find_home_for_replica(tier=candidate_tiers[-1],
                                         depth=depth + 1)

        for replica in replace_replicas:  # place each replica that must move
            dev = find_home_for_replica()
            dev['parts_wanted'] -= 1
            dev['parts'] += 1
            old_sort_key = dev['sort_key']
            new_sort_key = dev['sort_key'] = self._sort_key_for(dev)
            for tier in tiers_for_dev(dev):
                other_replicas[tier] += 1

                index = bisect.bisect_left(tier2sort_key[tier],
                                           old_sort_key)
                tier2devs[tier].pop(index)
                tier2sort_key[tier].pop(index)

                new_index = bisect.bisect_left(tier2sort_key[tier],
                                               new_sort_key)
                tier2devs[tier].insert(new_index, dev)
                tier2sort_key[tier].insert(new_index, new_sort_key)

            self._replica2part2dev[replica][part] = dev['id']  # this replica of this part now lives on dev['id']

    # Just to save memory and keep from accidental reuse.
    for dev in self._iter_devs():
        del dev['sort_key']
This function implements the reassignment. The key concept is the three-level tier structure, built by utils.py: given a device (or a set of devices), it returns a dictionary describing this three-level hierarchy.
The source gives us an example:
Example:
zone 1 -+---- 192.168.1.1:6000 -+---- device id 0
| |
| +---- device id 1
| |
| +---- device id 2
|
+---- 192.168.1.2:6000 -+---- device id 3
|
+---- device id 4
|
+---- device id 5
zone 2 -+---- 192.168.2.1:6000 -+---- device id 6
| |
| +---- device id 7
| |
| +---- device id 8
|
+---- 192.168.2.2:6000 -+---- device id 9
|
+---- device id 10
|
+---- device id 11
The tier tree would look like:
{
  (): [(1,), (2,)],
  (1,): [(1, 192.168.1.1:6000),
         (1, 192.168.1.2:6000)],
  (2,): [(2, 192.168.2.1:6000),
         (2, 192.168.2.2:6000)],
  (1, 192.168.1.1:6000): [(1, 192.168.1.1:6000, 0),
                          (1, 192.168.1.1:6000, 1),
                          (1, 192.168.1.1:6000, 2)],
  (1, 192.168.1.2:6000): [(1, 192.168.1.2:6000, 3),
                          (1, 192.168.1.2:6000, 4),
                          (1, 192.168.1.2:6000, 5)],
  (2, 192.168.2.1:6000): [(2, 192.168.2.1:6000, 6),
                          (2, 192.168.2.1:6000, 7),
                          (2, 192.168.2.1:6000, 8)],
  (2, 192.168.2.2:6000): [(2, 192.168.2.2:6000, 9),
                          (2, 192.168.2.2:6000, 10),
                          (2, 192.168.2.2:6000, 11)],
}
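The example above can be reproduced with a small sketch of tiers_for_dev and build_tier_tree. This is a simplification for illustration: here the device dict is assumed to carry an 'ip_port' key directly, whereas the real utils.py code builds that string from the device's ip and port fields.

```python
from collections import defaultdict

def tiers_for_dev(dev):
    # the three tiers a device belongs to:
    # (zone,), (zone, ip_port), (zone, ip_port, device id)
    return ((dev['zone'],),
            (dev['zone'], dev['ip_port']),
            (dev['zone'], dev['ip_port'], dev['id']))

def build_tier_tree(devs):
    # map each tier to the set of its child tiers; () is the root
    tier2children = defaultdict(set)
    for dev in devs:
        t1, t2, t3 = tiers_for_dev(dev)
        tier2children[()].add(t1)
        tier2children[t1].add(t2)
        tier2children[t2].add(t3)
    return tier2children

devs = [{'zone': 1, 'ip_port': '192.168.1.1:6000', 'id': 0},
        {'zone': 1, 'ip_port': '192.168.1.1:6000', 'id': 1},
        {'zone': 2, 'ip_port': '192.168.2.1:6000', 'id': 6}]
tree = build_tier_tree(devs)
```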
The devices are split into three levels by zone, ip_port, and device id; subsequent operations work level by level through this hierarchy (which is exactly where the zone and replica concepts are realized).
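The greedy descent in find_home_for_replica can be sketched independently of the builder's bookkeeping. In this hedged re-implementation (names are made up), tier2children maps a tier tuple to its child tiers (empty at the leaves), replicas_in_tier counts how many replicas of the current partition each tier already holds, and hunger stands in for the sorted "hungriest drive" ordering:

```python
from collections import defaultdict

def find_home(tier2children, replicas_in_tier, hunger, tier=()):
    children = tier2children[tier]
    if not children:
        return tier  # reached a leaf tier, i.e. a concrete device
    # keep only the child tiers with the fewest replicas of this partition
    fewest = min(replicas_in_tier[t] for t in children)
    candidates = [t for t in children if replicas_in_tier[t] == fewest]
    # among ties, descend into the subtree with the hungriest device
    best = max(candidates, key=lambda t: hunger[t])
    return find_home(tier2children, replicas_in_tier, hunger, best)
```

With a zone that already holds a replica, the descent is steered into the other zone first, which is how replicas end up "as far apart as possible".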
With that, a ring rebalance is complete. Finally, the new builder file and ring file are saved; the ring file is produced from the generated builder via methods of the RingData class, which is fairly simple and is not analyzed here.
This covers, in broad strokes, swift-ring-builder and the files under /swift/common/ring/; for the specifics of each function, consult the source. In the next article I will analyze swift-init, using its start method to walk through how the services are launched.
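As the pickle.dump calls in the CLI listing show, the builder file itself is just a pickled dict (protocol 2). A sketch of that round trip, using a made-up minimal state dict (the real to_dict() carries the full partition table, device list, and bookkeeping fields):

```python
import os
import pickle
import tempfile

# made-up minimal builder state, for illustration only
state = {'part_power': 18, 'replicas': 3, 'min_part_hours': 1,
         'devs': [], 'version': 1}

path = os.path.join(tempfile.mkdtemp(), 'object.builder')
with open(path, 'wb') as f:
    pickle.dump(state, f, protocol=2)  # protocol 2, as in the CLI code
with open(path, 'rb') as f:
    loaded = pickle.load(f)
```

This is why a *.builder file can simply be loaded back with pickle.load and edited by later swift-ring-builder invocations.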