Election Algorithm: Source Code Approach
ZooKeeper's Leader election class is FastLeaderElection; it is the engineering application of the ZAB protocol's Leader election, so we go straight to this class for analysis. Its most important method is lookForLeader(), the core method of Leader election. The method roughly breaks down into the following parts:
1 Preparation before the election
Before the election some preparation work is needed, for example creating the election objects, creating the collections used during the election, and initializing the election time limit.
2 Cast the initial ballot for itself as Leader
When the current Server casts its first vote, it first proposes itself as Leader and then broadcasts its own ballot to all other Servers.
3 Check whether its own ballot or a received ballot proposes a better Leader
After this "vote for myself" step, the current Server also receives the ballot notifications (Notification) sent by the other Servers. In a while loop it goes through all received notifications and compares which candidate is better suited to be Leader. If it finds a candidate more suitable than its own, it updates its own ballot and broadcasts the new ballot again. Of course, every ballot that has been checked is also recorded in a collection, which is later used for tallying the votes.
4 Decide whether this round of the election can end
In fact, right after each such "who is better suited to be Leader" check, the Server immediately determines whether the current election can already end, i.e. whether the ballot this host currently recommends has received more than half of the votes. If so, it directly performs the remaining clean-up work, for example clearing the collections used during the election so they are ready for the next one, and generating the final ballot so that other Servers can synchronize data against it. If not, it reads the next ballot sent by another host from the queue and checks it in the same way (the sketch below models these four steps).
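To make these four steps concrete, here is a minimal, self-contained sketch of the control flow. It is not ZooKeeper code: the Vote record, the incoming queue and the fixed quorum size are stand-ins, and details such as election-round handling, server states and timeout back-off are left out; the comparison rule in betterThan() only mirrors the order FastLeaderElection uses (epoch, then zxid, then server id).

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class LookForLeaderSketch {

    /** A simplified ballot: the proposed leader, that leader's zxid/epoch, and who sent it. */
    record Vote(long leader, long zxid, long peerEpoch, long senderSid) {}

    /** Step 3's comparison rule: higher epoch wins, then higher zxid, then higher server id. */
    static boolean betterThan(Vote n, long curLeader, long curZxid, long curEpoch) {
        return (n.peerEpoch() > curEpoch)
                || (n.peerEpoch() == curEpoch && n.zxid() > curZxid)
                || (n.peerEpoch() == curEpoch && n.zxid() == curZxid && n.leader() > curLeader);
    }

    static Vote lookForLeader(long myId, long myZxid, long myEpoch,
                              BlockingQueue<Vote> incoming,
                              int quorumSize) throws InterruptedException {
        // Step 1: prepare the collection used to tally the votes.
        Map<Long, Vote> received = new HashMap<>();          // senderSid -> latest vote seen

        // Step 2: propose myself first and broadcast that ballot
        // (ZooKeeper also queues a notification to itself, so its own vote is tallied too).
        long propLeader = myId, propZxid = myZxid, propEpoch = myEpoch;
        broadcast(new Vote(propLeader, propZxid, propEpoch, myId));

        while (true) {
            Vote n = incoming.poll(200, TimeUnit.MILLISECONDS); // finalizeWait-like timeout
            if (n == null) {
                continue;                                       // nothing received yet, keep waiting
            }

            // Step 3: if the incoming ballot proposes a better leader, adopt it and rebroadcast.
            if (betterThan(n, propLeader, propZxid, propEpoch)) {
                propLeader = n.leader();
                propZxid = n.zxid();
                propEpoch = n.peerEpoch();
                broadcast(new Vote(propLeader, propZxid, propEpoch, myId));
            }
            received.put(n.senderSid(), n);                     // record every checked ballot

            // Step 4: does my current proposal already have more than half of the votes?
            final long leaderNow = propLeader;
            long supporters = received.values().stream()
                    .filter(v -> v.leader() == leaderNow).count();
            if (supporters >= quorumSize) {
                received.clear();                               // clean up for the next election
                return new Vote(propLeader, propZxid, propEpoch, myId);
            }
        }
    }

    static void broadcast(Vote v) {
        // In ZooKeeper this goes out to all voting members through QuorumCnxManager.
    }
}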
Election Algorithm: Source Code Analysis
Reading the source code mainly involves two things: reading the comments of the important classes, member variables and methods, and analyzing the business logic of the important methods.
1 Source code structure

2 Leader election source code

FastLeaderElection (partial code)
/**
* Implementation of leader election using TCP. It uses an object of the class
* QuorumCnxManager to manage connections. Otherwise, the algorithm is push-based
* as with the other UDP implementations.
*
* There are a few parameters that can be tuned to change its behavior. First,
* finalizeWait determines the amount of time to wait until deciding upon a leader.
* This is part of the leader election algorithm.
*
* [Annotation] Leader election is implemented over TCP; a QuorumCnxManager
* object manages the connections to the other servers. Apart from connection
* management, the algorithm is push-based, just like the other UDP-based
* implementations.
*
* Several parameters can be tuned to change the election behavior. The first
* one, finalizeWait (a constant defined in this class), determines how long
* to wait before deciding upon a leader.
*/
public class FastLeaderElection implements Election {
private static final Logger LOG = LoggerFactory.getLogger(FastLeaderElection.class);
/**
* Determine how much time a process has to wait
* once it believes that it has reached the end of
* leader election.
* [Annotation] Once a server believes it has reached the end of the leader
* election, this constant determines how much longer it still waits
* (200 ms by default). In practice it is the time a node waits to receive
* further election messages (Notifications) from the other nodes.
*/
final static int finalizeWait = 200;
/**
* Upper bound on the amount of time between two consecutive
* notification checks. This impacts the amount of time to get
* the system up again after long partitions. Currently 60 seconds.
* [Annotation] Upper bound on the interval between two consecutive
* notification checks; it affects how long it takes the system to recover
* after a long partition (60 seconds by default). It is effectively the upper
* limit to which the finalizeWait timeout is allowed to grow: if no leader
* has been elected after 60 seconds, notifications are sent again and a new
* round of voting starts.
*/
final static int maxNotificationInterval = 60000;
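// [Added annotation, not part of the original source] Inside lookForLeader()
// the receive timeout starts at finalizeWait and is roughly doubled every time
// the notification queue times out, capped at maxNotificationInterval, e.g.:
//     int tmpTimeOut = notTimeout * 2;
//     notTimeout = (tmpTimeOut < maxNotificationInterval ? tmpTimeOut : maxNotificationInterval);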
/**
* Connection manager. Fast leader election uses TCP for
* communication between peers, and QuorumCnxManager manages
* such connections.
* [Annotation] The connection manager class. FastLeaderElection uses TCP for
* the communication between peer servers, and QuorumCnxManager manages those
* connections.
*/
QuorumCnxManager manager;
/**
* Notifications are messages that let other peers know that
* a given peer has changed its vote, either because it has
* joined leader election or because it learned of another
* peer with higher zxid or same zxid and higher server id
* [Annotation] A Notification is a message that lets the other servers know
* that the current server has changed its vote (why change it?), either
* because it has just joined leader election (a new round, voting for itself
* first), or because it has learned of another peer with a higher zxid, or
* the same zxid but a higher server id (myid).
*/
static public class Notification {
/*
* Format version, introduced in 3.4.6
*/
public final static int CURRENTVERSION = 0x1;
int version;
/*
* Proposed leader
* the server id (myid) that this ballot proposes as leader
*/
long leader;
/*
* zxid of the proposed leader
* the largest zxid of the proposed leader
*/
long zxid;
/*
* Epoch
* the epoch of the current election round, i.e. the logical clock
*/
long electionEpoch;
/*
* current state of sender
* the state of the sender of this notification (one of the four server states)
*/
QuorumPeer.ServerState state;
/*
* Address of sender
* the server id of the sender of this notification
*/
long sid;
/*
* epoch of the proposed leader
* the epoch of the proposed leader
*/
long peerEpoch;
// ... (code omitted)
QuorumPeer self; // the server (this host) currently taking part in the election
Messenger messenger;
// logicalclock: the logical clock (election round), stored in an AtomicLong
AtomicLong logicalclock = new AtomicLong(); /* Election instance */
// the proposal this server currently recommends
long proposedLeader;
long proposedZxid;
long proposedEpoch;
// ... (code omitted)
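The fields above (logicalclock, proposedLeader/proposedZxid/proposedEpoch) and the electionEpoch carried in every Notification interact before any leader comparison is made: a notification from a newer election round makes the server catch up and restart its tally. The following is a small self-contained sketch of that rule; it is written for illustration only and is not taken from the ZooKeeper source.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Illustration only (not ZooKeeper source): how a notification from a different
// election round is reconciled with the local logical clock and vote tally.
class RoundCheckSketch {
    final AtomicLong logicalclock = new AtomicLong();  // my current election round
    long proposedLeader, proposedZxid, proposedEpoch;  // my current proposal
    final Map<Long, long[]> recvset = new HashMap<>(); // senderSid -> recorded vote

    /** Returns false if the notification is stale and should simply be ignored. */
    boolean reconcileRound(long senderElectionEpoch) {
        if (senderElectionEpoch > logicalclock.get()) {
            // The sender is in a newer round: jump to that round, throw away the
            // votes collected so far, then re-evaluate and re-broadcast my proposal.
            logicalclock.set(senderElectionEpoch);
            recvset.clear();
            return true;
        }
        if (senderElectionEpoch < logicalclock.get()) {
            // The sender is still in an older round: its ballot is ignored.
            return false;
        }
        return true; // same round: the two proposals are compared as usual
    }
}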
QuorumPeer (partial code)
/**
* This class manages the quorum protocol. There are three states this server
* can be in:
* <ol>
* <li>Leader election - each server will elect a leader (proposing itself as a
* leader initially).</li>
* <li>Follower - the server will synchronize with the leader and replicate any
* transactions.</li>
* <li>Leader - the server will process requests and forward them to followers.
* A majority of followers must log the request before it can be accepted.
* </ol>
*
* This class will setup a datagram socket that will always respond with its
* view of the current leader. The response will take the form of:
* [Annotation] This class manages the quorum protocol. A server can be in one
* of three states:
* (1) Leader election: each server elects a leader, initially proposing
*     itself as the leader (this is the LOOKING state).
* (2) Follower: the server synchronizes with the Leader and replicates all
*     transactions (the transactions here are the finally committed
*     proposals; the "tx" in zxid stands for transaction).
* (3) Leader: the server processes requests and forwards them to the
*     Followers; a majority of Followers must log a request before it can be
*     accepted (for every write request the Leader sends a proposal to all
*     Followers, and the request is only accepted for execution once a
*     majority of Followers agrees).
* <pre>
* int xid;
*
* long myid;
*
* long leader_id;
*
* long leader_zxid;
* </pre>
*
* The request for the current leader will consist solely of an xid: int xid;
*/
public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider {
private static final Logger LOG = LoggerFactory.getLogger(QuorumPeer.class);
QuorumBean jmxQuorumBean;
LocalPeerBean jmxLocalPeerBean;
LeaderElectionBean jmxLeaderElectionBean;
QuorumCnxManager qcm;
QuorumAuthServer authServer;
QuorumAuthLearner authLearner;
// VisibleForTesting. This flag is used to know whether qLearner's and
// qServer's login context has been initialized as ApacheDS has concurrency
// issues. Refer https://issues.apache.org/jira/browse/ZOOKEEPER-2712
private boolean authInitialized = false;
// ... (code omitted)
QuorumCnxManager (partial code)
/**
* This class implements a connection manager for leader election using TCP. It
* maintains one connection for every pair of servers. The tricky part is to
* guarantee that there is exactly one connection for every pair of servers that
* are operating correctly and that can communicate over the network.
* [Annotation] This class implements a connection manager for leader election
* over TCP. It maintains one connection for every pair of servers. The tricky
* part is guaranteeing that there is exactly one connection for every pair of
* servers that are operating correctly and can communicate over the network.
*
* If two servers try to start a connection concurrently, then the connection
* manager uses a very simple tie-breaking mechanism to decide which connection
* to drop based on the IP addressed of the two parties.
* [Annotation] If two servers try to open a connection to each other at the
* same time, the connection manager uses a very simple tie-breaking rule,
* based on the IP addresses of the two parties, to decide which of the two
* connections to drop.
*
*
* For every peer, the manager maintains a queue of messages to send. If the
* connection to any particular peer drops, then the sender thread puts the
* message back on the list. As this implementation currently uses a queue
* implementation to maintain messages to send to another peer, we add the
* message to the tail of the queue, thus changing the order of messages.
* Although this is not a problem for the leader election, it could be a problem
* when consolidating peer communication. This is to be verified, though.
* [Annotation] For every peer (server) the manager maintains a queue of
* messages to send. If the connection to a particular peer drops, the sender
* thread puts the message back onto that queue. Because the current
* implementation keeps the messages for a peer in a queue and appends the
* returned message to the tail, the order of messages can change. That is not
* a problem for leader election, but it could become one if this channel were
* reused for general peer communication; this still has to be verified.
*/
public class QuorumCnxManager {
private static final Logger LOG = LoggerFactory.getLogger(QuorumCnxManager.class);
/*
* Maximum capacity of thread queues
*/
static final int RECV_CAPACITY = 100;
// Initialized to 1 to prevent sending
// stale notifications to peers
static final int SEND_CAPACITY = 1;
static final int PACKETMAXSIZE = 1024 * 512;
/*
* Max buffer size to be read from the network.
*/
static public final int maxBuffer = 2048;
/*
* Negative counter for observer server ids.
*/
private AtomicLong observerCounter = new AtomicLong(-1);
/*
* Connection time out value in milliseconds
*/
private int cnxTO = 5000;
Message send queues in QuorumCnxManager

The figure above illustrates the message-queue map maintained by the QuorumCnxManager object of the server whose myid is 1. Each key in the map is the myid of a host that server 1 sends messages to, so the map never contains the server's own myid (the same holds on every other host). Each value stores copies of the messages that server 1 failed to deliver; messages that were sent successfully are not kept in the queue.
Based on whether the individual values (queues) in this map are empty, the connection state between the current Server and the rest of the cluster can be judged:
- If all queues are empty: the current Server has no connection problems with the cluster.
- If no queue is empty: the current Server has lost its connection to the cluster.
- If one particular queue is not empty: the connection between the current Server and the Server that queue belongs to has a problem.
- If at least one queue is empty: the connection between the current Server and the cluster is fine (this is the condition the implementation actually checks).
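Below is a small illustrative sketch of this structure, modeled on but not copied from QuorumCnxManager (to my understanding the real class performs the same emptiness check in its haveDelivered() method): a map keyed by the destination myid, whose single-slot queues hold the not-yet-delivered notification, plus a check that treats "at least one queue is empty" as "still connected to at least part of the cluster".

import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: per-peer send queues and the delivery check described above.
// Field and method names are modeled on QuorumCnxManager but this is not the
// original source.
class SendQueueSketch {
    static final int SEND_CAPACITY = 1;

    // key: destination server's myid, value: messages not yet delivered to it
    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();

    /** Queue a message for server sid, replacing any stale undelivered one. */
    void toSend(Long sid, ByteBuffer b) {
        ArrayBlockingQueue<ByteBuffer> q =
                queueSendMap.computeIfAbsent(sid, k -> new ArrayBlockingQueue<>(SEND_CAPACITY));
        if (!q.offer(b)) {  // queue full: drop the old message,
            q.poll();       // keep only the newest notification
            q.offer(b);
        }
    }

    /** True if at least one peer queue is empty, i.e. at least one peer is reachable. */
    boolean haveDelivered() {
        for (ArrayBlockingQueue<ByteBuffer> q : queueSendMap.values()) {
            if (q.isEmpty()) {
                return true;
            }
        }
        return false;
    }
}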
