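A Detailed Walkthrough of ZooKeeper's Fast Leader Election
Before walking through the flow, here are the roles involved in an election, along with the key classes and variables (source reference version: 3.4.9):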
Roles (server states): 1. LOOKING: campaigning, no leader known yet
2. OBSERVING: observer
3. FOLLOWING: follower
4. LEADING: leader
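Vote information:
1. logicalclock (electionEpoch): the local election round; it is incremented each time this server starts a new election round, that is, on each call to lookForLeader()
2. epoch (peerEpoch): the leader epoch; it is incremented each time an election concludes with a leader being established (it corresponds to the high 32 bits of the actual zxid)
3. zxid: transaction ID; it is incremented on every data change (the counter occupies the low 32 bits of the actual zxid, which is 64 bits in total)
4. sid: the serverId of the server that this vote belongs to
5. leader: the proposed leader (the serverId, i.e. sid, of the proposed server)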
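To make the 64-bit zxid layout concrete, here is a minimal sketch of splitting a zxid into its epoch and counter halves. The class name ZxidLayout and its methods are hypothetical helpers written for this illustration, not part of the ZooKeeper code base.

```java
// Illustrative helper only; ZxidLayout is a hypothetical name, not a ZooKeeper class.
final class ZxidLayout {
    // The epoch lives in the high 32 bits of the zxid.
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // The per-epoch counter lives in the low 32 bits of the zxid.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Combine an epoch and a counter into a single 64-bit zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 3);
        // prints "zxid=0x500000003 epoch=5 counter=3"
        System.out.println("zxid=0x" + Long.toHexString(zxid)
                + " epoch=" + epochOf(zxid)
                + " counter=" + counterOf(zxid));
    }
}
```

Because the epoch sits in the high bits, any zxid from a newer epoch compares greater than every zxid from an older epoch, regardless of the counter.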
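Vote comparison rules:
1. The vote with the larger epoch wins; if the epochs are equal, go to step 2
2. The vote with the larger zxid wins; if the zxids are equal, go to step 3
3. The vote with the larger sid wins
The source code for the comparison rules is as follows: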
- /**
- * Check if a pair (server id, zxid) succeeds our
- * current vote.
- *
- * @param id Server identifier
- * @param zxid Last zxid observed by the issuer of this vote
- */
- protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
- LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
- Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
- if(self.getQuorumVerifier().getWeight(newId) == 0){
- return false;
- }
-
- /*
- * We return true if one of the following three cases hold:
- * 1- New epoch is higher
- * 2- New epoch is the same as current epoch, but new zxid is higher
- * 3- New epoch is the same as current epoch, new zxid is the same
- * as current zxid, but server id is higher.
- */
-
- return ((newEpoch > curEpoch) ||
- ((newEpoch == curEpoch) &&
- ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
- }
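As a worked example of the three rules, the sketch below repeats the same comparison in a standalone class so it can be run without a ZooKeeper instance. It deliberately leaves out the quorum-weight check and the logging; VoteOrder and beats are hypothetical names chosen for this illustration.

```java
// Standalone restatement of the ordering used by totalOrderPredicate.
// VoteOrder and beats are hypothetical names; the weight check is omitted.
final class VoteOrder {
    // Returns true when the "new" vote should replace the "current" vote.
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid)
                        || ((newZxid == curZxid) && (newId > curId))));
    }

    public static void main(String[] args) {
        // Rule 1: a higher epoch wins even with a smaller zxid and sid.
        System.out.println(beats(1, 10, 2, 3, 99, 1)); // true
        // Rule 2: same epoch, higher zxid wins.
        System.out.println(beats(1, 20, 1, 3, 10, 1)); // true
        // Rule 3: same epoch and zxid, higher sid wins.
        System.out.println(beats(3, 10, 1, 2, 10, 1)); // true
        // Identical votes: the new vote does not replace the current one.
        System.out.println(beats(2, 10, 1, 2, 10, 1)); // false
    }
}
```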
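The overall election flow is described first. For now, ignore how the vote data is exchanged between servers and simply treat it as available; the exchange is covered later.
1. Increment logicalclock, propose this server as leader, and broadcast the proposal
2. Enter the voting loop for this round
3. Poll one vote from the recvqueue queue. If nothing arrives before the timeout, resend our own vote when all previously queued messages have been delivered, otherwise reconnect to the other servers; if a vote arrives, go to step 4
4. Check the server state carried in the vote:
LOOKING: 1. If the sender's logicalclock is greater than the local logicalclock, update the local logicalclock and clear the local vote tally recvset, compare our own candidacy with the leader named in the vote, adopt the greater of the two as the new vote, broadcast it, and continue at sub-step 4; otherwise go to sub-step 2
2. If the sender's logicalclock is less than the local logicalclock, ignore the vote and go to the next iteration of the loop; otherwise go to sub-step 3
3. If the two logicalclock values are equal, compare the currently proposed leader with the leader named in the vote; if the vote's leader is greater, adopt it as the new vote and broadcast it
4. Save the sender's vote in the local vote tally recvset and check whether the currently proposed leader holds a majority of the votes (more than half of the servers). If it does, wait up to finalizeWait (continuing to poll votes from recvqueue) to see whether anyone proposes a better leader. If so, put that vote back into recvqueue and start the next iteration of the loop; otherwise settle this server's role and end the election
OBSERVING: observers have no voting rights; ignore the vote and go to the next iteration of the loop
FOLLOWING/LEADING: 1. If the sender's logicalclock equals the local logicalclock, save the sender's vote in the local vote tally recvset, then check whether the leader named in the vote holds a majority in recvset and whether that leader is confirmed (it has itself reported the LEADING state or, if it is this server, the election epochs match). If so, settle this server's role and end the election; otherwise go to sub-step 2
2. Put the sender's vote into outofelection, the local tally of votes from servers that are no longer electing, then check whether the leader named in the vote holds a majority in outofelection and is confirmed in the same way. If so, update logicalclock to the sender's value, settle this server's role, and end the election; otherwise go to the next iteration of the loop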
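The majority check used in the steps above can be sketched as follows. This is a simplified illustration that assumes a plain "more than half" quorum and reduces each vote to the sid of the proposed leader; the real code compares complete Vote objects and delegates the decision to the configured quorum verifier. MajorityCheck and hasMajority are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified majority test; MajorityCheck is a hypothetical name.
// Assumption: every server has equal weight and a vote is just a leader sid.
final class MajorityCheck {
    // votes maps the voter's sid to the sid of the leader it proposes.
    static boolean hasMajority(Map<Long, Long> votes, long proposedLeader, int ensembleSize) {
        int supporters = 0;
        for (long leader : votes.values()) {
            if (leader == proposedLeader) {
                supporters++;
            }
        }
        // Strictly more than half of all voting servers must agree.
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> recvset = new HashMap<>();
        recvset.put(1L, 3L);
        recvset.put(2L, 3L);
        System.out.println(hasMajority(recvset, 3L, 5)); // false: 2 of 5
        recvset.put(3L, 3L);
        System.out.println(hasMajority(recvset, 3L, 5)); // true: 3 of 5
    }
}
```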
The source code of the election flow is as follows:
- /**
- * Starts a new round of leader election. Whenever our QuorumPeer
- * changes its state to LOOKING, this method is invoked, and it
- * sends notifications to all other peers.
- *
- * Annotation: starts a new round of leader election.
- * This method runs whenever this peer's state is LOOKING, and it sends leader proposals to the other peers.
- *
- */
- public Vote lookForLeader() throws InterruptedException {
- try {
- self.jmxLeaderElectionBean = new LeaderElectionBean();
- MBeanRegistry.getInstance().register(
- self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
- } catch (Exception e) {
- LOG.warn("Failed to register with JMX", e);
- self.jmxLeaderElectionBean = null;
- }
- if (self.start_fle == 0) {
- self.start_fle = System.currentTimeMillis();
- }
- try {
- //Votes tallied by this server
- HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
-
- //Votes from servers in the FOLLOWING or LEADING state, i.e. not in the LOOKING state
- HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
-
- int notTimeout = finalizeWait;
-
- //Propose ourselves as leader
- synchronized(this){
- logicalclock++;
- updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
- }
-
- LOG.info("New election. My id = " + self.getId() +
- ", proposed zxid=0x" + Long.toHexString(proposedZxid));
- sendNotifications();
-
- /*
- * Loop in which we exchange notifications until we find a leader
- *
- * Loop: exchange proposals until a leader is elected
- */
-
- while ((self.getPeerState() == ServerState.LOOKING) &&
- (!stop)){
- /*
- * Remove next notification from queue, times out after 2 times
- * the termination time
- */
- Notification n = recvqueue.poll(notTimeout,
- TimeUnit.MILLISECONDS);
-
- /*
- * Sends more notifications if haven't received enough.
- * Otherwise processes new notification.
- */
- if(n == null){
- if(manager.haveDelivered()){
- sendNotifications();
- } else {
- manager.connectAll();
- }
-
- /*
- * Exponential backoff
- */
- int tmpTimeOut = notTimeout*2;
- notTimeout = (tmpTimeOut < maxNotificationInterval?
- tmpTimeOut : maxNotificationInterval);
- LOG.info("Notification time out: " + notTimeout);
- }
- else if(self.getVotingView().containsKey(n.sid)) {
- /*
- * Only proceed if the vote comes from a replica in the
- * voting view.
- */
- switch (n.state) {
- case LOOKING:
- // If notification > current, replace and send messages out
- if (n.electionEpoch > logicalclock) {
- logicalclock = n.electionEpoch;
- recvset.clear();
- if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
- getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
- updateProposal(n.leader, n.zxid, n.peerEpoch);
- } else {
- updateProposal(getInitId(),
- getInitLastLoggedZxid(),
- getPeerEpoch());
- }
- sendNotifications();
- } else if (n.electionEpoch < logicalclock) {
- if(LOG.isDebugEnabled()){
- LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
- + Long.toHexString(n.electionEpoch)
- + ", logicalclock=0x" + Long.toHexString(logicalclock));
- }
- break;
- } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
- proposedLeader, proposedZxid, proposedEpoch)) {
- updateProposal(n.leader, n.zxid, n.peerEpoch);
- sendNotifications();
- }
-
- if(LOG.isDebugEnabled()){
- LOG.debug("Adding vote: from=" + n.sid +
- ", proposed leader=" + n.leader +
- ", proposed zxid=0x" + Long.toHexString(n.zxid) +
- ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
- }
-
- // Cache the sender's vote for the final tally
- recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
-
- if (termPredicate(recvset,
- new Vote(proposedLeader, proposedZxid,
- logicalclock, proposedEpoch))) {
-
- // Verify if there is any change in the proposed leader
- while((n = recvqueue.poll(finalizeWait,
- TimeUnit.MILLISECONDS)) != null){
- if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
- proposedLeader, proposedZxid, proposedEpoch)){
- recvqueue.put(n);
- break;
- }
- }
-
- /*
- * This predicate is true once we don't read any new
- * relevant message from the reception queue
- */
- if (n == null) {
- self.setPeerState((proposedLeader == self.getId()) ?
- ServerState.LEADING: learningState());
-
- Vote endVote = new Vote(proposedLeader,
- proposedZxid,
- logicalclock,
- proposedEpoch);
- leaveInstance(endVote);
- return endVote;
- }
- }
- break;
- case OBSERVING:
- LOG.debug("Notification from observer: " + n.sid);
- break;
- case FOLLOWING:
- case LEADING:
- /*
- * Consider all notifications from the same epoch
- * together.
- */
- if(n.electionEpoch == logicalclock){
- recvset.put(n.sid, new Vote(n.leader,
- n.zxid,
- n.electionEpoch,
- n.peerEpoch));
-
- if(ooePredicate(recvset, outofelection, n)) {
- self.setPeerState((n.leader == self.getId()) ?
- ServerState.LEADING: learningState());
-
- Vote endVote = new Vote(n.leader,
- n.zxid,
- n.electionEpoch,
- n.peerEpoch);
- leaveInstance(endVote);
- return endVote;
- }
- }
-
- /*
- * Before joining an established ensemble, verify
- * a majority is following the same leader.
- */
- outofelection.put(n.sid, new Vote(n.version,
- n.leader,
- n.zxid,
- n.electionEpoch,
- n.peerEpoch,
- n.state));
-
- if(ooePredicate(outofelection, outofelection, n)) {
- synchronized(this){
- logicalclock = n.electionEpoch;
- self.setPeerState((n.leader == self.getId()) ?
- ServerState.LEADING: learningState());
- }
- Vote endVote = new Vote(n.leader,
- n.zxid,
- n.electionEpoch,
- n.peerEpoch);
- leaveInstance(endVote);
- return endVote;
- }
- break;
- default:
- LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
- n.state, n.sid);
- break;
- }
- } else {
- LOG.warn("Ignoring notification from non-cluster member " + n.sid);
- }
- }
- return null;
- } finally {
- try {
- if(self.jmxLeaderElectionBean != null){
- MBeanRegistry.getInstance().unregister(
- self.jmxLeaderElectionBean);
- }
- } catch (Exception e) {
- LOG.warn("Failed to unregister with JMX", e);
- }
- self.jmxLeaderElectionBean = null;
- }
- }
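The "Exponential backoff" branch in the loop above doubles the poll timeout each time no notification arrives, up to a cap. The sketch below reproduces only that arithmetic. The starting value of 200 ms and the cap of 60000 ms are what I recall for finalizeWait and maxNotificationInterval in the 3.4.x source, so treat them as assumptions; NotificationBackoff is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Reproduces the timeout doubling from lookForLeader(); hypothetical class name.
final class NotificationBackoff {
    // Same rule as the source: double the timeout, but never exceed the cap.
    static int next(int current, int max) {
        int doubled = current * 2;
        return doubled < max ? doubled : max;
    }

    // Returns the first `steps` timeouts, starting from `initial`.
    static List<Integer> schedule(int initial, int max, int steps) {
        List<Integer> timeouts = new ArrayList<>();
        int timeout = initial;
        for (int i = 0; i < steps; i++) {
            timeouts.add(timeout);
            timeout = next(timeout, max);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // Assumed values: 200 ms initial wait, 60000 ms cap.
        // prints [200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 60000, 60000]
        System.out.println(schedule(200, 60000, 11));
    }
}
```

With these values an isolated server settles at polling once per minute after nine consecutive timeouts.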
[Figure: election flowchart]

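The section above covered the fast election flow. So how is data actually exchanged during an election? This is explained next.
ZooKeeper's startup script zkServer.cmd contains the following lines: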
- set ZOOMAIN=org.apache.zookeeper.server.quorum.QuorumPeerMain
- echo on
- call %JAVA% "-Dzookeeper.log.dir=%ZOO_LOG_DIR%" "-Dzookeeper.root.logger=%ZOO_LOG4J_PROP%" -cp "%CLASSPATH%" %ZOOMAIN% "%ZOOCFG%" %*
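From this we know that the startup class is org.apache.zookeeper.server.quorum.QuorumPeerMain. Tracing the code shows that the election flow lives in
the lookForLeader() method of the FastLeaderElection class, while the actual network interaction happens in the QuorumCnxManager class. [Figures: two class diagrams showing the relationship]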
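Details:
The QuorumCnxManager class is where the network interaction actually happens; it is responsible for collecting and sending vote messages over the network. As the class diagrams show, it has an inner class named Listener, which ensures there is exactly one connection between each pair of servers and starts two threads for sending and receiving vote messages: SendWorker and RecvWorker.
The FastLeaderElection class also has two inner classes responsible for sending and receiving votes: WorkerSender and WorkerReceiver.
Send path: when the election method lookForLeader() sends a vote, it puts the vote into the sendqueue queue of the FastLeaderElection class. WorkerSender (FastLeaderElection) moves the messages from sendqueue into the queueSendMap of the QuorumCnxManager class, and SendWorker (QuorumCnxManager) sends the votes in queueSendMap out over the network.
Receive path: RecvWorker (QuorumCnxManager) receives votes from the network and puts them into the recvQueue queue of the QuorumCnxManager class. WorkerReceiver (FastLeaderElection) takes the data from that recvQueue and puts it into the recvqueue queue of the FastLeaderElection class.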
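The two-stage hand-off on the send path can be sketched with plain JDK queues. This is only a single-threaded model of the data flow under simplifying assumptions: in ZooKeeper each stage runs in its own thread and the last stage writes to a socket, whereas here each stage is a method that moves one message. SendPathSketch, ToSend as defined here, and the step methods are hypothetical names for this illustration.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Single-threaded model of the send path; all names here are illustrative.
final class SendPathSketch {
    // A vote addressed to one destination server.
    static final class ToSend {
        final long sid;
        final String payload;

        ToSend(long sid, String payload) {
            this.sid = sid;
            this.payload = payload;
        }
    }

    // Stands in for the sendqueue of the election class.
    final BlockingQueue<ToSend> sendqueue = new LinkedBlockingQueue<>();
    // Stands in for queueSendMap: one outgoing queue per destination sid.
    final Map<Long, BlockingQueue<String>> queueSendMap = new ConcurrentHashMap<>();

    // Stage 1 (the WorkerSender role): move one message into the per-peer queue.
    boolean workerSenderStep() {
        ToSend message = sendqueue.poll();
        if (message == null) {
            return false;
        }
        queueSendMap.computeIfAbsent(message.sid, k -> new LinkedBlockingQueue<>())
                .add(message.payload);
        return true;
    }

    // Stage 2 (the SendWorker role): take the next message bound for one peer.
    // Returns null when nothing is pending; the real worker would write to the socket.
    String sendWorkerStep(long sid) {
        BlockingQueue<String> queue = queueSendMap.get(sid);
        return queue == null ? null : queue.poll();
    }

    public static void main(String[] args) {
        SendPathSketch path = new SendPathSketch();
        path.sendqueue.add(new ToSend(2L, "vote for 3"));
        path.workerSenderStep();
        System.out.println(path.sendWorkerStep(2L)); // vote for 3
    }
}
```

The receive path mirrors this in the opposite direction: one queue filled by the network-facing worker and drained by the election-facing worker.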
I made a copy of the 3.4.9 source code and added a few comments: https://github.com/learnertogether/zookeeper-3.4.9.git