Election Algorithm: Source Code Approach
ZooKeeper's Leader election class is FastLeaderElection; it is the engineering application of the ZAB protocol's Leader election, so we go straight to this class for analysis. Its most important method is lookForLeader(), the core method of Leader election. The method roughly breaks down into the following parts:
1 Preparation before the election
Before the election some preparation work is needed, for example creating the election objects, creating the collections used during the election, and initializing the election time limit.
2 Cast the initial ballot for itself as Leader
When the current Server casts its first vote, it first proposes itself as Leader and then broadcasts its own ballot to all other Servers.
3 Check whether its own ballot or a received ballot proposes a better Leader
After this "vote for myself" step, the current Server also receives the ballot notifications (Notification) sent by the other Servers. In a while loop it goes through all received notifications and compares which candidate is better suited to be Leader. If it finds a candidate more suitable than its own, it updates its own ballot and broadcasts the new ballot again. Of course, every ballot that has been checked is also recorded in a collection, which is later used for tallying the votes.
4 Decide whether this round of the election can end
In fact, right after each such "who is better suited to be Leader" check, the Server immediately determines whether the current election can already end, i.e. whether the ballot this host currently recommends has received more than half of the votes. If so, it directly performs the remaining clean-up work, for example clearing the collections used during the election so they are ready for the next one, and generating the final ballot so that other Servers can synchronize data against it. If not, it reads the next ballot sent by another host from the queue and checks it in the same way (the sketch below models these four steps).
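To make these four steps concrete, here is a minimal, self-contained sketch of the control flow. It is not ZooKeeper code: the Vote record, the incoming queue and the fixed quorum size are stand-ins, and details such as election-round handling, server states and timeout back-off are left out; the comparison rule in betterThan() only mirrors the order FastLeaderElection uses (epoch, then zxid, then server id).

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class LookForLeaderSketch {

    /** A simplified ballot: the proposed leader, that leader's zxid/epoch, and who sent it. */
    record Vote(long leader, long zxid, long peerEpoch, long senderSid) {}

    /** Step 3's comparison rule: higher epoch wins, then higher zxid, then higher server id. */
    static boolean betterThan(Vote n, long curLeader, long curZxid, long curEpoch) {
        return (n.peerEpoch() > curEpoch)
                || (n.peerEpoch() == curEpoch && n.zxid() > curZxid)
                || (n.peerEpoch() == curEpoch && n.zxid() == curZxid && n.leader() > curLeader);
    }

    static Vote lookForLeader(long myId, long myZxid, long myEpoch,
                              BlockingQueue<Vote> incoming,
                              int quorumSize) throws InterruptedException {
        // Step 1: prepare the collection used to tally the votes.
        Map<Long, Vote> received = new HashMap<>();          // senderSid -> latest vote seen

        // Step 2: propose myself first and broadcast that ballot
        // (ZooKeeper also queues a notification to itself, so its own vote is tallied too).
        long propLeader = myId, propZxid = myZxid, propEpoch = myEpoch;
        broadcast(new Vote(propLeader, propZxid, propEpoch, myId));

        while (true) {
            Vote n = incoming.poll(200, TimeUnit.MILLISECONDS); // finalizeWait-like timeout
            if (n == null) {
                continue;                                       // nothing received yet, keep waiting
            }

            // Step 3: if the incoming ballot proposes a better leader, adopt it and rebroadcast.
            if (betterThan(n, propLeader, propZxid, propEpoch)) {
                propLeader = n.leader();
                propZxid = n.zxid();
                propEpoch = n.peerEpoch();
                broadcast(new Vote(propLeader, propZxid, propEpoch, myId));
            }
            received.put(n.senderSid(), n);                     // record every checked ballot

            // Step 4: does my current proposal already have more than half of the votes?
            final long leaderNow = propLeader;
            long supporters = received.values().stream()
                    .filter(v -> v.leader() == leaderNow).count();
            if (supporters >= quorumSize) {
                received.clear();                               // clean up for the next election
                return new Vote(propLeader, propZxid, propEpoch, myId);
            }
        }
    }

    static void broadcast(Vote v) {
        // In ZooKeeper this goes out to all voting members through QuorumCnxManager.
    }
}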
Election Algorithm: Source Code Analysis
Reading the source code mainly involves two things: reading the comments of the important classes, member variables and methods, and analyzing the business logic of the important methods.
1 Source code structure

2 Leader election source code

FastLeaderElection (partial code)
/**
* Implementation of leader election using TCP. It uses an object of the class
* QuorumCnxManager to manage connections. Otherwise, the algorithm is push-based
* as with the other UDP implementations.
*
* There are a few parameters that can be tuned to change its behavior. First,
* finalizeWait determines the amount of time to wait until deciding upon a leader.
* This is part of the leader election algorithm.
*
* [Annotation] Leader election is implemented over TCP; a QuorumCnxManager
* object manages the connections to the other servers. Apart from connection
* management, the algorithm is push-based, just like the other UDP-based
* implementations.
*
* Several parameters can be tuned to change the election behavior. The first
* one, finalizeWait (a constant defined in this class), determines how long
* to wait before deciding upon a leader.
*/
public class FastLeaderElection implements Election {
private static final Logger LOG = LoggerFactory.getLogger(FastLeaderElection.class);
/**
* Determine how much time a process has to wait
* once it believes that it has reached the end of
* leader election.
* [Annotation] Once a server believes it has reached the end of the leader
* election, this constant determines how much longer it still waits
* (200 ms by default). In practice it is the time a node waits to receive
* further election messages (Notifications) from the other nodes.
*/
final static int finalizeWait = 200;
/**
* Upper bound on the amount of time between two consecutive
* notification checks. This impacts the amount of time to get
* the system up again after long partitions. Currently 60 seconds.
* [Annotation] Upper bound on the interval between two consecutive
* notification checks; it affects how long it takes the system to recover
* after a long partition (60 seconds by default). It is effectively the upper
* limit to which the finalizeWait timeout is allowed to grow: if no leader
* has been elected after 60 seconds, notifications are sent again and a new
* round of voting starts.
*/
final static int maxNotificationInterval = 60000;
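// [Added annotation, not part of the original source] Inside lookForLeader()
// the receive timeout starts at finalizeWait and is roughly doubled every time
// the notification queue times out, capped at maxNotificationInterval, e.g.:
//     int tmpTimeOut = notTimeout * 2;
//     notTimeout = (tmpTimeOut < maxNotificationInterval ? tmpTimeOut : maxNotificationInterval);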
/**
* Connection manager. Fast leader election uses TCP for
* communication between peers, and QuorumCnxManager manages
* such connections.
* [Annotation] The connection manager class. FastLeaderElection uses TCP for
* the communication between peer servers, and QuorumCnxManager manages those
* connections.
*/
QuorumCnxManager manager;
/**
* Notifications are messages that let other peers know that
* a given peer has changed its vote, either because it has
* joined leader election or because it learned of another
* peer with higher zxid or same zxid and higher server id
* [Annotation] A Notification is a message that lets the other servers know
* that the current server has changed its vote (why change it?), either
* because it has just joined leader election (a new round, voting for itself
* first), or because it has learned of another peer with a higher zxid, or
* the same zxid but a higher server id (myid).
*/
static public class Notification {
/*
* Format version, introduced in 3.4.6
*/
public final static int CURRENTVERSION = 0x1;
int version;
/*
* Proposed leader
* the server id (myid) that this ballot proposes as leader
*/
long leader;
/*
* zxid of the proposed leader
* the largest zxid of the proposed leader
*/
long zxid;
/*
* Epoch
* the epoch of the current election round, i.e. the logical clock
*/
long electionEpoch;
/*
* current state of sender
* the state of the sender of this notification (one of the four server states)
*/
QuorumPeer.ServerState state;
/*
* Address of sender
* the server id of the sender of this notification
*/
long sid;
/*
* epoch of the proposed leader
* the epoch of the proposed leader
*/
long peerEpoch;
// ... (code omitted)
QuorumPeer self; // the server (this host) currently taking part in the election
Messenger messenger;
// logicalclock: the logical clock (election round), stored in an AtomicLong
AtomicLong logicalclock = new AtomicLong(); /* Election instance */
// the proposal this server currently recommends
long proposedLeader;
long proposedZxid;
long proposedEpoch;
// ... (code omitted)
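The fields above (logicalclock, proposedLeader/proposedZxid/proposedEpoch) and the electionEpoch carried in every Notification interact before any leader comparison is made: a notification from a newer election round makes the server catch up and restart its tally. The following is a small self-contained sketch of that rule; it is written for illustration only and is not taken from the ZooKeeper source.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Illustration only (not ZooKeeper source): how a notification from a different
// election round is reconciled with the local logical clock and vote tally.
class RoundCheckSketch {
    final AtomicLong logicalclock = new AtomicLong();  // my current election round
    long proposedLeader, proposedZxid, proposedEpoch;  // my current proposal
    final Map<Long, long[]> recvset = new HashMap<>(); // senderSid -> recorded vote

    /** Returns false if the notification is stale and should simply be ignored. */
    boolean reconcileRound(long senderElectionEpoch) {
        if (senderElectionEpoch > logicalclock.get()) {
            // The sender is in a newer round: jump to that round, throw away the
            // votes collected so far, then re-evaluate and re-broadcast my proposal.
            logicalclock.set(senderElectionEpoch);
            recvset.clear();
            return true;
        }
        if (senderElectionEpoch < logicalclock.get()) {
            // The sender is still in an older round: its ballot is ignored.
            return false;
        }
        return true; // same round: the two proposals are compared as usual
    }
}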
QuorumPeer (partial code)
/**
* This class manages the quorum protocol. There are three states this server
* can be in:
* <ol>
* <li>Leader election - each server will elect a leader (proposing itself as a
* leader initially).</li>
* <li>Follower - the server will synchronize with the leader and replicate any
* transactions.</li>
* <li>Leader - the server will process requests and forward them to followers.
* A majority of followers must log the request before it can be accepted.
* </ol>
*
* This class will setup a datagram socket that will always respond with its
* view of the current leader. The response will take the form of:
* [Annotation] This class manages the quorum protocol. A server can be in one
* of three states:
* (1) Leader election: each server elects a leader, initially proposing
*     itself as the leader (this is the LOOKING state).
* (2) Follower: the server synchronizes with the Leader and replicates all
*     transactions (the transactions here are the finally committed
*     proposals; the "tx" in zxid stands for transaction).
* (3) Leader: the server processes requests and forwards them to the
*     Followers; a majority of Followers must log a request before it can be
*     accepted (for every write request the Leader sends a proposal to all
*     Followers, and the request is only accepted for execution once a
*     majority of Followers agrees).
* <pre>
* int xid;
*
* long myid;
*
* long leader_id;
*
* long leader_zxid;
* </pre>
*
* The request for the current leader will consist solely of an xid: int xid;
*/
public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider {
private static final Logger LOG = LoggerFactory.getLogger(QuorumPeer.class);
QuorumBean jmxQuorumBean;
LocalPeerBean jmxLocalPeerBean;
LeaderElectionBean jmxLeaderElectionBean;
QuorumCnxManager qcm;
QuorumAuthServer authServer;
QuorumAuthLearner authLearner;
// VisibleForTesting. This flag is used to know whether qLearner's and
// qServer's login context has been initialized as ApacheDS has concurrency
// issues. Refer https://issues.apache.org/jira/browse/ZOOKEEPER-2712
private boolean authInitialized = false;
// ... (code omitted)
QuorumCnxManager (partial code)
/**
* This class implements a connection manager for leader election using TCP. It
* maintains one connection for every pair of servers. The tricky part is to
* guarantee that there is exactly one connection for every pair of servers that
* are operating correctly and that can communicate over the network.
* [Annotation] This class implements a connection manager for leader election
* over TCP. It maintains one connection for every pair of servers. The tricky
* part is guaranteeing that there is exactly one connection for every pair of
* servers that are operating correctly and can communicate over the network.
*
* If two servers try to start a connection concurrently, then the connection
* manager uses a very simple tie-breaking mechanism to decide which connection
* to drop based on the IP addressed of the two parties.
* [Annotation] If two servers try to open a connection to each other at the
* same time, the connection manager uses a very simple tie-breaking rule,
* based on the IP addresses of the two parties, to decide which of the two
* connections to drop.
*
*
* For every peer, the manager maintains a queue of messages to send. If the
* connection to any particular peer drops, then the sender thread puts the
* message back on the list. As this implementation currently uses a queue
* implementation to maintain messages to send to another peer, we add the
* message to the tail of the queue, thus changing the order of messages.
* Although this is not a problem for the leader election, it could be a problem
* when consolidating peer communication. This is to be verified, though.
* [Annotation] For every peer (server) the manager maintains a queue of
* messages to send. If the connection to a particular peer drops, the sender
* thread puts the message back onto that queue. Because the current
* implementation keeps the messages for a peer in a queue and appends the
* returned message to the tail, the order of messages can change. That is not
* a problem for leader election, but it could become one if this channel were
* reused for general peer communication; this still has to be verified.
*/
public class QuorumCnxManager {
private static final Logger LOG = LoggerFactory.getLogger(QuorumCnxManager.class);
/*
* Maximum capacity of thread queues
*/
static final int RECV_CAPACITY = 100;
// Initialized to 1 to prevent sending
// stale notifications to peers
static final int SEND_CAPACITY = 1;
static final int PACKETMAXSIZE = 1024 * 512;
/*
* Max buffer size to be read from the network.
*/
static public final int maxBuffer = 2048;
/*
* Negative counter for observer server ids.
*/
private AtomicLong observerCounter = new AtomicLong(-1);
/*
* Connection time out value in milliseconds
*/
private int cnxTO = 5000;
Message send queues in QuorumCnxManager

The figure above illustrates the message-queue map maintained by the QuorumCnxManager object of the server whose myid is 1. Each key in the map is the myid of a host that server 1 sends messages to, so the map never contains the server's own myid (the same holds on every other host). Each value stores copies of the messages that server 1 failed to deliver; messages that were sent successfully are not kept in the queue.
Based on whether the individual values (queues) in this map are empty, the connection state between the current Server and the rest of the cluster can be judged:
- If all queues are empty: the current Server has no connection problems with the cluster.
- If no queue is empty: the current Server has lost its connection to the cluster.
- If one particular queue is not empty: the connection between the current Server and the Server that queue belongs to has a problem.
- If at least one queue is empty: the connection between the current Server and the cluster is fine (this is the condition the implementation actually checks).
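Below is a small illustrative sketch of this structure, modeled on but not copied from QuorumCnxManager (to my understanding the real class performs the same emptiness check in its haveDelivered() method): a map keyed by the destination myid, whose single-slot queues hold the not-yet-delivered notification, plus a check that treats "at least one queue is empty" as "still connected to at least part of the cluster".

import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: per-peer send queues and the delivery check described above.
// Field and method names are modeled on QuorumCnxManager but this is not the
// original source.
class SendQueueSketch {
    static final int SEND_CAPACITY = 1;

    // key: destination server's myid, value: messages not yet delivered to it
    final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();

    /** Queue a message for server sid, replacing any stale undelivered one. */
    void toSend(Long sid, ByteBuffer b) {
        ArrayBlockingQueue<ByteBuffer> q =
                queueSendMap.computeIfAbsent(sid, k -> new ArrayBlockingQueue<>(SEND_CAPACITY));
        if (!q.offer(b)) {  // queue full: drop the old message,
            q.poll();       // keep only the newest notification
            q.offer(b);
        }
    }

    /** True if at least one peer queue is empty, i.e. at least one peer is reachable. */
    boolean haveDelivered() {
        for (ArrayBlockingQueue<ByteBuffer> q : queueSendMap.values()) {
            if (q.isEmpty()) {
                return true;
            }
        }
        return false;
    }
}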
