zookeeper(四)领导者选举

先上图:
在这里插入图片描述
首先看到这个图肯定就会很懵,确实比较多,体系结构也比较复杂,目前但是如果是根据源码调试加上这篇博客应该问题就不是很大。

我们首先从判定为集群开始说起:即从org.apache.zookeeper.server.quorum.QuorumPeerMain#runFromConfig开始。这里重点说一下quorumPeer.start();方法。

    @Override
    public synchronized void start() {
        //加载事务和快照。其实就是恢复的意思。主要就是把数据加载到内存,拿到纪元号之类的
        loadDataBase();
        //就是启动thread线程,跟单机的那个线程做的一样的事
        cnxnFactory.start();
        //领导者选举。(重要)
        //https://www.cnblogs.com/johnvwan/p/9546909.html 这篇博客讲得很好。
        startLeaderElection();
        //使用选举算法选出领导并开始同步。
        super.start();
    }

前面2个不说,就跟单机启动一模一样。重点是第三个和第四个。
跟踪到这个方法中:org.apache.zookeeper.server.quorum.QuorumPeer#startLeaderElection

    synchronized public void startLeaderElection() {
    	try {
    	    //参数:1、myid。2.最新的zxid。3、当前的纪元号
            //首先投自己一票。在投票箱中写上自己的信息。
    		currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
    	} catch(IOException e) {
    		RuntimeException re = new RuntimeException(e.getMessage());
    		re.setStackTrace(e.getStackTrace());
    		throw re;
    	}
    	//getView().values()就是获取到配置文件中的所有以server.开头的信息。包括observer
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {//如果id号是自己
                //为什么要有这个判断,因为你不但要告诉其它服务器投的谁,你还要告诉其它服务器是你投的
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        //创建选举算法,默认是3.electionAlg可以更改。
        //此外还有0,1,2.但是现在一般都不推荐了。所以一般不会动
        //步骤如下:
        //1、初始化QuorumCnxManager
        //2、初始化QuorumCnxManager.Listener
        //3、运行QuorumCnxManager.Listener
        //4、运行QuorumCnxManager
        //5、返回FastLeaderElection对象
        //流程如下:
        //1.把自己的投票放入queueSendMap中:用于发送自己的投票信息。首先肯定是投自己。
        //2.queueSendMap中发送给其它服务器,如果是自己这一台就直接放到recvQueue中,如果是其它服务器就通过socket发送投票信息。
        //3.不断的检测recvQueue中的投票信息,如果在某一时刻recvQueue中过半的服务器(通过sid标识),那个被投票的那个人就设置为leader,其它就设置为follower。
        //服务器连接方式id大的去连接id小的,不允许小的连接大的。
        this.electionAlg = createElectionAlgorithm(electionType);
    }

这个方法的整体逻辑已经在方法注释中体现出来了。最复杂的是哪个方法呢?看注释就知道是createElectionAlgorithm这个方法了。

    protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;

        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
            case 0:
                le = new LeaderElection(this);
                break;
            case 1:
                le = new AuthFastLeaderElection(this);
                break;
            case 2:
                le = new AuthFastLeaderElection(this, true);
                break;
            case 3:
                //创建一个QuorumCnxManager,里面存在几个重要的属性。
                //ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap:
                //ConcurrentHashMap<Long, SendWorker> senderWorkerMap:就是用来记录其他服务器id以及对应的SendWorker的
                //ArrayBlockingQueue<Message> recvQueue:保存选票
                //QuorumCnxManager.Listener
                qcm = createCnxnManager();
                //姑且认为就是一个监听器。因为是一个线程。
                QuorumCnxManager.Listener listener = qcm.listener;
                if(listener != null){
                    listener.start();//启动监听器
                    //这里面比较复杂
                    le = new FastLeaderElection(this, qcm);
                } else {
                    LOG.error("Null listener when initializing cnx manager");
                }
                break;
            default:
                assert false;
        }
        return le;
    }

由于0 1 2都是不建议使用了,所以咱们就直接看3的情况。当然这里面哪里很复杂呢?当然是le = new FastLeaderElection(this, qcm);这个了。这个也是今天的重点,快速领导者选举,会很复杂,我们一步一步来。

FastLeaderElection初始化

    public FastLeaderElection(QuorumPeer self, QuorumCnxManager manager){
        this.stop = false;
        this.manager = manager;
        starter(self, manager);
    }

干了什么呢?就是初始化的赋值,然后进入starter方法。好像也没什么,我们看看这个方法。

starter(self, manager)

    private void starter(QuorumPeer self, QuorumCnxManager manager) {
        this.self = self;
        proposedLeader = -1;
        proposedZxid = -1;

        sendqueue = new LinkedBlockingQueue<ToSend>();
        recvqueue = new LinkedBlockingQueue<Notification>();
        this.messenger = new Messenger(manager);
    }

前面3行就是初始化的。重点是后面3行。看起来好像也没有什么,但是意义非常重大。你现在不需要记住什么,就只需要记住一点,初始化了两个queue。sendqueue 和recvqueue ,用处顾名思义。牢记,后面能用上。

this.messenger = new Messenger(manager)

        Messenger(QuorumCnxManager manager) {

            this.ws = new WorkerSender(manager);

            Thread t = new Thread(this.ws,
                    "WorkerSender[myid=" + self.getId() + "]");
            t.setDaemon(true);
            t.start();

            this.wr = new WorkerReceiver(manager);

            t = new Thread(this.wr,
                    "WorkerReceiver[myid=" + self.getId() + "]");
            t.setDaemon(true);
            t.start();
        }

看起也简单。new WorkerSender和new WorkerReceiver而已,然后就开子线程启动了。那么现在开始涉及到多线程了。我们先看WorkerSender,再看WorkerReceiver。

WorkerSender

            public void run() {
                while (!stop) {
                    try {
                        //这里取了是把所有的参与者都放进去了。什么时候放进去的呢?
                        // 通过sendNotifications函数放进去的。
                        ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                        if(m == null) continue;
                        //开始处理
                        process(m);
                    } catch (InterruptedException e) {
                        break;
                    }
                }
                LOG.info("WorkerSender is down");
            }

sendqueue有印象没有,前面初始化提到过哦。这个是取方法。那么肯定就有放的方法。从哪里放的呢?当然你看到了我的注释。但是啥时候调用的这个方法?神奇不?
当然调试就知道啦!
一直往前推就能追溯到这个方法:org.apache.zookeeper.server.quorum.QuorumPeer#run,这个方式是什么时候启动的呢?第一个代码段的super.start();方法启动的。我们看这段代码:

    public void run() {
        setName("QuorumPeer" + "[myid=" + getId() + "]" +
                cnxnFactory.getLocalAddress());

        System.out.println("当前线程:" + Thread.currentThread().getName());
        LOG.debug("Starting quorum peer");
        //这个try的内容看不懂要干啥?
        try {
            jmxQuorumBean = new QuorumBean(this);
            MBeanRegistry.getInstance().register(jmxQuorumBean, null);
            for(QuorumServer s: getView().values()){
                ZKMBeanInfo p;
                if (getId() == s.id) {
                    p = jmxLocalPeerBean = new LocalPeerBean(this);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                        jmxLocalPeerBean = null;
                    }
                } else {
                    p = new RemotePeerBean(s);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                    }
                }
            }
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            jmxQuorumBean = null;
        }

        try {
            /*
             * Main loop
             */
            //这里我有一个误区,我以为break是直接跳出了while循环,但实际是只是跳出了switch,还是基础不够扎实。break不但用于跳出循环,也用于跳出switch。
            //测试文件:com.xq.test.TestWhile
            while (running) {
                switch (getPeerState()) {
                case LOOKING:
                    LOG.info("LOOKING");

                    if (Boolean.getBoolean("readonlymode.enabled")) {//如果开启了只读
                        LOG.info("Attempting to start ReadOnlyZooKeeperServer");

                        // Create read-only server but don't start it immediately
                        final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
                                logFactory, this,
                                new ZooKeeperServer.BasicDataTreeBuilder(),
                                this.zkDb);
    
                        // Instead of starting roZk immediately, wait some grace
                        // period before we decide we're partitioned.
                        //
                        // Thread is used here because otherwise it would require
                        // changes in each of election strategy classes which is
                        // unnecessary code coupling.
                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
                                } catch (InterruptedException e) {
                                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                                } catch (Exception e) {
                                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                                }
                            }
                        };
                        try {
                            roZkMgr.start();
                            setBCVote(null);
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception",e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                            // If the thread is in the the grace period, interrupt
                            // to come out of waiting.
                            roZkMgr.interrupt();
                            roZk.shutdown();
                        }
                    } else {//没有开启只读
                        try {
                            setBCVote(null);
                            //不断更新投票,直到选出leader
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }
                    }
                    break;
                case OBSERVING:
                    try {
                        LOG.info("OBSERVING");
                        setObserver(makeObserver(logFactory));
                        observer.observeLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e );                        
                    } finally {
                        observer.shutdown();
                        setObserver(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                case FOLLOWING:
                    try {
                        LOG.info("FOLLOWING");
                        //产出一个适合follower的类
                        setFollower(makeFollower(logFactory));
                        follower.followLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        follower.shutdown();
                        setFollower(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                case LEADING:
                    LOG.info("LEADING");
                    try {
                        //产出一个适合leader的类
                        setLeader(makeLeader(logFactory));
                        //这个只是选择过半能接受的epoch,假设有更新的zxid没有启动
                        //在这个确定epoch的时候它启动了,那么这个leader会直接丢失多的数据吗?看代码好像是的
                        leader.lead();
                        setLeader(null);
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        if (leader != null) {
                            leader.shutdown("Forcing shutdown");
                            setLeader(null);
                        }
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                }
            }
        } finally {
            LOG.warn("QuorumPeer main thread exited");
            try {
                MBeanRegistry.getInstance().unregisterAll();
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            jmxQuorumBean = null;
            jmxLocalPeerBean = null;
        }
    }

从长度和注释就知道这是一块非常核心的代码。因为它就是真正的选举核心以及选举完了之后的角色分配。这段代码实在是非常复杂,建议结合下面的图理解。
为了便于理解再用文字补充一下:
我们所了解到的WorkerSender线程是从sendqueue取的数据。而我们现在就是在看是那里放的数据?放的应该是什么数据?这个应该理解一下就能想到是你投票的数据了。关于多线程这块需要有一定的基础才方便理解。
在这里插入图片描述
现在我们回到WorkerSender线程里面,再来回顾一下这段代码:

            public void run() {
                while (!stop) {
                    try {
                        //这里取了是把所有的参与者都放进去了。什么时候放进去的呢?
                        // 通过sendNotifications函数放进去的。
                        ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                        if(m == null) continue;
                        //开始处理
                        process(m);
                    } catch (InterruptedException e) {
                        break;
                    }
                }
                LOG.info("WorkerSender is down");
            }

取出来之后重点就是process(m);处理了。

            void process(ToSend m) {
                ByteBuffer requestBuffer = buildMsg(m.state.ordinal(), 
                                                        m.leader,
                                                        m.zxid, 
                                                        m.electionEpoch, 
                                                        m.peerEpoch);
                //构造除byte的数组发送消息
                manager.toSend(m.sid, requestBuffer);
            }

主要是看怎么发送的,定位到 manager.toSend(m.sid, requestBuffer)。

    public void toSend(Long sid, ByteBuffer b) {
        /*
         * If sending message to myself, then simply enqueue it (loopback).
         */
        if (this.mySid == sid) {//如果是本机的话
             b.position(0);
             //直接添加到recvQueue中
             addToRecvQueue(new Message(b.duplicate(), sid));
            /*
             * Otherwise send to the corresponding thread to send.
             */
        } else {//不是本机的话
             /*
              * Start a new connection if doesn't have one already.
              */
             ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
             //putIfAbsent 如果传入key对应的value已经存在,就返回存在的value,不进行替换。如果不存在,就添加key和value,返回null
             ArrayBlockingQueue<ByteBuffer> bqExisting = queueSendMap.putIfAbsent(sid, bq);
             if (bqExisting != null) {//如果不是null,说明已经存在了。证明已经投过一票了。就把b加入到bqExisting中
                 addToSendQueue(bqExisting, b);
             } else {//如果为null,说明目前为止还没有投过。就把b加入到bq中。为啥添加到不同的队列中?
                 addToSendQueue(bq, b);
             }
             connectOne(sid);
                
        }
    }

这段代码有细节没懂,不过不是很影响大局。
对照着图下一个应该是SendWorker线程了。这个是啥时候被造出来的?
这个时候分为两种情况。
1.当前就一台启动了,因为目前就一台,所以不需要像其它服务器发消息,所以这个时候并不会造出这个线程,而如果第二台起来了,它才会造这个线程。下面给个调用链。它是怎么实现的?就是不断的探测能否与其它服务器通信,只有能通信才造线程。
org.apache.zookeeper.server.quorum.QuorumPeer#run—>org.apache.zookeeper.server.quorum.QuorumPeer#setCurrentVote–>org.apache.zookeeper.server.quorum.QuorumCnxManager#connectAll–>org.apache.zookeeper.server.quorum.QuorumCnxManager#connectOne–>org.apache.zookeeper.server.quorum.QuorumCnxManager#initiateConnection–>org.apache.zookeeper.server.quorum.QuorumCnxManager#startConnection

2.当前不是第一个启动的,即直接就能与其它服务器通信。
这个又需要追溯到前面了。QuorumCnxManager.Listener listener = qcm.listener;这个在本页面搜一下。在启动listener之后。在这里面就已经造好了SendWorker和RecvWorker线程。这个又是怎么实现的?因为一启动就打开了端口,此时另外一台在向这边发数据,所以是被迫造的线程!
org.apache.zookeeper.server.quorum.QuorumCnxManager.Listener#run–>org.apache.zookeeper.server.quorum.QuorumCnxManager#receiveConnection–>org.apache.zookeeper.server.quorum.QuorumCnxManager#handleConnection
RecvWorker线程也是在这个时候造出来的。
WorkerReceiver也比较类似。虽然细节不一样,但是大体流程也应该清楚了。
主要是要对投票的逻辑图熟悉。在看这些稍微好点。

为了完整性再来看看接收流程。就从前面的WorkerReceiver开始。

        Messenger(QuorumCnxManager manager) {

            this.ws = new WorkerSender(manager);

            Thread t = new Thread(this.ws,
                    "WorkerSender[myid=" + self.getId() + "]");
            t.setDaemon(true);
            t.start();

            this.wr = new WorkerReceiver(manager);

            t = new Thread(this.wr,
                    "WorkerReceiver[myid=" + self.getId() + "]");
            t.setDaemon(true);
            t.start();
        }

这个线程对应的run方法也是非常复杂,因为按照常理,我们每接到一个更新的投票就要做统计。统计什么呢?统计是否过半了,以及对其它情况的处理。
因为代码太多了。我们先对照流程图来看。
从流程图上可以看到的是WorkerReceiver线程是从recvQueue中取数据,并放到recvqueue中。也就是说WorkerReceiver做了一个中间转接的过程。知道的大体的情况我们再来看详细的代码。

          public void run() {

                Message response;
                while (!stop) {
                    // Sleeps on receive
                    try{
                        //从recvQueue中取出消息。
                        response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
                        if(response == null) continue;

                        /*
                         * If it is from an observer, respond right away.
                         * Note that the following predicate assumes that
                         * if a server is not a follower, then it must be
                         * an observer. If we ever have any other type of
                         * learner in the future, we'll have to change the
                         * way we check for observers.
                         */
                        //判断能否投票,因为observer是不能投票的,所以先验证是不是参与者
                        if(!validVoter(response.sid)){//如果不是参与者(如观察者)
                            Vote current = self.getCurrentVote();
                            ToSend notmsg = new ToSend(ToSend.mType.notification,
                                    current.getId(),
                                    current.getZxid(),
                                    logicalclock.get(),
                                    self.getPeerState(),
                                    response.sid,
                                    current.getPeerEpoch());
                            //直接把接收到的消息发出去。
                            sendqueue.offer(notmsg);
                        } else {//如果是参与者
                            // Receive new message
                            if (LOG.isDebugEnabled()) {
                                LOG.debug("Receive new notification message. My id = "
                                        + self.getId());
                            }

                            /*
                             * We check for 28 bytes for backward compatibility
                             */
                            //兼容性检查
                            if (response.buffer.capacity() < 28) {
                                LOG.error("Got a short response: "
                                        + response.buffer.capacity());
                                continue;
                            }
                            boolean backCompatibility = (response.buffer.capacity() == 28);
                            //为啥要清空?
                            response.buffer.clear();

                            // Instantiate Notification and set its attributes
                            Notification n = new Notification();
                            
                            // State of peer that sent this message
                            QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
                            switch (response.buffer.getInt()) {
                            case 0:
                                ackstate = QuorumPeer.ServerState.LOOKING;
                                break;
                            case 1:
                                ackstate = QuorumPeer.ServerState.FOLLOWING;
                                break;
                            case 2:
                                ackstate = QuorumPeer.ServerState.LEADING;
                                break;
                            case 3:
                                ackstate = QuorumPeer.ServerState.OBSERVING;
                                break;
                            default:
                                continue;
                            }
                            
                            n.leader = response.buffer.getLong();
                            n.zxid = response.buffer.getLong();
                            n.electionEpoch = response.buffer.getLong();
                            n.state = ackstate;
                            n.sid = response.sid;
                            if(!backCompatibility){
                                n.peerEpoch = response.buffer.getLong();
                            } else {
                                if(LOG.isInfoEnabled()){
                                    LOG.info("Backward compatibility mode, server id=" + n.sid);
                                }
                                n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
                            }

                            /*
                             * Version added in 3.4.6
                             */

                            n.version = (response.buffer.remaining() >= 4) ? 
                                         response.buffer.getInt() : 0x0;

                            /*
                             * Print notification info
                             */
                            if(LOG.isInfoEnabled()){
                                printNotification(n);
                            }

                            /*
                             * If this server is looking, then send proposed leader
                             */

                            if(self.getPeerState() == QuorumPeer.ServerState.LOOKING) {//如果自己也在寻找leader的话
                                recvqueue.offer(n);//就放到recvqueue中

                                /*
                                 * Send a notification back if the peer that sent this
                                 * message is also looking and its logical clock is
                                 * lagging behind.
                                 */
                                //如果发送此消息的对方也正在LOOKING,并且其logical clock落后,则发送通知。
                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock.get())){
                                    Vote v = getVote();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock.get(),
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch());
                                    sendqueue.offer(notmsg);
                                }
                            } else {//如果自己并不是寻找leader的话,则将自己认为的leader发送给对方。
                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                Vote current = self.getCurrentVote();
                                System.out.println("当前认可的sid:"+current.getId());
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id =  " +
                                                self.getId() + " recipient=" +
                                                response.sid + " zxid=0x" +
                                                Long.toHexString(current.getZxid()) +
                                                " leader=" + current.getId());
                                    }
                                    
                                    ToSend notmsg;
                                    if(n.version > 0x0) {
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                current.getId(),
                                                current.getZxid(),
                                                current.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                current.getPeerEpoch());
                                        
                                    } else {
                                        Vote bcVote = self.getBCVote();
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                bcVote.getId(),
                                                bcVote.getZxid(),
                                                bcVote.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                bcVote.getPeerEpoch());
                                    }
                                    sendqueue.offer(notmsg);
                                }
                            }
                        }
                    } catch (InterruptedException e) {
                        System.out.println("Interrupted Exception while waiting for new message" +
                                e.toString());
                    }
                }
                LOG.info("WorkerReceiver is down");
            }
        }

提一提上述代码的关键细节:
1.manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS)方法怎么取的

    public Message pollRecvQueue(long timeout, TimeUnit unit)
       throws InterruptedException {
       return recvQueue.poll(timeout, unit);
    }

2.放入recvqueue中

                            if(self.getPeerState() == QuorumPeer.ServerState.LOOKING) {//如果自己也在寻找leader的话
                                recvqueue.offer(n);//就放到recvqueue中

                                /*
                                 * Send a notification back if the peer that sent this
                                 * message is also looking and its logical clock is
                                 * lagging behind.
                                 */
                                //如果发送此消息的对方也正在LOOKING,并且其logical clock落后,则发送通知。
                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock.get())){
                                    Vote v = getVote();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock.get(),
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch());
                                    sendqueue.offer(notmsg);
                                }
                            }

至于RecvWorker线程就自己取调试,比较简单。

最后我们来说说它的验证。这个也非常复杂了。看代码。
org.apache.zookeeper.server.quorum.FastLeaderElection#lookForLeader这个方法是在什么时候调用的呢?就是org.apache.zookeeper.server.quorum.QuorumPeer#run这个线程里面。这里面就描述了找leader的流程。

    //领导者选举
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
            self.start_fle = Time.currentElapsedTime();
        }
        try {
            //初始化一个投票的map
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet();
                //更新选票,这个时候一直投的自己
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            //向其它服务器发送选票
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */
            //不断获取其他服务器的投票信息,直到选出Leader
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)) {
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){//如果取到的为null,说明没有取到消息。
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                else if(validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view for a replica in the voting view.
                     */
                    //处理投票
                    switch (n.state) {
                        case LOOKING:
                            // If notification > current, replace and send messages out
                            //如果参加选举的纪元(Epoch)>当前逻辑的纪元,说明参加选举的肯定比现在选举的更新,更有说服力
                            //所以当前就把纪元更新一下。清空自己得到的投票(比如唐代的尚方宝剑肯定不能拿到清朝使用,一个道理)。
                            //然后就投那个参加选举的纪元。
                            if (n.electionEpoch > logicalclock.get()) {
                                logicalclock.set(n.electionEpoch);
                                recvset.clear();
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                    updateProposal(n.leader, n.zxid, n.peerEpoch);
                                } else {
                                    updateProposal(getInitId(),
                                            getInitLastLoggedZxid(),
                                            getPeerEpoch());
                                }
                                //通知其它服务器
                                sendNotifications();
                            } else if (n.electionEpoch < logicalclock.get()) {//如果比当前小,证明参加选举的已经过时,就直接忽略这个投票信息。
                                if(LOG.isDebugEnabled()){
                                    LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                            + Long.toHexString(n.electionEpoch)
                                            + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                                }
                                break;
                            } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)) {
                                //这个时候epoch肯定一致了。所以就来比较一下zxid,谁得zxid越新就投给谁。
                                //如果连zxid都一致,那就看sid(即serverid),这个肯定是能区分的,所以就确定了投票。
                                //但是如果自己跟自己比,就没有意义了,所以自己就不进。
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                                sendNotifications();
                            }

                            if(LOG.isDebugEnabled()){
                                LOG.debug("Adding vote: from=" + n.sid +
                                        ", proposed leader=" + n.leader +
                                        ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                        ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                            }

                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                            ///判断是否投票结束,就是判断是否有过半的服务器投的同一个。
                            if (termPredicate(recvset,
                                    new Vote(proposedLeader, proposedZxid,
                                            logicalclock.get(), proposedEpoch))) {

                                // Verify if there is any change in the proposed leader
                                //验证投票中过半的投票是否有变动
                                while((n = recvqueue.poll(finalizeWait,
                                        TimeUnit.MILLISECONDS)) != null){
                                    if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                            proposedLeader, proposedZxid, proposedEpoch)){
                                        recvqueue.put(n);
                                        break;
                                    }
                                }

                                /*
                                 * This predicate is true once we don't read any new
                                 * relevant message from the reception queue
                                 */
                                //上面的while一直从recvqueue取,直到取为空,并且还要符合过半机制才结束,所以这里的n基本就是null
                                //然后就分配各自的角色。
                                if (n == null) {
                                    self.setPeerState((proposedLeader == self.getId()) ?
                                            ServerState.LEADING: learningState());

                                    Vote endVote = new Vote(proposedLeader,
                                            proposedZxid,
                                            logicalclock.get(),
                                            proposedEpoch);
                                    leaveInstance(endVote);
                                    return endVote;
                                }
                            }
                            break;
                        case OBSERVING:
                            LOG.debug("Notification from observer: " + n.sid);
                            break;
                        case FOLLOWING:
                        case LEADING:
                            /*
                             * Consider all notifications from the same epoch
                             * together.
                             */
                            if(n.electionEpoch == logicalclock.get()){
                                recvset.put(n.sid, new Vote(n.leader,
                                        n.zxid,
                                        n.electionEpoch,
                                        n.peerEpoch));

                                if(ooePredicate(recvset, outofelection, n)) {
                                    self.setPeerState((n.leader == self.getId()) ?
                                            ServerState.LEADING: learningState());

                                    Vote endVote = new Vote(n.leader,
                                            n.zxid,
                                            n.electionEpoch,
                                            n.peerEpoch);
                                    leaveInstance(endVote);
                                    return endVote;
                                }
                            }

                            /*
                             * Before joining an established ensemble, verify
                             * a majority is following the same leader.
                             */
                            outofelection.put(n.sid, new Vote(n.version,
                                    n.leader,
                                    n.zxid,
                                    n.electionEpoch,
                                    n.peerEpoch,
                                    n.state));

                            if(ooePredicate(outofelection, outofelection, n)) {
                                synchronized(this){
                                    logicalclock.set(n.electionEpoch);
                                    self.setPeerState((n.leader == self.getId()) ?
                                            ServerState.LEADING: learningState());
                                }
                                Vote endVote = new Vote(n.leader,
                                        n.zxid,
                                        n.electionEpoch,
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                            break;
                        default:
                            LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                    n.state, n.sid);
                            break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }

看看上面代码中的totalOrderPredicate方法。为什么要说这个方法呢?因为这个方法多次出现,所以非常重要,这个方法就是拿来判断别的服务器是否比你更好。判断逻辑就是:Epoch,Zxid,myid依次比较,直到比较出一个大小出来。

    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }
        
        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */
        
        return ((newEpoch > curEpoch) || 
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }

那么哪里判断是结束投票呢?org.apache.zookeeper.server.quorum.FastLeaderElection#termPredicate这个方法中。

    protected boolean termPredicate(
            HashMap<Long, Vote> votes,
            Vote vote) {

        HashSet<Long> set = new HashSet<Long>();

        /*
         * First make the views consistent. Sometimes peers will have
         * different zxids for a server depending on timing.
         */
        for (Map.Entry<Long,Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())){
                set.add(entry.getKey());
            }
        }
		//这里的containsQuorum具体在下面就是它的实现
        return self.getQuorumVerifier().containsQuorum(set);
    }
    public boolean containsQuorum(Set<Long> set){
        return (set.size() > half);
    }

分配完后就要确定各自的角色了,值得一提的是它并不是过半马上就设置了,而是做了严谨的判断。
它确定好有leader产生后就会再次判断,是否还是这样的结果。而且会取完recvqueue的所有数据后确定是这个了才会分配角色。recvqueue取完就表示一件事情,其它服务器没有发送投票了,证明他们也选举了leader。所以在这个时候确定leader就能完全判断了。
有人可能会问万一服务器选出来的leader不同呢?
这个问题不存在的,首先过半验证就是说有一半以上的服务器选的同一个,那么想两个服务器选出不一样的leader,那么发送的数据永远不可能存在另一个过半的服务器(当然你非要说数据被篡改了之类的就不用说了,凡事无绝对),即不可能存在这种情况,而且必须是recvqueue空了后才设置角色的,说明基本要所有服务器确认了leader才能为空。所以选出来的一定是同一个leader!!
后面再说选举完了的流程,这篇先到这里,够长了。后面主要了解,设置角色之后各自需要做什么工作,发送一个请求的流程是怎样的?一致性是怎么保证的等等。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值