ZooKeeper Source Code Analysis, Part 2: The Leader Election Flow

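Picking up from the previous article on the startup flow, this article looks at how ZooKeeper (zk) elects a leader.

As the previous article showed, the startup flow touches leader election in two places: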

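  • The start method of org.apache.zookeeper.server.quorum.QuorumPeer calls startLeaderElection(), which creates the objects the election needs.
  • The endless loop in the run method of org.apache.zookeeper.server.quorum.QuorumPeer checks the peer state; whenever the state is LOOKING, it calls lookForLeader() on the election algorithm (FastLeaderElection by default) to trigger an election.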
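
To make the second point concrete, here is a minimal sketch of such a state loop. It is not ZooKeeper code: PeerLoopSketch, ElectionSketch and the bounded run(passes) method are hypothetical names introduced only for illustration, and the sketch assumes the election simply returns the sid of the winner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the peer loop; these are not ZooKeeper's real classes.
class PeerLoopSketch {
    enum State { LOOKING, FOLLOWING, LEADING }

    /** Stand-in for the election algorithm: returns the sid of the elected leader. */
    interface ElectionSketch {
        long lookForLeader();
    }

    private final long myId;
    private final ElectionSketch election;
    private State state = State.LOOKING;
    private final List<State> history = new ArrayList<>();

    PeerLoopSketch(long myId, ElectionSketch election) {
        this.myId = myId;
        this.election = election;
    }

    /** Runs a bounded number of passes; the real loop runs until shutdown. */
    void run(int passes) {
        for (int i = 0; i < passes; i++) {
            history.add(state);
            switch (state) {
                case LOOKING: {
                    // Only a LOOKING peer triggers an election.
                    long leader = election.lookForLeader();
                    state = (leader == myId) ? State.LEADING : State.FOLLOWING;
                    break;
                }
                default:
                    // FOLLOWING or LEADING: a real server would do its work here
                    // and fall back to LOOKING when the leader is lost.
                    break;
            }
        }
    }

    State state() { return state; }
    List<State> history() { return history; }

    public static void main(String[] args) {
        // Server 1 runs an election that (by assumption) elects server 3.
        PeerLoopSketch peer = new PeerLoopSketch(1L, () -> 3L);
        peer.run(3);
        System.out.println(peer.history() + " -> " + peer.state());
    }
}
```

The sketch leaves out observers and the transition back to LOOKING; it only shows that the election is driven from the loop, not from start().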

1. First, let's look at the startLeaderElection() method

Class: org.apache.zookeeper.server.quorum.QuorumPeer

	synchronized public void startLeaderElection() {
    	try {
    		currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
    	} catch(IOException e) {
    		RuntimeException re = new RuntimeException(e.getMessage());
    		re.setStackTrace(e.getStackTrace());
    		throw re;
    	}
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        // If electionType == 0, create a UDP socket for transport
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        // Create the election algorithm that matches electionType
        this.electionAlg = createElectionAlgorithm(electionType);
    }
    protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;
                
        //TODO: use a factory rather than a switch
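        // Create the election algorithm selected by electionAlgorithm. The default is 3 when nothing is configured,
        // and the other options are deprecated, so only case 3 is covered here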
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            // First create the election connection manager
            qcm = new QuorumCnxManager(this);            
            QuorumCnxManager.Listener listener = qcm.listener;           
            if(listener != null){
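                // Start the listener. It accepts election connections from other servers, and the
                // votes received by the communication layer are handed up to the election logic layer
                listener.start();
                // Create the election algorithm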
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

Creating the election connection manager

Class: org.apache.zookeeper.server.quorum.QuorumCnxManager

    public QuorumCnxManager(QuorumPeer self) {
        // Queue holding the votes that the communication layer has received from other servers
        this.recvQueue = new ArrayBlockingQueue<Message>(RECV_CAPACITY);
        // key: server id, value: queue of votes waiting to be sent to that server
        this.queueSendMap = new ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>>();
        // key: server id, value: SendWorker (the sender thread that wraps the Socket)
        this.senderWorkerMap = new ConcurrentHashMap<Long, SendWorker>();
        // key: server id, value: the last message sent to that server
        this.lastMessageSent = new ConcurrentHashMap<Long, ByteBuffer>();
        
        String cnxToValue = System.getProperty("zookeeper.cnxTimeout");
        if(cnxToValue != null){
            this.cnxTO = new Integer(cnxToValue); 
        }
        
        this.self = self;

        // Starts listener thread that waits for connection requests 
        // Create the listener. It opens a ServerSocket on the election port and accepts connections and votes from other servers
        listener = new Listener();
    }

Creating FastLeaderElection

Class: org.apache.zookeeper.server.quorum.FastLeaderElection

    public FastLeaderElection(QuorumPeer self, QuorumCnxManager manager){
        this.stop = false;
        this.manager = manager;
        starter(self, manager);
    }
    private void starter(QuorumPeer self, QuorumCnxManager manager) {
        this.self = self;
        proposedLeader = -1;
        proposedZxid = -1;
		
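        // Queue of outgoing votes in the election logic layer
        sendqueue = new LinkedBlockingQueue<ToSend>();
        // Queue of incoming votes in the election logic layer
        recvqueue = new LinkedBlockingQueue<Notification>();
        // Create the Messenger, which starts two threads: one drains sendqueue, the other fills recvqueue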
        this.messenger = new Messenger(manager);
    }
	Messenger(QuorumCnxManager manager) {
         /*
          * Create the WorkerSender thread. It watches sendqueue, takes the votes
          * produced by the logic layer and hands them to the communication layer
          */
	     this.ws = new WorkerSender(manager);
	
	     Thread t = new Thread(this.ws,
	             "WorkerSender[myid=" + self.getId() + "]");
	     t.setDaemon(true);
	     t.start();
	
         /*
          * Create the WorkerReceiver thread. It takes the votes that the communication
          * layer received from other servers and puts them into recvqueue
          */
	     this.wr = new WorkerReceiver(manager);
	
	     t = new Thread(this.wr,
	             "WorkerReceiver[myid=" + self.getId() + "]");
	     t.setDaemon(true);
	     t.start();
	 }

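From the object creation above we can see that the election is split into two layers: the logic layer, which processes the votes received by the communication layer, and the communication layer, which mainly handles socket operations.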
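
The hand-off between the two layers is a queue drained by a daemon thread. The sketch below illustrates that pattern only; QueueWorkerSketch, the String messages and the in-memory "wire" list are assumptions made for the example and stand in for ToSend, ByteBuffer and the socket layer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a WorkerSender-style thread: it drains the logic
// layer's send queue and passes each message on to the "communication layer".
class QueueWorkerSketch implements Runnable {
    private final BlockingQueue<String> sendqueue;
    // Stand-in for the communication layer: simply records what was handed over.
    private final List<String> wire = Collections.synchronizedList(new ArrayList<String>());
    private volatile boolean stop = false;

    QueueWorkerSketch(BlockingQueue<String> sendqueue) {
        this.sendqueue = sendqueue;
    }

    @Override
    public void run() {
        while (!stop) {
            try {
                // Wait a bounded time so the stop flag is re-checked regularly.
                String m = sendqueue.poll(50, TimeUnit.MILLISECONDS);
                if (m == null) continue;
                wire.add(m);
            } catch (InterruptedException e) {
                break;
            }
        }
    }

    void shutdown() { stop = true; }

    List<String> delivered() { return new ArrayList<String>(wire); }

    /** Pushes the given votes through a worker thread and returns what was delivered. */
    static List<String> demo(List<String> votes) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>(votes);
        QueueWorkerSketch worker = new QueueWorkerSketch(queue);
        Thread t = new Thread(worker, "WorkerSenderSketch");
        t.setDaemon(true);
        t.start();
        try {
            long deadline = System.currentTimeMillis() + 2000;
            while (worker.delivered().size() < votes.size()
                    && System.currentTimeMillis() < deadline) {
                Thread.sleep(10);
            }
            worker.shutdown();
            t.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return worker.delivered();
    }

    public static void main(String[] args) {
        System.out.println(demo(Arrays.asList("vote-1", "vote-2")));
    }
}
```

Polling with a timeout instead of blocking forever is what lets the worker notice the stop flag, which is the same reason the WorkerSender shown later polls sendqueue with a 3000 ms timeout.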

2. Next, let's look at the election logic in FastLeaderElection.lookForLeader()

    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = System.currentTimeMillis();
        }
        try {
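            /*
             * The ballot box for the current election round
             */
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            /*
             * A second ballot box, separate from recvset. It holds votes from
             * servers that report a leader already exists (FOLLOWING or LEADING)
             */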
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                // Increment the local election round (logical clock)
                logicalclock++;
                // Vote for ourselves
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            // Broadcast the vote
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */
            // Keep electing while this server is LOOKING and stop is false
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                // Poll for a vote delivered by the communication layer
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

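                // No vote arrived within the timeout
                if(n == null){
                    // Check whether the communication layer has an empty send queue
                    if(manager.haveDelivered()){
                        /*
                         * At least one send queue is empty, so messages
                         * are being delivered: broadcast the vote again
                         */
                        sendNotifications();
                    } else {
                        /*
                         * No send queue is empty, which suggests that none of
                         * the other servers could be reached, or that they have
                         * not started yet: retry the connections
                         */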
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                // Check that the sender has voting rights; OBSERVERs cannot vote
                else if(self.getVotingView().containsKey(n.sid)) {
                    // Handle the notification according to the sender's state
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        // The received vote belongs to a newer election round than ours
                        if (n.electionEpoch > logicalclock) {
                            // Adopt the election round of the received vote
                            logicalclock = n.electionEpoch;
                            // Empty the ballot box
                            recvset.clear();
                            // Compare the received vote with our own initial vote
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                // The received vote wins: adopt it as our proposal
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                // The received vote loses: propose ourselves
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            // Broadcast the vote
                            sendNotifications();
                        // The received vote belongs to an older election round than ours
                        } else if (n.electionEpoch < logicalclock) {
                            // Discard it and do nothing else
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock));
                            }
                            break;
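                        // The received vote is from the same election round,
                        // so compare it with our current proposal
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            // The received vote wins: adopt it as our proposal
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            // Broadcast the vote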
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }
                        // Put the received vote into the ballot box
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        // Check whether our current proposal is backed by more than half of the voters
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock, proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
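                            /*
                             * Keep polling for votes until the wait times out,
                             * because a late vote could still change the result
                             */
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                // A new vote arrived: compare it with our proposal
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    // It wins, so the result may change: put it back into the queue and leave the loop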
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            /*
                             * Check whether the last poll returned nothing
                             */
                            if (n == null) {
                                /*
                                 * Nothing arrived, so the result is final.
                                 * Set our own state from the winning vote:
                                 * LEADING, FOLLOWING or OBSERVING
                                 */
								 
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock,
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                // Return the winning vote
                                return endVote;
                            }
                        }
                        /*
                         * Otherwise the result may still change, so leave
                         * the switch and process the next notification
                         */
                        break;
                    case OBSERVING:
                        // Notifications from observers are discarded
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        // The sender is FOLLOWING or LEADING
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
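                        // Check whether the vote is from our current election round
                        if(n.electionEpoch == logicalclock){
                            // Put it into the ballot box
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                            // Check whether more than half of the voters follow the same leader
                            if(ooePredicate(recvset, outofelection, n)) {
                                // They do: set our own state from the winning vote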
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                // Return the final vote
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        /*
                         * Record the vote in outofelection so that we can verify
                         * whether a majority follows the same leader
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
                        /*
                         * Check whether more than half of the voters follow the same leader
                         */
                        if(ooePredicate(outofelection, outofelection, n)) {
                            // They do, so the ensemble already has a leader
                            synchronized(this){
                                // Set our own state from the final vote
                                logicalclock = n.electionEpoch;
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            // Return the final vote
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    LOG.warn("Ignoring notification from non-cluster member " + n.sid);
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
        }
    }
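
The "more than half" check that lookForLeader() relies on can be sketched as follows. The body of termPredicate is not shown in this article, so this is an assumption-laden illustration: it assumes a plain majority quorum, reduces each vote to the sid of the proposed leader, and uses the hypothetical names QuorumSketch and hasMajority.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical majority check; each vote is reduced to "voter sid -> proposed leader sid".
class QuorumSketch {
    /** True when strictly more than half of the voting members back the proposed leader. */
    static boolean hasMajority(Map<Long, Long> recvset, long proposedLeader, int votingMembers) {
        Set<Long> supporters = new HashSet<Long>();
        for (Map.Entry<Long, Long> e : recvset.entrySet()) {
            if (e.getValue().longValue() == proposedLeader) {
                supporters.add(e.getKey());
            }
        }
        return supporters.size() > votingMembers / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> recvset = new HashMap<Long, Long>();
        recvset.put(1L, 3L); // server 1 votes for server 3
        recvset.put(3L, 3L); // server 3 votes for itself
        System.out.println(hasMajority(recvset, 3L, 3));
    }
}
```

With three voting members, two matching votes are enough, which is why a three-node ensemble can elect a leader as soon as two servers are up.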

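The analysis above involves a few important operations:

  • Comparing votes
  • Broadcasting votes
  • Receiving votes from the election communication layer

The sid mentioned below is simply the myid configured on each server.

1. Comparing votes

Class: org.apache.zookeeper.server.quorum.FastLeaderElection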

    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }
        
        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */
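        /*
         * 1. If the new vote's epoch (peerEpoch) is higher than the current one, the new vote wins
         * 2. If the epochs are equal and the new vote's zxid is higher, the new vote wins
         * 3. If the epochs and the zxids are both equal, compare the sids (server ids); the new vote wins if its sid is higher
         * Whenever the new vote wins, we have to change our own vote and broadcast it
         */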
        return ((newEpoch > curEpoch) || 
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }
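
The three-level comparison can be tried out in isolation. The sketch below restates the same ordering as early returns; VoteOrderSketch and beats are hypothetical names, and the weight check from the original method is deliberately left out.

```java
// Hypothetical restatement of the vote ordering: epoch first, then zxid, then server id.
class VoteOrderSketch {
    /** True when the new vote should replace the current one. */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) return newEpoch > curEpoch; // rule 1
        if (newZxid != curZxid) return newZxid > curZxid;     // rule 2
        return newId > curId;                                 // rule 3
    }

    public static void main(String[] args) {
        // Same epoch and zxid, so the higher server id decides.
        System.out.println(beats(3, 5, 1, 1, 5, 1));
        // A lower server id still wins when its zxid is higher.
        System.out.println(beats(1, 9, 1, 3, 5, 1));
    }
}
```

Because the server id is only the last tie-breaker, a server with more recent data is preferred over one that merely has a larger sid.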

2. Broadcasting votes

Class: org.apache.zookeeper.server.quorum.FastLeaderElection

    private void sendNotifications() {
        // Iterate over all servers with voting rights, that is, every server that is not an Observer
        for (QuorumServer server : self.getVotingView().values()) {
            long sid = server.id;
            // Build the vote message to send
            ToSend notmsg = new ToSend(ToSend.mType.notification,
                    proposedLeader,
                    proposedZxid,
                    logicalclock,
                    QuorumPeer.ServerState.LOOKING,
                    sid,
                    proposedEpoch);
            if(LOG.isDebugEnabled()){
                LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x"  +
                      Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock)  +
                      " (n.round), " + sid + " (recipient), " + self.getId() +
                      " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
            }
            // Put the vote into the send queue of the election logic layer
            sendqueue.offer(notmsg);
        }
    }

Class: org.apache.zookeeper.server.quorum.FastLeaderElection.Messenger.WorkerSender

        class WorkerSender implements Runnable {
            volatile boolean stop;
            QuorumCnxManager manager;

            WorkerSender(QuorumCnxManager manager){
                this.stop = false;
                this.manager = manager;
            }

            public void run() {
                while (!stop) {
                    try {
                        // Take the next message from sendqueue
                        ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
                        // Nothing arrived in time: keep polling
                        if(m == null) continue;

                        // Otherwise hand the message to process()
                        process(m);
                    } catch (InterruptedException e) {
                        break;
                    }
                }
                LOG.info("WorkerSender is down");
            }
            void process(ToSend m) {
                // Serialize the message into a ByteBuffer for sending
                ByteBuffer requestBuffer = buildMsg(m.state.ordinal(), 
                                                        m.leader,
                                                        m.zxid, 
                                                        m.electionEpoch, 
                                                        m.peerEpoch);
                // Send the vote through the connection manager
                manager.toSend(m.sid, requestBuffer);
            }
        }
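
process() turns a vote into a ByteBuffer before handing it to the connection manager. buildMsg itself is not shown in this article, so the sketch below is only an assumed layout that mirrors the argument order of the call (state, leader, zxid, electionEpoch, peerEpoch); the real wire format may carry additional fields. VoteCodecSketch, encode and decode are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical fixed-size encoding of a vote; the field order is an assumption.
class VoteCodecSketch {
    static final int SIZE = 4 + 8 + 8 + 8 + 8; // one int followed by four longs

    static ByteBuffer encode(int state, long leader, long zxid,
                             long electionEpoch, long peerEpoch) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE);
        buf.putInt(state);
        buf.putLong(leader);
        buf.putLong(zxid);
        buf.putLong(electionEpoch);
        buf.putLong(peerEpoch);
        buf.flip(); // switch from writing to reading
        return buf;
    }

    /** Returns {state, leader, zxid, electionEpoch, peerEpoch} without moving buf's position. */
    static long[] decode(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        view.position(0);
        long state = view.getInt();
        long leader = view.getLong();
        long zxid = view.getLong();
        long electionEpoch = view.getLong();
        long peerEpoch = view.getLong();
        return new long[] { state, leader, zxid, electionEpoch, peerEpoch };
    }

    public static void main(String[] args) {
        ByteBuffer msg = encode(0, 3L, 0x100000005L, 2L, 1L);
        System.out.println(Arrays.toString(decode(msg)));
    }
}
```

Reading through duplicate() keeps the original buffer's position untouched, which matters when the same buffer is later handed to another consumer.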

Class: org.apache.zookeeper.server.quorum.QuorumCnxManager

    public void toSend(Long sid, ByteBuffer b) {
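        // Check whether the message is addressed to this server itself
        if (self.getId() == sid) {
            // If so, skip the network and put it straight into the communication layer's receive queue (recvQueue)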
             b.position(0);
             addToRecvQueue(new Message(b.duplicate(), sid));
            /*
             * Otherwise send to the corresponding thread to send.
             */
        } else {

              /*
               * Check whether a send queue already exists for this sid;
               * each server has its own send queue
               */
             if (!queueSendMap.containsKey(sid)) {
                 // None yet: create a bounded send queue
                 ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(
                         SEND_CAPACITY);
                 // Register it in queueSendMap (key: sid, value: send queue)
                 queueSendMap.put(sid, bq);
                 // Put the ByteBuffer to send into the queue
                 addToSendQueue(bq, b);

             } else {
                 // A send queue already exists, so take it from the map
                 ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
                 if(bq != null){
                     // Put the ByteBuffer to send into the queue
                     addToSendQueue(bq, b);
                 } else {
                     LOG.error("No queue for server " + sid);
                 }
             }
             // Connect to the server identified by sid
             connectOne(sid);                
        }
    }
    synchronized void connectOne(long sid){
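        // Look up the SendWorker (the vote sender, which wraps the Socket) in the map
        if (senderWorkerMap.get(sid) == null){
            // No worker yet, so a connection has to be created.
            // First look up the election address of the target server
            InetSocketAddress electionAddr;
            // Check whether this sid appears in the configuration
            if (self.quorumPeers.containsKey(sid)) {
                // It does: take the address and port from the peer map
                electionAddr = self.quorumPeers.get(sid).electionAddr;
            } else {
                // Unknown sid: return immediately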
                LOG.warn("Invalid server id: " + sid);
                return;
            }
            try {

                if (LOG.isDebugEnabled()) {
                    LOG.debug("Opening channel to server " + sid);
                }
                // Create a socket and connect to the target server
                Socket sock = new Socket();
                setSockOpts(sock);
                sock.connect(self.getView().get(sid).electionAddr, cnxTO);
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Connected to server " + sid);
                }
                // Perform the initial handshake on the new connection
                initiateConnection(sock, sid);
            } catch (UnresolvedAddressException e) {
                // Sun doesn't include the address that causes this
                // exception to be thrown, also UAE cannot be wrapped cleanly
                // so we log the exception in order to capture this critical
                // detail.
                LOG.warn("Cannot open channel to " + sid
                        + " at election address " + electionAddr, e);
                throw e;
            } catch (IOException e) {
                LOG.warn("Cannot open channel to " + sid
                        + " at election address " + electionAddr,
                        e);
            }
        } else {
            LOG.debug("There is a connection already for server " + sid);
        }
    }
    public boolean initiateConnection(Socket sock, Long sid) {
        DataOutputStream dout = null;
        try {
            // Sending id and challenge
            // Get the socket's output stream
            dout = new DataOutputStream(sock.getOutputStream());
            // Right after connecting, send our own sid to the other side
            dout.writeLong(self.getId());
            dout.flush();
        } catch (IOException e) {
            LOG.warn("Ignoring exception reading or writing challenge: ", e);
            closeSocket(sock);
            return false;
        }
        
        // If lost the challenge, then drop the new connection
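        /*
         * If the remote sid is larger than our own, close the socket and wait for
         * the larger sid to connect to us. The rule for election connections is
         * that the server with the larger sid connects to the smaller one
         */
        if (sid > self.getId()) {
            LOG.info("Have smaller server identifier, so dropping the " +
                     "connection: (" + sid + ", " + self.getId() + ")");
            // Close the connection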
            closeSocket(sock);
            // Otherwise proceed with the connection
        } else {
            /*
             * Create the SendWorker (sender) thread
             * and the RecvWorker (receiver) thread
             */
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, sid, sw);
            sw.setRecv(rw);					
            SendWorker vsw = senderWorkerMap.get(sid);            
            if(vsw != null)
                vsw.finish();            
            // Register the SendWorker in senderWorkerMap
            senderWorkerMap.put(sid, sw);
            // Check whether queueSendMap already has a send queue for this sid
            if (!queueSendMap.containsKey(sid)) {
                // Create one if it does not
                queueSendMap.put(sid, new ArrayBlockingQueue<ByteBuffer>(
                        SEND_CAPACITY));
            }
            // Start both threads
            sw.start();
            rw.start();
            
            return true;    
            
        }
        return false;
    }
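
The "one bounded send queue per sid" bookkeeping in toSend can be sketched on its own. This is not the QuorumCnxManager implementation: SendQueuesSketch is a hypothetical class, it uses computeIfAbsent instead of the containsKey/put sequence shown above, and the policy of dropping the oldest message when the queue is full is an assumption, because addToSendQueue is not shown in this article.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical per-server send queues; String stands in for ByteBuffer.
class SendQueuesSketch {
    private final ConcurrentMap<Long, ArrayBlockingQueue<String>> queues =
            new ConcurrentHashMap<Long, ArrayBlockingQueue<String>>();
    private final int capacity;

    SendQueuesSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Queues msg for sid, creating the queue on first use. */
    void enqueue(long sid, String msg) {
        // Atomic get-or-create, so two threads cannot register two queues for one sid.
        ArrayBlockingQueue<String> q =
                queues.computeIfAbsent(sid, k -> new ArrayBlockingQueue<String>(capacity));
        synchronized (q) {
            // Assumed policy: when full, drop the oldest message so the newest vote is kept.
            if (q.remainingCapacity() == 0) {
                q.poll();
            }
            q.offer(msg);
        }
    }

    /** Returns the next message for sid, or null when nothing is queued. */
    String next(long sid) {
        ArrayBlockingQueue<String> q = queues.get(sid);
        return q == null ? null : q.poll();
    }

    public static void main(String[] args) {
        SendQueuesSketch queues = new SendQueuesSketch(1);
        queues.enqueue(2L, "vote for 1");
        queues.enqueue(2L, "vote for 3"); // replaces the older vote
        System.out.println(queues.next(2L));
    }
}
```

For election traffic an older vote is superseded by a newer one, so keeping only the most recent messages is a reasonable trade-off for a sketch like this.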

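As noted above, when the sid of the server we are sending a vote to is larger than our own, we send our own sid, then close the connection ourselves and wait for the larger sid to connect to us. Let's look at the receiving side of that logic.

Class: org.apache.zookeeper.server.quorum.QuorumCnxManager.Listener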

        public void run() {
            int numRetries = 0;
            InetSocketAddress addr;
            while((!shutdown) && (numRetries < 3)){
                try {
                    // Listen on the election port
                    ss = new ServerSocket();
                    ss.setReuseAddress(true);
                    if (self.getQuorumListenOnAllIPs()) {
                        int port = self.quorumPeers.get(self.getId()).electionAddr.getPort();
                        addr = new InetSocketAddress(port);
                    } else {
                        addr = self.quorumPeers.get(self.getId()).electionAddr;
                    }
                    LOG.info("My election bind port: " + addr.toString());
                    setName(self.quorumPeers.get(self.getId()).electionAddr
                            .toString());
                    ss.bind(addr);
                    while (!shutdown) {
                        // Accept an incoming socket connection
                        Socket client = ss.accept();
                        setSockOpts(client);
                        LOG.info("Received connection request "
                                + client.getRemoteSocketAddress());
                        // Hand the accepted connection over for the handshake
                        receiveConnection(client);
                        numRetries = 0;
                    }
                } catch (IOException e) {
                    LOG.error("Exception while listening", e);
                    numRetries++;
                    try {
                        ss.close();
                        Thread.sleep(1000);
                    } catch (IOException ie) {
                        LOG.error("Error closing server socket", ie);
                    } catch (InterruptedException ie) {
                        LOG.error("Interrupted while sleeping. " +
                                  "Ignoring exception", ie);
                    }
                }
            }
            LOG.info("Leaving listener");
            if (!shutdown) {
                LOG.error("As I'm leaving the listener thread, "
                        + "I won't be able to participate in leader "
                        + "election any longer: "
                        + self.quorumPeers.get(self.getId()).electionAddr);
            }
        }

    public boolean receiveConnection(Socket sock) {
        Long sid = null;
        
        try {
            // Read server id
            DataInputStream din = new DataInputStream(sock.getInputStream());
            // Read the sid sent by the remote server
            sid = din.readLong();
            if (sid < 0) { // this is not a server id but a protocol version (see ZOOKEEPER-1633)
                sid = din.readLong();
                // next comes the #bytes in the remainder of the message
                int num_remaining_bytes = din.readInt();
                byte[] b = new byte[num_remaining_bytes];
                // remove the remainder of the message from din
                int num_read = din.read(b);
                if (num_read != num_remaining_bytes) {
                    LOG.error("Read only " + num_read + " bytes out of " + num_remaining_bytes + " sent by server " + sid);
                }
            }
            if (sid == QuorumPeer.OBSERVER_ID) {
                /*
                 * Choose identifier at random. We need a value to identify
                 * the connection.
                 */
                
                sid = observerCounter--;
                LOG.info("Setting arbitrary identifier to observer: " + sid);
            }
        } catch (IOException e) {
            closeSocket(sock);
            LOG.warn("Exception reading or writing challenge: " + e.toString());
            return false;
        }
        
        //If wins the challenge, then close the new connection.
        // Same rule as before: if the received sid is smaller than our own id, close this connection and connect to that server ourselves
        if (sid < self.getId()) {
            /*
             * This replica might still believe that the connection to sid is
             * up, so we have to shut down the workers before trying to open a
             * new connection.
             */
            SendWorker sw = senderWorkerMap.get(sid);
            if (sw != null) {
                sw.finish();
            }

            /*
             * Now we start a new connection
             */
            LOG.debug("Create new connection to server: " + sid);
            closeSocket(sock);
            connectOne(sid);

            // Otherwise start worker threads to receive data.
        } else {
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, sid, sw);
            sw.setRecv(rw);

            SendWorker vsw = senderWorkerMap.get(sid);
            
            if(vsw != null)
                vsw.finish();
            
            senderWorkerMap.put(sid, sw);
            
            if (!queueSendMap.containsKey(sid)) {
                queueSendMap.put(sid, new ArrayBlockingQueue<ByteBuffer>(
                        SEND_CAPACITY));
            }
            
            sw.start();
            rw.start();
            
            return true;    
        }
        return false;
    }
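
Taken together, initiateConnection and receiveConnection make sure that each pair of servers ends up with a single connection, opened by the server with the larger sid. The decision on each side can be summarized in a small sketch; ConnectionRuleSketch and the Action names are hypothetical and only label the branches seen in the two methods above.

```java
// Hypothetical summary of the connection tie-break between two servers.
class ConnectionRuleSketch {
    enum Action { KEEP, DROP_AND_WAIT, DROP_AND_CONNECT_BACK }

    /** Decision on the side that opened the connection (compare initiateConnection). */
    static Action onOutgoing(long myId, long remoteSid) {
        // We have the smaller sid: drop the socket and wait for the peer to connect.
        return remoteSid > myId ? Action.DROP_AND_WAIT : Action.KEEP;
    }

    /** Decision on the side that accepted the connection (compare receiveConnection). */
    static Action onIncoming(long myId, long remoteSid) {
        // The caller has the smaller sid: close its socket and connect to it ourselves.
        return remoteSid < myId ? Action.DROP_AND_CONNECT_BACK : Action.KEEP;
    }

    public static void main(String[] args) {
        // Server 1 dials server 3: both sides agree that 3 must open the connection.
        System.out.println(onOutgoing(1L, 3L) + " / " + onIncoming(3L, 1L));
        // Server 3 dials server 1: both sides keep the connection.
        System.out.println(onOutgoing(3L, 1L) + " / " + onIncoming(1L, 3L));
    }
}
```

Both sides apply the same comparison, so they reach the same conclusion without exchanging anything beyond the initial sid.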

Class: org.apache.zookeeper.server.quorum.QuorumCnxManager.SendWorker

        @Override
        public void run() {
            threadCnt.incrementAndGet();
            try {
                ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
                if (bq == null || isSendQueueEmpty(bq)) {
                   ByteBuffer b = lastMessageSent.get(sid);
                   if (b != null) {
                       LOG.debug("Attempting to send lastMessage to sid=" + sid);
                       send(b);
                   }
                }
            } catch (IOException e) {
                LOG.error("Failed to send last message. Shutting down thread.", e);
                this.finish();
            }
            
            try {
                // Loop until the worker is stopped or the socket is gone
                while (running && !shutdown && sock != null) {

                    ByteBuffer b = null;
                    try {
                        // Fetch the send queue that belongs to this sid
                        ArrayBlockingQueue<ByteBuffer> bq = queueSendMap
                                .get(sid);
                        if (bq != null) {
                            b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
                        } else {
                            LOG.error("No queue of incoming messages for " +
                                      "server " + sid);
                            break;
                        }

                        if(b != null){
                            lastMessageSent.put(sid, b);
                            // Call send() to transmit the message
                            send(b);
                        }
                    } catch (InterruptedException e) {
                        LOG.warn("Interrupted while waiting for message on queue",
                                e);
                    }
                }
            } catch (Exception e) {
                LOG.warn("Exception when using channel: for id " + sid + " my id = " + 
                        self.getId() + " error = " + e);
            }
            this.finish();
            LOG.warn("Send worker leaving thread");
        }
        synchronized void send(ByteBuffer b) throws IOException {
            byte[] msgBytes = new byte[b.capacity()];
            try {
                b.position(0);
                b.get(msgBytes);
            } catch (BufferUnderflowException be) {
                LOG.error("BufferUnderflowException ", be);
                return;
            }
            // Write to the socket: a length prefix first, then the payload
            dout.writeInt(b.capacity());
            dout.write(b.array());
            dout.flush();
        }
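
The shape of the SendWorker loop (resend the last message if the queue is empty at startup, then poll with a timeout and remember whatever was sent) can be sketched on its own. This is a simplified model, not the real class: `SendLoopSketch`, `runOnce` and the in-memory `wire` list are hypothetical, messages are plain strings instead of `ByteBuffer`, and the loop stops at the first poll timeout instead of running until shutdown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical, single-threaded model of the SendWorker send loop. */
final class SendLoopSketch {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(4);
    private final List<String> wire = new ArrayList<>(); // stands in for the socket
    private String lastMessageSent;                      // survives worker restarts

    void enqueue(String msg) {
        queue.offer(msg);
    }

    List<String> wire() {
        return wire;
    }

    /** Models one worker lifetime: optional resend, then drain the queue. */
    void runOnce(long pollMillis) {
        // Nothing queued at startup: send the last message again, in case
        // the peer never received it over the previous connection.
        if (queue.isEmpty() && lastMessageSent != null) {
            wire.add(lastMessageSent);
        }
        try {
            String msg;
            // The real loop keeps polling until shutdown; the sketch stops
            // as soon as one poll times out.
            while ((msg = queue.poll(pollMillis, TimeUnit.MILLISECONDS)) != null) {
                lastMessageSent = msg;
                wire.add(msg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        SendLoopSketch worker = new SendLoopSketch();
        worker.enqueue("vote-1");
        worker.enqueue("vote-2");
        worker.runOnce(10);
        worker.runOnce(10); // simulated restart with an empty queue
        System.out.println(worker.wire());
    }
}
```

Polling with a timeout rather than blocking forever is what lets the real loop re-check its `running` and `shutdown` flags about once per second.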

3. Receiving votes from the election communication layer

Class: org.apache.zookeeper.server.quorum.QuorumCnxManager.RecvWorker

        public void run() {
            threadCnt.incrementAndGet();
            try {
                while (running && !shutdown && sock != null) {
                    /**
                     * Reads the first int to determine the length of the
                     * message
                     */
                    // First read the length of the incoming message
                    int length = din.readInt();
                    if (length <= 0 || length > PACKETMAXSIZE) {
                        throw new IOException(
                                "Received packet with invalid packet: "
                                        + length);
                    }
                    /**
                     * Allocates a new ByteBuffer to receive the message
                     */
                    byte[] msgArray = new byte[length];
                    // Then read exactly that many bytes
                    din.readFully(msgArray, 0, length);
                    ByteBuffer message = ByteBuffer.wrap(msgArray);
                    // Put the message into recvQueue, the receive queue of the election communication layer
                    addToRecvQueue(new Message(message.duplicate(), sid));
                }
            } catch (Exception e) {
                LOG.warn("Connection broken for id " + sid + ", my id = " + 
                        self.getId() + ", error = " , e);
            } finally {
                LOG.warn("Interrupting SendWorker");
                sw.finish();
                if (sock != null) {
                    closeSocket(sock);
                }
            }
        }
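
SendWorker and RecvWorker agree on a simple length-prefixed framing: a 4-byte length followed by the payload. A minimal round-trip sketch of that framing over in-memory streams is shown below. `FrameCodecSketch` and its methods are hypothetical, and `MAX_FRAME` is an assumed stand-in for `PACKETMAXSIZE`, whose value does not appear in the listing above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical round trip of the length-prefixed framing used by the workers. */
final class FrameCodecSketch {
    // Assumed upper bound; stands in for PACKETMAXSIZE.
    static final int MAX_FRAME = 512 * 1024;

    /** Sender side: 4-byte length, then the payload bytes. */
    static byte[] writeFrame(byte[] payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream dout = new DataOutputStream(bytes);
            dout.writeInt(payload.length);
            dout.write(payload);
            dout.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Receiver side: read the length, validate it, then read exactly that many bytes. */
    static byte[] readFrame(byte[] wire) {
        try {
            DataInputStream din = new DataInputStream(new ByteArrayInputStream(wire));
            int length = din.readInt();
            if (length <= 0 || length > MAX_FRAME) {
                throw new IOException("Received packet with invalid length: " + length);
            }
            byte[] payload = new byte[length];
            din.readFully(payload, 0, length);
            return payload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] frame = writeFrame("vote".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(readFrame(frame), StandardCharsets.UTF_8));
    }
}
```

Validating the length before allocating the array keeps a corrupt or hostile length prefix from triggering a huge allocation, which is the purpose of the range check in RecvWorker.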

Class: org.apache.zookeeper.server.quorum.FastLeaderElection.Messenger.WorkerReceiver

public void run() {

                Message response;
                while (!stop) {
                    // Sleeps on receive
                    try{
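                        // Take a message from the receive queue of the election communication layer
                        response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
                        // Nothing arrived within the timeout: loop again
                        if(response == null) continue;
                        /*
                         * Check whether the message was sent by an Observer,
                         * i.e. the sender is not part of the voting view.
                         */
                        if(!self.getVotingView().containsKey(response.sid)){
                            /*
                             * The vote came from an Observer: there is nothing
                             * to process, we only put our own current vote on
                             * the send queue of the election logic layer.
                             */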
                            Vote current = self.getCurrentVote();
                            ToSend notmsg = new ToSend(ToSend.mType.notification,
                                    current.getId(),
                                    current.getZxid(),
                                    logicalclock,
                                    self.getPeerState(),
                                    response.sid,
                                    current.getPeerEpoch());

                            sendqueue.offer(notmsg);
                        } else {
                            // The message was not sent by an Observer
                            // Receive new message
                            if (LOG.isDebugEnabled()) {
                                LOG.debug("Receive new notification message. My id = "
                                        + self.getId());
                            }

                            /*
                             * We check for 28 bytes for backward compatibility
                             */
                            // A message shorter than 28 bytes is invalid: ignore it and loop again
                            if (response.buffer.capacity() < 28) {
                                LOG.error("Got a short response: "
                                        + response.buffer.capacity());
                                continue;
                            }
                            boolean backCompatibility = (response.buffer.capacity() == 28);
                            response.buffer.clear();

                            // Instantiate Notification and set its attributes
                            Notification n = new Notification();
                            
                            // State of peer that sent this message
                            QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
                            
                            switch (response.buffer.getInt()) {
                            case 0:
                                ackstate = QuorumPeer.ServerState.LOOKING;
                                break;
                            case 1:
                                ackstate = QuorumPeer.ServerState.FOLLOWING;
                                break;
                            case 2:
                                ackstate = QuorumPeer.ServerState.LEADING;
                                break;
                            case 3:
                                ackstate = QuorumPeer.ServerState.OBSERVING;
                                break;
                            default:
                                continue;
                            }
                            
                            n.leader = response.buffer.getLong();
                            n.zxid = response.buffer.getLong();
                            n.electionEpoch = response.buffer.getLong();
                            n.state = ackstate;
                            n.sid = response.sid;
                            if(!backCompatibility){
                                n.peerEpoch = response.buffer.getLong();
                            } else {
                                if(LOG.isInfoEnabled()){
                                    LOG.info("Backward compatibility mode, server id=" + n.sid);
                                }
                                n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
                            }

                            /*
                             * Version added in 3.4.6
                             */

                            n.version = (response.buffer.remaining() >= 4) ? 
                                         response.buffer.getInt() : 0x0;

                            /*
                             * Print notification info
                             */
                            if(LOG.isInfoEnabled()){
                                printNotification(n);
                            }

                            /*
                             * If this server is looking, then send proposed leader
                             */

                            if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                                recvqueue.offer(n);

                                /*
                                 * Send a notification back if the peer that sent this
                                 * message is also looking and its logical clock is
                                 * lagging behind.
                                 */
                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock)){
                                    Vote v = getVote();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock,
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch());
                                    sendqueue.offer(notmsg);
                                }
                            } else {
                                /*
                                 * If this server is not looking, but the one that sent the ack
                                 * is looking, then send back what it believes to be the leader.
                                 */
                                Vote current = self.getCurrentVote();
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id =  " +
                                                self.getId() + " recipient=" +
                                                response.sid + " zxid=0x" +
                                                Long.toHexString(current.getZxid()) +
                                                " leader=" + current.getId());
                                    }
                                    
                                    ToSend notmsg;
                                    if(n.version > 0x0) {
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                current.getId(),
                                                current.getZxid(),
                                                current.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                current.getPeerEpoch());
                                        
                                    } else {
                                        Vote bcVote = self.getBCVote();
                                        notmsg = new ToSend(
                                                ToSend.mType.notification,
                                                bcVote.getId(),
                                                bcVote.getZxid(),
                                                bcVote.getElectionEpoch(),
                                                self.getPeerState(),
                                                response.sid,
                                                bcVote.getPeerEpoch());
                                    }
                                    sendqueue.offer(notmsg);
                                }
                            }
                        }
                    } catch (InterruptedException e) {
                        System.out.println("Interrupted Exception while waiting for new message" +
                                e.toString());
                    }
                }
                LOG.info("WorkerReceiver is down");
            }
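
The wire layout that WorkerReceiver parses can be read off the `getInt`/`getLong` calls above: a 4-byte state, then 8-byte leader, zxid and electionEpoch (28 bytes, the old format), optionally followed by an 8-byte peerEpoch and a 4-byte version. The sketch below decodes that layout from a byte array. `NotificationSketch` is a hypothetical class, and deriving the epoch from the high 32 bits of the zxid is an assumption about what `ZxidUtils.getEpochFromZxid` does.

```java
import java.nio.ByteBuffer;

/** Hypothetical decoder for the notification layout read by WorkerReceiver. */
final class NotificationSketch {
    int state;
    long leader;
    long zxid;
    long electionEpoch;
    long peerEpoch;
    int version;

    static NotificationSketch decode(byte[] msg) {
        // 28 bytes is the old format; anything shorter, or a truncated
        // peerEpoch field, cannot be parsed.
        if (msg.length < 28 || (msg.length > 28 && msg.length < 36)) {
            throw new IllegalArgumentException("Got a short response: " + msg.length);
        }
        boolean backCompatibility = (msg.length == 28);
        ByteBuffer buf = ByteBuffer.wrap(msg);

        NotificationSketch n = new NotificationSketch();
        n.state = buf.getInt();
        n.leader = buf.getLong();
        n.zxid = buf.getLong();
        n.electionEpoch = buf.getLong();
        if (!backCompatibility) {
            n.peerEpoch = buf.getLong();
        } else {
            // Assumption: the epoch is the high 32 bits of the zxid.
            n.peerEpoch = n.zxid >> 32;
        }
        // The version field is optional; 0 means "not present".
        n.version = (buf.remaining() >= 4) ? buf.getInt() : 0;
        return n;
    }

    public static void main(String[] args) {
        ByteBuffer old = ByteBuffer.allocate(28);
        old.putInt(0).putLong(2L).putLong((5L << 32) | 7L).putLong(9L);
        NotificationSketch n = decode(old.array());
        System.out.println("leader=" + n.leader + " peerEpoch=" + n.peerEpoch);
    }
}
```

Checking the remaining bytes before reading the version is what lets newer servers accept messages from peers that do not send that field.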

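ZooKeeper's leader election uses a large number of threads and queues. The tedious part is working out which thread listens on which queue, but the code as a whole is not hard to read. A simplified flowchart follows:
[Figure: simplified flowchart of the election process]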
