Let's start with the diagram:
Seeing this diagram for the first time is bound to be confusing; there is a lot going on and the architecture is complex. But if you debug through the source alongside this post, it should be manageable.
We begin at the point where the server determines it is running in cluster mode, i.e. org.apache.zookeeper.server.quorum.QuorumPeerMain#runFromConfig. The key call here is the quorumPeer.start() method:
@Override
public synchronized void start() {
//Load transaction logs and snapshots; essentially recovery: restore the data into memory and obtain the current epoch, etc.
loadDataBase();
//Start the connection factory thread, doing the same job as in standalone mode.
cnxnFactory.start();
//Leader election (important).
//https://www.cnblogs.com/johnvwan/p/9546909.html explains this very well.
startLeaderElection();
//Use the election algorithm to pick a leader and start syncing.
super.start();
}
The first two calls need no discussion; they are exactly the same as standalone startup. The third and fourth are what matter. Step into org.apache.zookeeper.server.quorum.QuorumPeer#startLeaderElection:
synchronized public void startLeaderElection() {
try {
//Arguments: 1. myid; 2. the latest zxid; 3. the current epoch.
//First, vote for ourselves: write our own info onto the ballot.
currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
} catch(IOException e) {
RuntimeException re = new RuntimeException(e.getMessage());
re.setStackTrace(e.getStackTrace());
throw re;
}
//getView().values() returns every server.N entry from the config file, including observers.
for (QuorumServer p : getView().values()) {
if (p.id == myid) {//this entry is ourselves
//Why this check? Because a vote must tell the other servers not only whom you voted for, but also that it was you who cast it.
myQuorumAddr = p.addr;
break;
}
}
if (myQuorumAddr == null) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
if (electionType == 0) {
try {
udpSocket = new DatagramSocket(myQuorumAddr.getPort());
responder = new ResponderThread();
responder.start();
} catch (SocketException e) {
throw new RuntimeException(e);
}
}
//Create the election algorithm; the default is 3, configurable via electionAlg.
//0, 1 and 2 also exist but are deprecated, so in practice nobody changes this.
//Steps:
//1. Initialize QuorumCnxManager.
//2. Initialize QuorumCnxManager.Listener.
//3. Run QuorumCnxManager.Listener.
//4. Run QuorumCnxManager.
//5. Return a FastLeaderElection object.
//Flow:
//1. Put our own vote into queueSendMap, used for sending our vote; naturally we vote for ourselves first.
//2. Send votes from queueSendMap to the other servers; a vote addressed to this very machine goes straight into recvQueue, otherwise it is sent over a socket.
//3. Keep checking the votes in recvQueue; once, at some point, a majority of servers (identified by sid) agree, the candidate they voted for is set to leader and the rest become followers.
//Connection rule between servers: the larger id connects to the smaller id; the smaller may not connect to the larger.
this.electionAlg = createElectionAlgorithm(electionType);
}
The overall logic of this method is laid out in the comments above. Which part is the most complex? As the comments hint, it is createElectionAlgorithm.
protected Election createElectionAlgorithm(int electionAlgorithm){
Election le=null;
//TODO: use a factory rather than a switch
switch (electionAlgorithm) {
case 0:
le = new LeaderElection(this);
break;
case 1:
le = new AuthFastLeaderElection(this);
break;
case 2:
le = new AuthFastLeaderElection(this, true);
break;
case 3:
//Create a QuorumCnxManager; it holds several important fields:
//ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap: per-server queues of outgoing messages
//ConcurrentHashMap<Long, SendWorker> senderWorkerMap: records each peer server's id and its corresponding SendWorker
//ArrayBlockingQueue<Message> recvQueue: holds received votes
//QuorumCnxManager.Listener
qcm = createCnxnManager();
//Think of it as a listener for now, since it is a thread.
QuorumCnxManager.Listener listener = qcm.listener;
if(listener != null){
listener.start();//start the listener
//This part is fairly involved.
le = new FastLeaderElection(this, qcm);
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
default:
assert false;
}
return le;
}
Since 0, 1 and 2 are all deprecated, we go straight to case 3. Where does the complexity hide? In le = new FastLeaderElection(this, qcm); and that is today's focus. Fast leader election gets intricate, so let's take it step by step.
FastLeaderElection initialization
public FastLeaderElection(QuorumPeer self, QuorumCnxManager manager){
this.stop = false;
this.manager = manager;
starter(self, manager);
}
What does it do? Just assigns the fields and then enters the starter method. Nothing much so far; let's look at that method.
starter(self, manager)
private void starter(QuorumPeer self, QuorumCnxManager manager) {
this.self = self;
proposedLeader = -1;
proposedZxid = -1;
sendqueue = new LinkedBlockingQueue<ToSend>();
recvqueue = new LinkedBlockingQueue<Notification>();
this.messenger = new Messenger(manager);
}
The first three lines are plain initialization. The last three are what matter; they look unremarkable but are very significant. You don't need to memorize anything yet except one point: two queues are initialized here, sendqueue and recvqueue, and they do what their names suggest. Keep them in mind; they come up again later.
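Because four similarly named queues show up in this flow, here is a small cheat sheet. The field names and types come from the source we walk through below; the class itself is purely my own reference sketch, not real ZooKeeper code (element types simplified to Object since ToSend, Notification and Message are inner classes):
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Reference only; not actual ZooKeeper code. FLE = FastLeaderElection, QCM = QuorumCnxManager.
class ElectionQueueCheatSheet {
    LinkedBlockingQueue<Object> sendqueue;   // FLE: ToSend votes we want to broadcast (WorkerSender drains it)
    LinkedBlockingQueue<Object> recvqueue;   // FLE: Notification votes consumed by lookForLeader (WorkerReceiver fills it)
    ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap; // QCM: per-peer outbound byte buffers
    ArrayBlockingQueue<Object> recvQueue;    // QCM: raw Messages read off the sockets (RecvWorker fills it)
}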
this.messenger = new Messenger(manager)
Messenger(QuorumCnxManager manager) {
this.ws = new WorkerSender(manager);
Thread t = new Thread(this.ws,
"WorkerSender[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
this.wr = new WorkerReceiver(manager);
t = new Thread(this.wr,
"WorkerReceiver[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
}
This also looks simple: just create a WorkerSender and a WorkerReceiver and start each one on its own daemon thread. From here on, multiple threads are involved. Let's look at WorkerSender first, then WorkerReceiver.
WorkerSender
public void run() {
while (!stop) {
try {
//What we take from here was enqueued for every participant. When was it enqueued?
// By the sendNotifications method.
ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
if(m == null) continue;
//process the message
process(m);
} catch (InterruptedException e) {
break;
}
}
LOG.info("WorkerSender is down");
}
Remember sendqueue? It came up during initialization earlier. This is the consuming side, so something must be putting messages in. From where? The comment gives it away: sendNotifications. But when does that method get called? Curious?
Debugging reveals it!
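For reference, here is roughly what sendNotifications looks like (abridged from the 3.4 branch; details vary slightly by version): it loops over every voting member and queues one ToSend carrying our current proposal.
private void sendNotifications() {
    // One notification per voting member, carrying our current proposal.
    for (QuorumServer server : self.getVotingView().values()) {
        long sid = server.id;
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,                    // who we currently vote for
                proposedZxid,                      // that candidate's zxid, as we know it
                logicalclock.get(),                // our election epoch
                QuorumPeer.ServerState.LOOKING,
                sid,                               // destination server id
                proposedEpoch);
        sendqueue.offer(notmsg);                   // WorkerSender picks these up
    }
}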
Tracing backwards leads to org.apache.zookeeper.server.quorum.QuorumPeer#run. And when is that thread started? By the super.start() call in the very first code snippet. Let's look at this code:
public void run() {
setName("QuorumPeer" + "[myid=" + getId() + "]" +
cnxnFactory.getLocalAddress());
System.out.println("当前线程:" + Thread.currentThread().getName());
LOG.debug("Starting quorum peer");
//This try block just registers JMX MBeans for monitoring; it is not important for the election.
try {
jmxQuorumBean = new QuorumBean(this);
MBeanRegistry.getInstance().register(jmxQuorumBean, null);
for(QuorumServer s: getView().values()){
ZKMBeanInfo p;
if (getId() == s.id) {
p = jmxLocalPeerBean = new LocalPeerBean(this);
try {
MBeanRegistry.getInstance().register(p, jmxQuorumBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
jmxLocalPeerBean = null;
}
} else {
p = new RemotePeerBean(s);
try {
MBeanRegistry.getInstance().register(p, jmxQuorumBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
}
}
}
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
jmxQuorumBean = null;
}
try {
/*
* Main loop
*/
//A pitfall I fell into: I assumed break here exits the while loop, but it only exits the switch. My basics were shaky; break exits not only loops but also switch statements.
//Test file: com.xq.test.TestWhile
while (running) {
switch (getPeerState()) {
case LOOKING:
LOG.info("LOOKING");
if (Boolean.getBoolean("readonlymode.enabled")) {//if read-only mode is enabled
LOG.info("Attempting to start ReadOnlyZooKeeperServer");
// Create read-only server but don't start it immediately
final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
logFactory, this,
new ZooKeeperServer.BasicDataTreeBuilder(),
this.zkDb);
// Instead of starting roZk immediately, wait some grace
// period before we decide we're partitioned.
//
// Thread is used here because otherwise it would require
// changes in each of election strategy classes which is
// unnecessary code coupling.
Thread roZkMgr = new Thread() {
public void run() {
try {
// lower-bound grace period to 2 secs
sleep(Math.max(2000, tickTime));
if (ServerState.LOOKING.equals(getPeerState())) {
roZk.startup();
}
} catch (InterruptedException e) {
LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
} catch (Exception e) {
LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
}
}
};
try {
roZkMgr.start();
setBCVote(null);
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
setPeerState(ServerState.LOOKING);
} finally {
// If the thread is in the the grace period, interrupt
// to come out of waiting.
roZkMgr.interrupt();
roZk.shutdown();
}
} else {//read-only mode is not enabled
try {
setBCVote(null);
//keep updating the vote until a leader is elected
setCurrentVote(makeLEStrategy().lookForLeader());
} catch (Exception e) {
LOG.warn("Unexpected exception", e);
setPeerState(ServerState.LOOKING);
}
}
break;
case OBSERVING:
try {
LOG.info("OBSERVING");
setObserver(makeObserver(logFactory));
observer.observeLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e );
} finally {
observer.shutdown();
setObserver(null);
setPeerState(ServerState.LOOKING);
}
break;
case FOLLOWING:
try {
LOG.info("FOLLOWING");
//produce an instance suited to the follower role
setFollower(makeFollower(logFactory));
follower.followLeader();
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
follower.shutdown();
setFollower(null);
setPeerState(ServerState.LOOKING);
}
break;
case LEADING:
LOG.info("LEADING");
try {
//produce an instance suited to the leader role
setLeader(makeLeader(logFactory));
//This only settles on an epoch that a majority accepts. Suppose a server holding a newer zxid was down
//and starts up exactly while the epoch is being settled: would this leader simply discard that extra data? From the code, it appears so.
leader.lead();
setLeader(null);
} catch (Exception e) {
LOG.warn("Unexpected exception",e);
} finally {
if (leader != null) {
leader.shutdown("Forcing shutdown");
setLeader(null);
}
setPeerState(ServerState.LOOKING);
}
break;
}
}
} finally {
LOG.warn("QuorumPeer main thread exited");
try {
MBeanRegistry.getInstance().unregisterAll();
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
jmxQuorumBean = null;
jmxLocalPeerBean = null;
}
}
The sheer length and the comments tell you this is core code: it is the real election loop, plus the role assignment once the election finishes. It is genuinely complex; I recommend reading it against the diagram below.
A few words of supplement to aid understanding:
We know the WorkerSender thread takes data from sendqueue. What we are chasing now is where the data is put in, and what that data is. A moment's thought tells you it must be your vote. Some grounding in multithreading helps here.
Now back to the WorkerSender thread; let's revisit this code:
public void run() {
while (!stop) {
try {
//What we take from here was enqueued for every participant. When was it enqueued?
// By the sendNotifications method.
ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
if(m == null) continue;
//process the message
process(m);
} catch (InterruptedException e) {
break;
}
}
LOG.info("WorkerSender is down");
}
Once a message is taken out, the heart of the work is process(m):
void process(ToSend m) {
ByteBuffer requestBuffer = buildMsg(m.state.ordinal(),
m.leader,
m.zxid,
m.electionEpoch,
m.peerEpoch);
//build the byte buffer and send the message
manager.toSend(m.sid, requestBuffer);
}
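buildMsg packs the vote into a fixed-layout ByteBuffer. A sketch based on the 3.4 branch (the field sizes matter: the first 28 bytes are what very old servers sent, which is where the 28-byte backward-compatibility check in WorkerReceiver below comes from):
static ByteBuffer buildMsg(int state, long leader, long zxid,
                           long electionEpoch, long epoch) {
    byte[] requestBytes = new byte[40];
    ByteBuffer requestBuffer = ByteBuffer.wrap(requestBytes);
    requestBuffer.clear();
    requestBuffer.putInt(state);          // 4 bytes: sender's ServerState
    requestBuffer.putLong(leader);        // 8 bytes: proposed leader's sid
    requestBuffer.putLong(zxid);          // 8 bytes: proposed leader's zxid
    requestBuffer.putLong(electionEpoch); // 8 bytes: sender's logical clock
    // everything above = 28 bytes, the old wire format
    requestBuffer.putLong(epoch);         // 8 bytes: proposed leader's peer epoch
    requestBuffer.putInt(Notification.CURRENTVERSION); // 4 bytes: version, added in 3.4.6
    return requestBuffer;
}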
The main question is how it gets sent, so go to manager.toSend(m.sid, requestBuffer):
public void toSend(Long sid, ByteBuffer b) {
/*
* If sending message to myself, then simply enqueue it (loopback).
*/
if (this.mySid == sid) {//if the message is addressed to ourselves
b.position(0);
//add it straight to recvQueue
addToRecvQueue(new Message(b.duplicate(), sid));
/*
* Otherwise send to the corresponding thread to send.
*/
} else {//addressed to another server
/*
* Start a new connection if doesn't have one already.
*/
ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
//putIfAbsent: if the key already maps to a value, return the existing value without replacing it; if not, add the key/value pair and return null
ArrayBlockingQueue<ByteBuffer> bqExisting = queueSendMap.putIfAbsent(sid, bq);
if (bqExisting != null) {//not null means a queue already exists for this peer, i.e. a vote has been queued before; add b to bqExisting
addToSendQueue(bqExisting, b);
} else {//null means nothing has been queued for this peer yet; add b to bq. Why add to different queues?
addToSendQueue(bq, b);
}
connectOne(sid);
}
}
There are details in this code I haven't fully worked out, but they don't affect the big picture.
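One of those details is worth spelling out: each per-peer queue in queueSendMap is tiny (SEND_CAPACITY is 1 in the 3.4 branch), and addToSendQueue evicts the old entry when the queue is full, so only our freshest vote to each peer is ever kept. Roughly:
private void addToSendQueue(ArrayBlockingQueue<ByteBuffer> queue, ByteBuffer buffer) {
    if (queue.remainingCapacity() == 0) {
        try {
            queue.remove(); // queue full: drop the stale vote, only the newest matters
        } catch (NoSuchElementException ne) {
            // the SendWorker may have drained it in the meantime; ignore
            LOG.debug("Trying to remove from an empty queue, ignoring exception " + ne);
        }
    }
    try {
        queue.add(buffer);
    } catch (IllegalStateException ie) {
        // this should never happen
        LOG.error("Unable to insert an element in the queue " + ie);
    }
}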
Following the diagram, the next thread should be SendWorker. When does it get created?
There are two cases.
1. Only this one server has started. With nobody to send messages to, the thread is not created yet; only when a second server comes up does it get created. The call chain is below. How is this implemented? By continually probing whether the other servers are reachable; the thread is created only once communication succeeds.
org.apache.zookeeper.server.quorum.QuorumPeer#run -> org.apache.zookeeper.server.quorum.QuorumPeer#setCurrentVote -> org.apache.zookeeper.server.quorum.QuorumCnxManager#connectAll -> org.apache.zookeeper.server.quorum.QuorumCnxManager#connectOne -> org.apache.zookeeper.server.quorum.QuorumCnxManager#initiateConnection -> org.apache.zookeeper.server.quorum.QuorumCnxManager#startConnection
2. This server is not the first to start, i.e. it can communicate with other servers right away.
This traces back to earlier: QuorumCnxManager.Listener listener = qcm.listener; (search this page for it). Once the listener is started, the SendWorker and RecvWorker threads are created inside it. How? Because the port is opened at startup, and another server is already sending data our way, so the threads are created reactively!
org.apache.zookeeper.server.quorum.QuorumCnxManager.Listener#run -> org.apache.zookeeper.server.quorum.QuorumCnxManager#receiveConnection -> org.apache.zookeeper.server.quorum.QuorumCnxManager#handleConnection
The RecvWorker thread is created at this same point.
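The "larger id connects to smaller id" rule from earlier is enforced right here. Abridged from QuorumCnxManager#handleConnection in the 3.4 branch:
// sid is the remote server's id, read off the freshly accepted socket.
if (sid < this.mySid) {
    // Remote id is smaller than ours: we must be the initiating side.
    // Drop this connection and dial out to that server instead.
    SendWorker sw = senderWorkerMap.get(sid);
    if (sw != null) {
        sw.finish();
    }
    closeSocket(sock);
    connectOne(sid);
} else {
    // Remote id is larger: accept the connection and spin up the
    // per-peer worker threads for this socket.
    SendWorker sw = new SendWorker(sock, sid);
    RecvWorker rw = new RecvWorker(sock, din, sid, sw);
    sw.setRecv(rw);
    sw.start();
    rw.start();
}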
WorkerReceiver is broadly similar; the details differ, but the overall flow should now be clear.
The key is being familiar with the voting flow diagram; the code is much easier to follow with it at hand.
For completeness, let's walk the receive path, starting from the WorkerReceiver we saw earlier.
Messenger(QuorumCnxManager manager) {
this.ws = new WorkerSender(manager);
Thread t = new Thread(this.ws,
"WorkerSender[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
this.wr = new WorkerReceiver(manager);
t = new Thread(this.wr,
"WorkerReceiver[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
}
This thread's run method is also quite complex, because by rights every newer vote we receive must be tallied. Tallied for what? Whether a majority has been reached, plus the handling of all the other cases.
The code is long, so let's first orient ourselves with the flow diagram.
The diagram shows that the WorkerReceiver thread takes data from recvQueue (in QuorumCnxManager) and puts it into recvqueue (in FastLeaderElection); in other words, WorkerReceiver acts as a relay. With the big picture in mind, on to the detailed code.
public void run() {
Message response;
while (!stop) {
// Sleeps on receive
try{
//take a message out of recvQueue
response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
if(response == null) continue;
/*
* If it is from an observer, respond right away.
* Note that the following predicate assumes that
* if a server is not a follower, then it must be
* an observer. If we ever have any other type of
* learner in the future, we'll have to change the
* way we check for observers.
*/
//Check whether the sender may vote; observers cannot, so first verify it is a participant.
if(!validVoter(response.sid)){//not a participant (e.g. an observer)
Vote current = self.getCurrentVote();
ToSend notmsg = new ToSend(ToSend.mType.notification,
current.getId(),
current.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
current.getPeerEpoch());
//reply to it directly with our current vote.
sendqueue.offer(notmsg);
} else {//it is a participant
// Receive new message
if (LOG.isDebugEnabled()) {
LOG.debug("Receive new notification message. My id = "
+ self.getId());
}
/*
* We check for 28 bytes for backward compatibility
*/
//backward-compatibility check
if (response.buffer.capacity() < 28) {
LOG.error("Got a short response: "
+ response.buffer.capacity());
continue;
}
boolean backCompatibility = (response.buffer.capacity() == 28);
//Why "clear"? clear() does not erase the data; it resets position/limit so the buffer can be read from the start.
response.buffer.clear();
// Instantiate Notification and set its attributes
Notification n = new Notification();
// State of peer that sent this message
QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
switch (response.buffer.getInt()) {
case 0:
ackstate = QuorumPeer.ServerState.LOOKING;
break;
case 1:
ackstate = QuorumPeer.ServerState.FOLLOWING;
break;
case 2:
ackstate = QuorumPeer.ServerState.LEADING;
break;
case 3:
ackstate = QuorumPeer.ServerState.OBSERVING;
break;
default:
continue;
}
n.leader = response.buffer.getLong();
n.zxid = response.buffer.getLong();
n.electionEpoch = response.buffer.getLong();
n.state = ackstate;
n.sid = response.sid;
if(!backCompatibility){
n.peerEpoch = response.buffer.getLong();
} else {
if(LOG.isInfoEnabled()){
LOG.info("Backward compatibility mode, server id=" + n.sid);
}
n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
}
/*
* Version added in 3.4.6
*/
n.version = (response.buffer.remaining() >= 4) ?
response.buffer.getInt() : 0x0;
/*
* Print notification info
*/
if(LOG.isInfoEnabled()){
printNotification(n);
}
/*
* If this server is looking, then send proposed leader
*/
if(self.getPeerState() == QuorumPeer.ServerState.LOOKING) {//if we ourselves are also looking for a leader
recvqueue.offer(n);//put it into recvqueue
/*
* Send a notification back if the peer that sent this
* message is also looking and its logical clock is
* lagging behind.
*/
//If the sender is also LOOKING and its logical clock lags behind ours, send it a notification.
if((ackstate == QuorumPeer.ServerState.LOOKING)
&& (n.electionEpoch < logicalclock.get())){
Vote v = getVote();
ToSend notmsg = new ToSend(ToSend.mType.notification,
v.getId(),
v.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
v.getPeerEpoch());
sendqueue.offer(notmsg);
}
} else {//if we are not looking for a leader, send the sender whom we believe the leader to be.
/*
* If this server is not looking, but the one that sent the ack
* is looking, then send back what it believes to be the leader.
*/
Vote current = self.getCurrentVote();
System.out.println("当前认可的sid:"+current.getId());
if(ackstate == QuorumPeer.ServerState.LOOKING){
if(LOG.isDebugEnabled()){
LOG.debug("Sending new notification. My id = " +
self.getId() + " recipient=" +
response.sid + " zxid=0x" +
Long.toHexString(current.getZxid()) +
" leader=" + current.getId());
}
ToSend notmsg;
if(n.version > 0x0) {
notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
current.getElectionEpoch(),
self.getPeerState(),
response.sid,
current.getPeerEpoch());
} else {
Vote bcVote = self.getBCVote();
notmsg = new ToSend(
ToSend.mType.notification,
bcVote.getId(),
bcVote.getZxid(),
bcVote.getElectionEpoch(),
self.getPeerState(),
response.sid,
bcVote.getPeerEpoch());
}
sendqueue.offer(notmsg);
}
}
}
} catch (InterruptedException e) {
System.out.println("Interrupted Exception while waiting for new message" +
e.toString());
}
}
LOG.info("WorkerReceiver is down");
}
}
A few key details of the code above:
1. How manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS) takes a message:
public Message pollRecvQueue(long timeout, TimeUnit unit)
throws InterruptedException {
return recvQueue.poll(timeout, unit);
}
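The producing side, addToRecvQueue, mirrors it: the queue is bounded, and when full the oldest message is discarded so fresh votes are never blocked (abridged from the 3.4 branch):
public void addToRecvQueue(Message msg) {
    synchronized (recvQLock) {
        if (recvQueue.remainingCapacity() == 0) {
            try {
                recvQueue.remove(); // full: discard the oldest message
            } catch (NoSuchElementException ne) {
                LOG.debug("Trying to remove from an empty recvQueue. Ignoring exception " + ne);
            }
        }
        try {
            recvQueue.add(msg);
        } catch (IllegalStateException ie) {
            // this should never happen
            LOG.error("Unable to insert element in the recvQueue " + ie);
        }
    }
}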
2. Putting the notification into recvqueue:
if(self.getPeerState() == QuorumPeer.ServerState.LOOKING) {//if we ourselves are also looking for a leader
recvqueue.offer(n);//put it into recvqueue
/*
* Send a notification back if the peer that sent this
* message is also looking and its logical clock is
* lagging behind.
*/
//If the sender is also LOOKING and its logical clock lags behind ours, send it a notification.
if((ackstate == QuorumPeer.ServerState.LOOKING)
&& (n.electionEpoch < logicalclock.get())){
Vote v = getVote();
ToSend notmsg = new ToSend(ToSend.mType.notification,
v.getId(),
v.getZxid(),
logicalclock.get(),
self.getPeerState(),
response.sid,
v.getPeerEpoch());
sendqueue.offer(notmsg);
}
}
As for the RecvWorker thread, debug it yourself; it is fairly simple.
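For completeness, a sketch of what RecvWorker#run does (abridged from the 3.4 branch): a blocking read loop that turns length-prefixed bytes from the socket into Messages and hands them to addToRecvQueue.
public void run() {
    threadCnt.incrementAndGet();
    try {
        while (running && !shutdown && sock != null) {
            int length = din.readInt();          // length prefix
            if (length <= 0 || length > PACKETMAXSIZE) {
                throw new IOException("Received packet with invalid packet: " + length);
            }
            byte[] msgArray = new byte[length];
            din.readFully(msgArray, 0, length);  // read the full payload
            ByteBuffer message = ByteBuffer.wrap(msgArray);
            addToRecvQueue(new Message(message.duplicate(), sid)); // hand off to WorkerReceiver
        }
    } catch (Exception e) {
        LOG.warn("Connection broken for id " + sid, e);
    } finally {
        sw.finish();
        if (sock != null) {
            closeSocket(sock);
        }
    }
}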
Finally, let's talk about the verification, which is also quite involved. On to the code.
When is org.apache.zookeeper.server.quorum.FastLeaderElection#lookForLeader called? Inside the org.apache.zookeeper.server.quorum.QuorumPeer#run thread. This method describes the whole leader-search flow.
//leader election
public Vote lookForLeader() throws InterruptedException {
try {
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = Time.currentElapsedTime();
}
try {
//initialize a map for received votes
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = finalizeWait;
synchronized(this){
logicalclock.incrementAndGet();
//update the vote; at this stage we keep voting for ourselves
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
//send our vote to the other servers
sendNotifications();
/*
* Loop in which we exchange notifications until we find a leader
*/
//keep collecting other servers' votes until a leader is elected
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)) {
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
/*
* Sends more notifications if haven't received enough.
* Otherwise processes new notification.
*/
if(n == null){//null means no message was retrieved.
if(manager.haveDelivered()){
sendNotifications();
} else {
manager.connectAll();
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
}
else if(validVoter(n.sid) && validVoter(n.leader)) {
/*
* Only proceed if the vote comes from a replica in the
* voting view for a replica in the voting view.
*/
//process the vote
switch (n.state) {
case LOOKING:
// If notification > current, replace and send messages out
//If the sender's election epoch is greater than our logical clock, its election round is newer and carries more weight,
//so update our epoch and clear the votes we have collected (a Tang-dynasty imperial sword cannot be wielded in the Qing dynasty; same idea).
//Then cast our vote within that newer round.
if (n.electionEpoch > logicalclock.get()) {
logicalclock.set(n.electionEpoch);
recvset.clear();
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
//notify the other servers
sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {//smaller than ours means the sender's round is stale; simply ignore this vote.
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
}
break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
//At this point the epochs must be equal, so compare zxids: whoever has the newer zxid gets the vote.
//If even the zxids are equal, compare sids (server ids), which always differ, so the vote is decided.
//Comparing ourselves against ourselves would be meaningless, though, so that case does not enter here.
updateProposal(n.leader, n.zxid, n.peerEpoch);
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
//Check whether voting can end, i.e. whether a majority of servers have voted for the same candidate.
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock.get(), proposedEpoch))) {
// Verify if there is any change in the proposed leader
//verify whether the majority-backed proposal has changed
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
//The while above keeps draining recvqueue until it is empty (any drained vote must beat the current proposal to break out), so n here is normally null.
//Then each server takes on its role.
if (n == null) {
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid,
logicalclock.get(),
proposedEpoch);
leaveInstance(endVote);
return endVote;
}
}
break;
case OBSERVING:
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
/*
* Consider all notifications from the same epoch
* together.
*/
if(n.electionEpoch == logicalclock.get()){
recvset.put(n.sid, new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch));
if(ooePredicate(recvset, outofelection, n)) {
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
}
/*
* Before joining an established ensemble, verify
* a majority is following the same leader.
*/
outofelection.put(n.sid, new Vote(n.version,
n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch,
n.state));
if(ooePredicate(outofelection, outofelection, n)) {
synchronized(this){
logicalclock.set(n.electionEpoch);
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
return endVote;
}
break;
default:
LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
n.state, n.sid);
break;
}
} else {
if (!validVoter(n.leader)) {
LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
}
if (!validVoter(n.sid)) {
LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
}
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
LOG.debug("Number of connection processing threads: {}",
manager.getConnectionThreadCount());
}
}
Now look at the totalOrderPredicate method used in the code above. Why call it out? Because it appears again and again, so it matters: it decides whether another server's proposal beats your own. The logic: compare epoch, then zxid, then myid, in that order, until one comparison produces a winner.
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
if(self.getQuorumVerifier().getWeight(newId) == 0){
return false;
}
/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
* as current zxid, but server id is higher.
*/
return ((newEpoch > curEpoch) ||
((newEpoch == curEpoch) &&
((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
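A quick sanity check with made-up values (the numbers are hypothetical, just to exercise the predicate):
public class TotalOrderExample {
    public static void main(String[] args) {
        long curEpoch = 1, curZxid = 0x100000003L, curId = 1; // our current proposal
        long newEpoch = 1, newZxid = 0x100000005L, newId = 3; // incoming vote from server 3
        // Same expression as totalOrderPredicate's return statement:
        boolean switchVote = (newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId))));
        // Epochs tie, but the new zxid is higher, so we adopt server 3's proposal.
        System.out.println(switchVote); // true
    }
}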
So where is the end of voting decided? In org.apache.zookeeper.server.quorum.FastLeaderElection#termPredicate:
protected boolean termPredicate(
HashMap<Long, Vote> votes,
Vote vote) {
HashSet<Long> set = new HashSet<Long>();
/*
* First make the views consistent. Sometimes peers will have
* different zxids for a server depending on timing.
*/
for (Map.Entry<Long,Vote> entry : votes.entrySet()) {
if (vote.equals(entry.getValue())){
set.add(entry.getKey());
}
}
//containsQuorum: its concrete implementation is shown right below
return self.getQuorumVerifier().containsQuorum(set);
}
public boolean containsQuorum(Set<Long> set){
return (set.size() > half);
}
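A quick check of the majority math (my own illustration, not ZooKeeper code): the verifier stores half = n / 2, so containsQuorum demands strictly more than half.
public class QuorumMathExample {
    public static void main(String[] args) {
        int n = 5;        // voting members in the ensemble
        int half = n / 2; // integer division: 2
        // The election can only end once set.size() > half,
        // i.e. at least 3 of 5 servers agree on the same vote.
        System.out.println(3 > half); // true  -> quorum
        System.out.println(2 > half); // false -> not a quorum
    }
}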
After that, each server settles into its role. Notably, a server does not assign roles the instant a majority appears; it makes a rigorous final check.
Once it determines a leader has emerged, it re-checks whether the result still holds, and it only assigns roles after draining everything left in recvqueue and confirming the outcome. An empty recvqueue means one thing: the other servers have stopped sending votes, which implies they too have settled on a leader. At that point the leader can be confirmed conclusively.
You might ask: what if different servers elect different leaders?
That cannot happen. The majority check means more than half of the servers chose the same candidate; for two servers to elect different leaders, there would have to be a second, disjoint majority, which is impossible (barring tampered data and the like; nothing is absolute). Moreover, roles are set only once recvqueue has drained, which essentially means all servers have confirmed the leader. So the elected leader is always one and the same!
The post-election flow comes in a later post; this one is long enough. Coming up: what each role does after assignment, how a request flows through the system, how consistency is guaranteed, and so on.