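Continuing from the previous article's analysis of the startup flow, this article looks at how ZooKeeper (zk) runs leader election.
As the previous article showed, the startup flow touches election in two places:
- The start method of org.apache.zookeeper.server.quorum.QuorumPeer calls startLeaderElection(), which creates the objects that election needs.
- Inside the infinite loop of the run method of org.apache.zookeeper.server.quorum.QuorumPeer, when the current state is LOOKING, the lookForLeader() method of the election algorithm (default: FastLeaderElection) is called to trigger an election.
1. First, let's look at the startLeaderElection() method
Class: org.apache.zookeeper.server.quorum.QuorumPeer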
synchronized public void startLeaderElection() {
try {
currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
} catch(IOException e) {
RuntimeException re = new RuntimeException(e.getMessage());
re.setStackTrace(e.getStackTrace());
throw re;
}
for (QuorumServer p : getView().values()) {
if (p.id == myid) {
myQuorumAddr = p.addr;
break;
}
}
if (myQuorumAddr == null) {
throw new RuntimeException("My id " + myid + " not in the peer list");
}
// If electionType == 0, create a UDP socket for transport
if (electionType == 0) {
try {
udpSocket = new DatagramSocket(myQuorumAddr.getPort());
responder = new ResponderThread();
responder.start();
} catch (SocketException e) {
throw new RuntimeException(e);
}
}
// Create the election algorithm that matches electionType
this.electionAlg = createElectionAlgorithm(electionType);
}
protected Election createElectionAlgorithm(int electionAlgorithm){
Election le=null;
//TODO: use a factory rather than a switch
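// Create the election algorithm selected by electionAlgorithm. The default is 3 when nothing is configured, and the other two algorithms are deprecated, so only case 3 is covered here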
switch (electionAlgorithm) {
case 0:
le = new LeaderElection(this);
break;
case 1:
le = new AuthFastLeaderElection(this);
break;
case 2:
le = new AuthFastLeaderElection(this, true);
break;
case 3:
// First create the election connection manager
qcm = new QuorumCnxManager(this);
QuorumCnxManager.Listener listener = qcm.listener;
if(listener != null){
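// Start the listener. It accepts election votes from other servers and hands the votes received by the communication layer up to the election logic layer
listener.start();
// Create the election algorithm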
le = new FastLeaderElection(this, qcm);
} else {
LOG.error("Null listener when initializing cnx manager");
}
break;
default:
assert false;
}
return le;
}
Creating the election connection manager
Class: org.apache.zookeeper.server.quorum.QuorumCnxManager
public QuorumCnxManager(QuorumPeer self) {
// Queue holding the votes that the communication layer has received from other servers
this.recvQueue = new ArrayBlockingQueue<Message>(RECV_CAPACITY);
// key: server id, value: queue of votes waiting to be sent to that server
this.queueSendMap = new ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>>();
// key: server id, value: the SendWorker (wraps the Socket) for that server
this.senderWorkerMap = new ConcurrentHashMap<Long, SendWorker>();
// key: server id, value: the last message sent to that server
this.lastMessageSent = new ConcurrentHashMap<Long, ByteBuffer>();
String cnxToValue = System.getProperty("zookeeper.cnxTimeout");
if(cnxToValue != null){
this.cnxTO = new Integer(cnxToValue);
}
this.self = self;
// Starts listener thread that waits for connection requests
// Create a listener. It opens a ServerSocket on the election port and accepts connections and votes from other servers
listener = new Listener();
}
Creating FastLeaderElection
Class: org.apache.zookeeper.server.quorum.FastLeaderElection
public FastLeaderElection(QuorumPeer self, QuorumCnxManager manager){
this.stop = false;
this.manager = manager;
starter(self, manager);
}
private void starter(QuorumPeer self, QuorumCnxManager manager) {
this.self = self;
proposedLeader = -1;
proposedZxid = -1;
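// Outbound vote queue of the election logic layer
sendqueue = new LinkedBlockingQueue<ToSend>();
// Inbound vote queue of the election logic layer
recvqueue = new LinkedBlockingQueue<Notification>();
// Create the Messenger, which starts the two threads that service the queues above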
this.messenger = new Messenger(manager);
}
Messenger(QuorumCnxManager manager) {
/*
* Create the WorkerSender thread. It watches sendqueue, takes the votes
* produced by the logic layer and passes them to the communication layer
*/
this.ws = new WorkerSender(manager);
Thread t = new Thread(this.ws,
"WorkerSender[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
/*
* Create the WorkerReceiver thread. It takes the votes that the communication
* layer received from other servers and puts them into recvqueue
*/
this.wr = new WorkerReceiver(manager);
t = new Thread(this.wr,
"WorkerReceiver[myid=" + self.getId() + "]");
t.setDaemon(true);
t.start();
}
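From the object creation above we can see that election is split into two layers: a logic layer (which processes the votes received by the communication layer) and a communication layer (mostly socket operations).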
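The handoff between the two layers can be pictured with a minimal sketch. Everything in it is hypothetical and simplified: `TwoLayerSketch`, `OutboundVote`, and the queue capacity of 16 are illustration-only names and values, and the drain step runs synchronously here instead of on a dedicated WorkerSender thread.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of the logic-layer to communication-layer handoff. */
final class TwoLayerSketch {
    /** A vote addressed to one server; a simplified stand-in for ToSend. */
    static final class OutboundVote {
        final long targetSid;
        final long proposedLeader;

        OutboundVote(long targetSid, long proposedLeader) {
            this.targetSid = targetSid;
            this.proposedLeader = proposedLeader;
        }
    }

    // Logic layer: one outbound queue shared by all destinations (plays the role of sendqueue).
    final BlockingQueue<OutboundVote> sendqueue = new LinkedBlockingQueue<>();
    // Communication layer: one bounded queue per destination sid (plays the role of queueSendMap).
    final Map<Long, BlockingQueue<Long>> queueSendMap = new ConcurrentHashMap<>();

    /** Logic layer side: enqueue one copy of the vote per destination. */
    void broadcast(long proposedLeader, long... targetSids) {
        for (long sid : targetSids) {
            sendqueue.offer(new OutboundVote(sid, proposedLeader));
        }
    }

    /** What a WorkerSender-like thread would do in its loop, run synchronously here. */
    int drainToCommunicationLayer() {
        int moved = 0;
        OutboundVote v;
        while ((v = sendqueue.poll()) != null) {
            // Capacity 16 is an arbitrary choice for this sketch.
            BlockingQueue<Long> perPeer = queueSendMap.computeIfAbsent(
                    v.targetSid, sid -> new ArrayBlockingQueue<>(16));
            if (perPeer.offer(v.proposedLeader)) {
                moved++;
            }
        }
        return moved;
    }
}
```

Calling `broadcast(3L, 1L, 2L)` and then `drainToCommunicationLayer()` leaves one entry in the per-peer queue of server 1 and one in that of server 2, which is the shape of the pipeline the rest of the article traces through the real classes.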
2. Next, let's look at the election logic in FastLeaderElection.lookForLeader()
public Vote lookForLeader() throws InterruptedException {
try {
self.jmxLeaderElectionBean = new LeaderElectionBean();
MBeanRegistry.getInstance().register(
self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
} catch (Exception e) {
LOG.warn("Failed to register with JMX", e);
self.jmxLeaderElectionBean = null;
}
if (self.start_fle == 0) {
self.start_fle = System.currentTimeMillis();
}
try {
/*
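* Create the ballot box for the current election round
*/
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
/*
* Create a second ballot box, separate from recvset.
* It holds the votes from servers reporting that the cluster already has a Leader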
*/
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
int notTimeout = finalizeWait;
synchronized(this){
// Increment the local election round
logicalclock++;
// Vote for itself
updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
}
LOG.info("New election. My id = " + self.getId() +
", proposed zxid=0x" + Long.toHexString(proposedZxid));
// Broadcast the vote
sendNotifications();
/*
* Loop in which we exchange notifications until we find a leader
*/
// Keep electing while this server is LOOKING and stop is false
while ((self.getPeerState() == ServerState.LOOKING) &&
(!stop)){
/*
* Remove next notification from queue, times out after 2 times
* the termination time
*/
// Wait for a vote handed up by the communication layer
Notification n = recvqueue.poll(notTimeout,
TimeUnit.MILLISECONDS);
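// No vote arrived within the timeout
if(n == null){
// Check whether any send queue in the communication layer is empty
if(manager.haveDelivered()){
/*
* At least one queue is empty, so messages are getting through:
* broadcast the vote again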
*/
sendNotifications();
} else {
/*
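* No queue is empty, so nothing has been delivered: either the
* connections to the other servers failed or those servers have
* not started yet. Retry the connections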
*/
manager.connectAll();
}
/*
* Exponential backoff
*/
int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?
tmpTimeOut : maxNotificationInterval);
LOG.info("Notification time out: " + notTimeout);
}
// Check that the sender has voting rights; an OBSERVER does not vote
else if(self.getVotingView().containsKey(n.sid)) {
// Handle the vote according to the sender's state
switch (n.state) {
case LOOKING:
// If notification > current, replace and send messages out
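// The received vote has a higher election round than this server
if (n.electionEpoch > logicalclock) {
// Adopt the election round of the received vote
logicalclock = n.electionEpoch;
// Empty the ballot box
recvset.clear();
// Compare the received vote with this server's own initial vote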
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
// The received vote wins: adopt it as the current proposal
updateProposal(n.leader, n.zxid, n.peerEpoch);
} else {
// The received vote loses: propose this server itself
updateProposal(getInitId(),
getInitLastLoggedZxid(),
getPeerEpoch());
}
// Broadcast the vote
sendNotifications();
// The received vote has a lower election round than this server
} else if (n.electionEpoch < logicalclock) {
// Discard it and do nothing else
if(LOG.isDebugEnabled()){
LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
+ Long.toHexString(n.electionEpoch)
+ ", logicalclock=0x" + Long.toHexString(logicalclock));
}
break;
// The received vote has the same election round as this server,
// so compare it with the current proposal
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)) {
// The received vote wins: update the current proposal
updateProposal(n.leader, n.zxid, n.peerEpoch);
// Broadcast the vote
sendNotifications();
}
if(LOG.isDebugEnabled()){
LOG.debug("Adding vote: from=" + n.sid +
", proposed leader=" + n.leader +
", proposed zxid=0x" + Long.toHexString(n.zxid) +
", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
}
// Put the received vote into the ballot box
recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
// Check whether more than half of the voters back the current proposal
if (termPredicate(recvset,
new Vote(proposedLeader, proposedZxid,
logicalclock, proposedEpoch))) {
// Verify if there is any change in the proposed leader
/*
* Keep draining votes from the communication layer until the poll
* times out, because a late vote could still change the result
*/
while((n = recvqueue.poll(finalizeWait,
TimeUnit.MILLISECONDS)) != null){
// A new vote arrived: compare it with the current proposal
if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
proposedLeader, proposedZxid, proposedEpoch)){
// It wins, so the result may change: put it back on the queue and leave this inner loop
recvqueue.put(n);
break;
}
}
/*
* This predicate is true once we don't read any new
* relevant message from the reception queue
*/
/*
* n == null means the poll above timed out without a better vote
*/
if (n == null) {
/*
* No better vote arrived, so the result is final.
* Set this server's state from the winning vote:
* LEADING, FOLLOWING or OBSERVING
*/
self.setPeerState((proposedLeader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(proposedLeader,
proposedZxid,
logicalclock,
proposedEpoch);
leaveInstance(endVote);
// Return the winning vote
return endVote;
}
}
/*
* If n is not null, the result may have changed: leave the switch
* and re-evaluate in the next iteration of the outer loop
*/
break;
case OBSERVING:
// A vote from an OBSERVING server is discarded
LOG.debug("Notification from observer: " + n.sid);
break;
case FOLLOWING:
case LEADING:
// The vote comes from a FOLLOWING or LEADING server
/*
* Consider all notifications from the same epoch
* together.
*/
// Check whether its election round equals the local one
if(n.electionEpoch == logicalclock){
// Put it into the ballot box
recvset.put(n.sid, new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch));
// Check whether more than half of the voters follow the same Leader
if(ooePredicate(recvset, outofelection, n)) {
// If so, set this server's state from the winning vote
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
// Return the final vote
return endVote;
}
}
/*
* Before joining an established ensemble, verify
* a majority is following the same leader.
*/
/*
* Put the vote into outofelection so that it can be checked
* whether a majority follows the same Leader
*/
outofelection.put(n.sid, new Vote(n.version,
n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch,
n.state));
/*
* Check whether more than half of the voters follow the same Leader
*/
if(ooePredicate(outofelection, outofelection, n)) {
// If so, the cluster already has a Leader
synchronized(this){
// Adopt its election round and set this server's state from the final vote
logicalclock = n.electionEpoch;
self.setPeerState((n.leader == self.getId()) ?
ServerState.LEADING: learningState());
}
Vote endVote = new Vote(n.leader,
n.zxid,
n.electionEpoch,
n.peerEpoch);
leaveInstance(endVote);
// Return the final vote
return endVote;
}
break;
default:
LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
n.state, n.sid);
break;
}
} else {
LOG.warn("Ignoring notification from non-cluster member " + n.sid);
}
}
return null;
} finally {
try {
if(self.jmxLeaderElectionBean != null){
MBeanRegistry.getInstance().unregister(
self.jmxLeaderElectionBean);
}
} catch (Exception e) {
LOG.warn("Failed to unregister with JMX", e);
}
self.jmxLeaderElectionBean = null;
}
}
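Two small pieces of the loop above are easy to isolate: the majority check that termPredicate performs on the ballot box, and the doubling of the poll timeout when no vote arrives. The sketch below is a simplified stand-in, not the ZooKeeper implementation. It assumes a plain majority quorum (no weights or hierarchical groups), and `ElectionLoopSketch`, `SimpleVote`, `hasMajority` and `nextTimeout` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the majority check and the timeout backoff. */
final class ElectionLoopSketch {
    /** Simplified vote holding only the fields that are compared. */
    static final class SimpleVote {
        final long leader;
        final long zxid;
        final long electionEpoch;
        final long peerEpoch;

        SimpleVote(long leader, long zxid, long electionEpoch, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.electionEpoch = electionEpoch;
            this.peerEpoch = peerEpoch;
        }

        boolean sameAs(SimpleVote o) {
            return leader == o.leader && zxid == o.zxid
                    && electionEpoch == o.electionEpoch && peerEpoch == o.peerEpoch;
        }
    }

    /** True when strictly more than half of the voting members sent a vote equal to the proposal. */
    static boolean hasMajority(Map<Long, SimpleVote> recvset, SimpleVote proposal, int votingMembers) {
        int supporters = 0;
        for (SimpleVote v : recvset.values()) {
            if (v.sameAs(proposal)) {
                supporters++;
            }
        }
        return supporters > votingMembers / 2;
    }

    /** Doubles the timeout and caps it at max, like the "Exponential backoff" step in the loop. */
    static int nextTimeout(int current, int max) {
        long doubled = 2L * current; // long arithmetic avoids int overflow
        return (int) Math.min(doubled, (long) max);
    }

    /** Small helper for building a ballot box in examples. */
    static Map<Long, SimpleVote> ballotBox() {
        return new HashMap<>();
    }
}
```

In a three-member ensemble, two matching votes in the ballot box are enough for `hasMajority` to return true, while one is not. With an assumed starting timeout of 200 ms and an assumed cap of 60000 ms, `nextTimeout` yields 400, 800, 1600 and so on until it settles at the cap.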
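The analysis above surfaces several important operations:
- Comparing votes
- Broadcasting votes
- Receiving votes from the election communication layer
The sid mentioned below is simply the myid configured on each server.
1. Comparing votes
Class: org.apache.zookeeper.server.quorum.FastLeaderElection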
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
if(self.getQuorumVerifier().getWeight(newId) == 0){
return false;
}
/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
* as current zxid, but server id is higher.
*/
/*
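* 1. If the incoming vote's peer epoch is higher than the current one, the incoming vote wins
* 2. If the epochs are equal and the incoming zxid is higher, the incoming vote wins
* 3. If the epochs and the zxids are both equal, compare the sids (server ids); the incoming vote wins if its sid is higher
* Whenever the incoming vote wins, this server updates its own vote and broadcasts it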
*/
return ((newEpoch > curEpoch) ||
((newEpoch == curEpoch) &&
((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
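The ordering is a lexicographic comparison on (epoch, zxid, sid). A standalone sketch makes the three cases easy to try out. It deliberately leaves out the weight check done through getQuorumVerifier(), and `VoteOrderSketch` and `wins` are hypothetical names.

```java
/** Hypothetical, standalone version of the vote ordering; the weight check is omitted. */
final class VoteOrderSketch {
    /** Returns true when the new vote beats the current one. */
    static boolean wins(long newId, long newZxid, long newEpoch,
                        long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // case 1: higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // case 2: same epoch, higher zxid wins
        }
        return newId > curId;           // case 3: same epoch and zxid, higher sid wins
    }
}
```

For example, `wins(1, 0x100, 2, 3, 0x200, 1)` is true because the higher epoch outranks both the zxid and the sid, while two identical votes never beat each other, so `wins(1, 0x100, 1, 1, 0x100, 1)` is false.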
2. Broadcasting votes
Class: org.apache.zookeeper.server.quorum.FastLeaderElection
private void sendNotifications() {
// Iterate over every server with voting rights, that is, every non-Observer
for (QuorumServer server : self.getVotingView().values()) {
long sid = server.id;
// Build the vote message to send
ToSend notmsg = new ToSend(ToSend.mType.notification,
proposedLeader,
proposedZxid,
logicalclock,
QuorumPeer.ServerState.LOOKING,
sid,
proposedEpoch);
if(LOG.isDebugEnabled()){
LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x" +
Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock) +
" (n.round), " + sid + " (recipient), " + self.getId() +
" (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
}
// Put the vote on the outbound queue of the election logic layer
sendqueue.offer(notmsg);
}
}
Class: org.apache.zookeeper.server.quorum.FastLeaderElection.Messenger.WorkerSender
class WorkerSender implements Runnable {
volatile boolean stop;
QuorumCnxManager manager;
WorkerSender(QuorumCnxManager manager){
this.stop = false;
this.manager = manager;
}
public void run() {
while (!stop) {
try {
// Take a message from sendqueue
ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
// Nothing arrived within the timeout: keep polling
if(m == null) continue;
// Otherwise hand the message to process()
process(m);
} catch (InterruptedException e) {
break;
}
}
LOG.info("WorkerSender is down");
}
void process(ToSend m) {
// Serialize the message into a ByteBuffer for sending
ByteBuffer requestBuffer = buildMsg(m.state.ordinal(),
m.leader,
m.zxid,
m.electionEpoch,
m.peerEpoch);
// Send the vote through the connection manager
manager.toSend(m.sid, requestBuffer);
}
}
Class: org.apache.zookeeper.server.quorum.QuorumCnxManager
public void toSend(Long sid, ByteBuffer b) {
// Check whether the vote is addressed to this server itself
if (self.getId() == sid) {
// If so, put it straight into the communication layer's receive queue (recvQueue)
b.position(0);
addToRecvQueue(new Message(b.duplicate(), sid));
/*
* Otherwise send to the corresponding thread to send.
*/
} else {
/*
* Check whether a send queue already exists for this sid;
* each server has its own send queue
*/
if (!queueSendMap.containsKey(sid)) {
// If not, create a bounded send queue
ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(
SEND_CAPACITY);
// Register the queue in queueSendMap. key: sid, value: send queue
queueSendMap.put(sid, bq);
// Put the ByteBuffer to send on the queue
addToSendQueue(bq, b);
} else {
// A send queue already exists, so take it from the map
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if(bq != null){
// Put the ByteBuffer to send on the queue
addToSendQueue(bq, b);
} else {
LOG.error("No queue for server " + sid);
}
}
// Connect to the server identified by sid
connectOne(sid);
}
}
synchronized void connectOne(long sid){
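// Look up the SendWorker (the vote transport object, which wraps a Socket)
if (senderWorkerMap.get(sid) == null){
// None exists yet, so a connection has to be created
// First find the election address of server sid
InetSocketAddress electionAddr;
// Check whether this sid appears in the configuration
if (self.quorumPeers.containsKey(sid)) {
// If it does, read its election address and port from the map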
electionAddr = self.quorumPeers.get(sid).electionAddr;
} else {
// Not configured: return immediately
LOG.warn("Invalid server id: " + sid);
return;
}
try {
if (LOG.isDebugEnabled()) {
LOG.debug("Opening channel to server " + sid);
}
// Create a socket and connect to the server
Socket sock = new Socket();
setSockOpts(sock);
sock.connect(self.getView().get(sid).electionAddr, cnxTO);
if (LOG.isDebugEnabled()) {
LOG.debug("Connected to server " + sid);
}
// Run the connection handshake
initiateConnection(sock, sid);
} catch (UnresolvedAddressException e) {
// Sun doesn't include the address that causes this
// exception to be thrown, also UAE cannot be wrapped cleanly
// so we log the exception in order to capture this critical
// detail.
LOG.warn("Cannot open channel to " + sid
+ " at election address " + electionAddr, e);
throw e;
} catch (IOException e) {
LOG.warn("Cannot open channel to " + sid
+ " at election address " + electionAddr,
e);
}
} else {
LOG.debug("There is a connection already for server " + sid);
}
}
public boolean initiateConnection(Socket sock, Long sid) {
DataOutputStream dout = null;
try {
// Sending id and challenge
// Get the socket's output stream
dout = new DataOutputStream(sock.getOutputStream());
// Right after connecting, send this server's own sid to the peer
dout.writeLong(self.getId());
dout.flush();
} catch (IOException e) {
LOG.warn("Ignoring exception reading or writing challenge: ", e);
closeSocket(sock);
return false;
}
// If lost the challenge, then drop the new connection
/*
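* If the peer's sid is larger than this server's sid, close the socket and wait for
* the peer to connect instead: zk election connections always run from the larger sid to the smaller sid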
*/
if (sid > self.getId()) {
LOG.info("Have smaller server identifier, so dropping the " +
"connection: (" + sid + ", " + self.getId() + ")");
// Close the connection
closeSocket(sock);
// Otherwise proceed with the connection
} else {
/*
* Create the SendWorker (sending) thread
* Create the RecvWorker (receiving) thread
*/
SendWorker sw = new SendWorker(sock, sid);
RecvWorker rw = new RecvWorker(sock, sid, sw);
sw.setRecv(rw);
SendWorker vsw = senderWorkerMap.get(sid);
if(vsw != null)
vsw.finish();
// Register the SendWorker in senderWorkerMap
senderWorkerMap.put(sid, sw);
// Check whether queueSendMap already has a send queue for this sid
if (!queueSendMap.containsKey(sid)) {
// Create one if it does not
queueSendMap.put(sid, new ArrayBlockingQueue<ByteBuffer>(
SEND_CAPACITY));
}
// Start the threads
sw.start();
rw.start();
return true;
}
return false;
}
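As noted above, when the destination sid is larger than this server's own sid, the server sends its sid, closes the connection itself, and waits for the server with the larger sid to connect. Let's look at the receiving side of that logic.
Class: org.apache.zookeeper.server.quorum.QuorumCnxManager.Listener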
public void run() {
int numRetries = 0;
InetSocketAddress addr;
while((!shutdown) && (numRetries < 3)){
try {
// Listen on the election port
ss = new ServerSocket();
ss.setReuseAddress(true);
if (self.getQuorumListenOnAllIPs()) {
int port = self.quorumPeers.get(self.getId()).electionAddr.getPort();
addr = new InetSocketAddress(port);
} else {
addr = self.quorumPeers.get(self.getId()).electionAddr;
}
LOG.info("My election bind port: " + addr.toString());
setName(self.quorumPeers.get(self.getId()).electionAddr
.toString());
ss.bind(addr);
while (!shutdown) {
// Accept a socket connection
Socket client = ss.accept();
setSockOpts(client);
LOG.info("Received connection request "
+ client.getRemoteSocketAddress());
// Process the accepted connection
receiveConnection(client);
numRetries = 0;
}
} catch (IOException e) {
LOG.error("Exception while listening", e);
numRetries++;
try {
ss.close();
Thread.sleep(1000);
} catch (IOException ie) {
LOG.error("Error closing server socket", ie);
} catch (InterruptedException ie) {
LOG.error("Interrupted while sleeping. " +
"Ignoring exception", ie);
}
}
}
LOG.info("Leaving listener");
if (!shutdown) {
LOG.error("As I'm leaving the listener thread, "
+ "I won't be able to participate in leader "
+ "election any longer: "
+ self.quorumPeers.get(self.getId()).electionAddr);
}
}
public boolean receiveConnection(Socket sock) {
Long sid = null;
try {
// Read server id
DataInputStream din = new DataInputStream(sock.getInputStream());
// Read the sid sent by the peer
sid = din.readLong();
if (sid < 0) { // this is not a server id but a protocol version (see ZOOKEEPER-1633)
sid = din.readLong();
// next comes the #bytes in the remainder of the message
int num_remaining_bytes = din.readInt();
byte[] b = new byte[num_remaining_bytes];
// remove the remainder of the message from din
int num_read = din.read(b);
if (num_read != num_remaining_bytes) {
LOG.error("Read only " + num_read + " bytes out of " + num_remaining_bytes + " sent by server " + sid);
}
}
if (sid == QuorumPeer.OBSERVER_ID) {
/*
* Choose identifier at random. We need a value to identify
* the connection.
*/
sid = observerCounter--;
LOG.info("Setting arbitrary identifier to observer: " + sid);
}
} catch (IOException e) {
closeSocket(sock);
LOG.warn("Exception reading or writing challenge: " + e.toString());
return false;
}
//If wins the challenge, then close the new connection.
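// Same rule as described earlier: if the received sid is smaller than this server's id, close this connection and connect to that server from this side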
if (sid < self.getId()) {
/*
* This replica might still believe that the connection to sid is
* up, so we have to shut down the workers before trying to open a
* new connection.
*/
SendWorker sw = senderWorkerMap.get(sid);
if (sw != null) {
sw.finish();
}
/*
* Now we start a new connection
*/
LOG.debug("Create new connection to server: " + sid);
closeSocket(sock);
connectOne(sid);
// Otherwise start worker threads to receive data.
} else {
SendWorker sw = new SendWorker(sock, sid);
RecvWorker rw = new RecvWorker(sock, sid, sw);
sw.setRecv(rw);
SendWorker vsw = senderWorkerMap.get(sid);
if(vsw != null)
vsw.finish();
senderWorkerMap.put(sid, sw);
if (!queueSendMap.containsKey(sid)) {
queueSendMap.put(sid, new ArrayBlockingQueue<ByteBuffer>(
SEND_CAPACITY));
}
sw.start();
rw.start();
return true;
}
return false;
}
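Taken together, initiateConnection and receiveConnection guarantee that each pair of servers ends up with a single connection, opened by the server with the larger sid. The decision on each side can be condensed into the sketch below. `ConnectionRuleSketch`, `Action`, `afterDialing` and `afterAccepting` are hypothetical names, and the sketch only models the decision, not the sockets or worker threads.

```java
/** Hypothetical sketch of the "larger sid dials smaller sid" rule. */
final class ConnectionRuleSketch {
    enum Action {
        KEEP,           // keep the socket and start the send/receive workers
        CLOSE_AND_WAIT, // close the socket and wait for the peer to dial in
        CLOSE_AND_DIAL  // close the socket and dial the peer from this side
    }

    /** Dialing side, after sending its own sid (mirrors the check in initiateConnection). */
    static Action afterDialing(long myId, long peerSid) {
        return peerSid > myId ? Action.CLOSE_AND_WAIT : Action.KEEP;
    }

    /** Accepting side, after reading the peer's sid (mirrors the check in receiveConnection). */
    static Action afterAccepting(long myId, long peerSid) {
        return peerSid < myId ? Action.CLOSE_AND_DIAL : Action.KEEP;
    }
}
```

Suppose server 1 dials server 3. Server 1 gets `CLOSE_AND_WAIT`, server 3 gets `CLOSE_AND_DIAL`, and when server 3 then dials server 1 both sides get `KEEP`. The short-lived first connection still serves a purpose: it tells the larger server that the smaller one is up and wants to talk.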
Class: org.apache.zookeeper.server.quorum.QuorumCnxManager.SendWorker
@Override
public void run() {
threadCnt.incrementAndGet();
try {
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
if (bq == null || isSendQueueEmpty(bq)) {
ByteBuffer b = lastMessageSent.get(sid);
if (b != null) {
LOG.debug("Attempting to send lastMessage to sid=" + sid);
send(b);
}
}
} catch (IOException e) {
LOG.error("Failed to send last message. Shutting down thread.", e);
this.finish();
}
try {
// Loop until the worker is stopped
while (running && !shutdown && sock != null) {
ByteBuffer b = null;
try {
// Get the send queue for this sid
ArrayBlockingQueue<ByteBuffer> bq = queueSendMap
.get(sid);
if (bq != null) {
b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
} else {
LOG.error("No queue of incoming messages for " +
"server " + sid);
break;
}
if(b != null){
lastMessageSent.put(sid, b);
// Send the message with send()
send(b);
}
} catch (InterruptedException e) {
LOG.warn("Interrupted while waiting for message on queue",
e);
}
}
} catch (Exception e) {
LOG.warn("Exception when using channel: for id " + sid + " my id = " +
self.getId() + " error = " + e);
}
this.finish();
LOG.warn("Send worker leaving thread");
}
synchronized void send(ByteBuffer b) throws IOException {
byte[] msgBytes = new byte[b.capacity()];
try {
b.position(0);
b.get(msgBytes);
} catch (BufferUnderflowException be) {
LOG.error("BufferUnderflowException ", be);
return;
}
// Write the data to the socket
dout.writeInt(b.capacity());
dout.write(b.array());
dout.flush();
}
3. Receiving votes from the election communication layer
Class: org.apache.zookeeper.server.quorum.QuorumCnxManager.RecvWorker
public void run() {
threadCnt.incrementAndGet();
try {
while (running && !shutdown && sock != null) {
/**
* Reads the first int to determine the length of the
* message
*/
// First read the length of the incoming message
int length = din.readInt();
if (length <= 0 || length > PACKETMAXSIZE) {
throw new IOException(
"Received packet with invalid packet: "
+ length);
}
/**
* Allocates a new ByteBuffer to receive the message
*/
byte[] msgArray = new byte[length];
// Then read exactly that many bytes
din.readFully(msgArray, 0, length);
ByteBuffer message = ByteBuffer.wrap(msgArray);
// Put the message on recvQueue, the receive queue of the communication layer
addToRecvQueue(new Message(message.duplicate(), sid));
}
} catch (Exception e) {
LOG.warn("Connection broken for id " + sid + ", my id = " +
self.getId() + ", error = " , e);
} finally {
LOG.warn("Interrupting SendWorker");
sw.finish();
if (sock != null) {
closeSocket(sock);
}
}
}
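SendWorker.send() and RecvWorker.run() agree on a simple wire format: a 4-byte length followed by that many payload bytes. The round trip can be reproduced in isolation with in-memory streams. This is a sketch under assumptions: `FramingSketch` and its methods are hypothetical names, and the `MAX_PACKET` value is an assumed stand-in for PACKETMAXSIZE, whose value the excerpt above does not show.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of length-prefixed framing over a byte stream. */
final class FramingSketch {
    static final int MAX_PACKET = 512 * 1024; // assumed upper bound for this sketch

    /** Writes the payload length as an int, then the payload itself. */
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
    }

    /** Reads one frame and rejects non-positive or oversized lengths. */
    static byte[] readFrame(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length <= 0 || length > MAX_PACKET) {
            throw new IOException("Invalid frame length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload, 0, length); // blocks until all bytes have arrived
        return payload;
    }

    /** Convenience for examples: frame a payload in memory and read it back. */
    static byte[] roundTrip(byte[] payload) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            writeFrame(new DataOutputStream(sink), payload);
            return readFrame(new DataInputStream(new ByteArrayInputStream(sink.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

readFully matters on a real socket because a plain read may return fewer bytes than requested, while readFully keeps reading until the whole frame has arrived or the stream ends.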
Class: org.apache.zookeeper.server.quorum.FastLeaderElection.Messenger.WorkerReceiver
public void run() {
Message response;
while (!stop) {
// Sleeps on receive
try{
// Take a message from the communication layer's receive queue
response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
// Nothing arrived: loop again
if(response == null) continue;
/*
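* Check whether the message came from a server without voting rights (an Observer)
*/
if(!self.getVotingView().containsKey(response.sid)){
/*
* The sender is an Observer, so its message is not counted.
* Just reply with this server's current vote by putting it
* on the outbound queue of the election logic layer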
*/
Vote current = self.getCurrentVote();
ToSend notmsg = new ToSend(ToSend.mType.notification,
current.getId(),
current.getZxid(),
logicalclock,
self.getPeerState(),
response.sid,
current.getPeerEpoch());
sendqueue.offer(notmsg);
} else {
// The sender is a voting member
// Receive new message
if (LOG.isDebugEnabled()) {
LOG.debug("Receive new notification message. My id = "
+ self.getId());
}
/*
* We check for 28 bytes for backward compatibility
*/
// A message shorter than 28 bytes is invalid: skip it and continue the loop
if (response.buffer.capacity() < 28) {
LOG.error("Got a short response: "
+ response.buffer.capacity());
continue;
}
boolean backCompatibility = (response.buffer.capacity() == 28);
response.buffer.clear();
// Instantiate Notification and set its attributes
Notification n = new Notification();
// State of peer that sent this message
QuorumPeer.ServerState ackstate = QuorumPeer.ServerState.LOOKING;
switch (response.buffer.getInt()) {
case 0:
ackstate = QuorumPeer.ServerState.LOOKING;
break;
case 1:
ackstate = QuorumPeer.ServerState.FOLLOWING;
break;
case 2:
ackstate = QuorumPeer.ServerState.LEADING;
break;
case 3:
ackstate = QuorumPeer.ServerState.OBSERVING;
break;
default:
continue;
}
n.leader = response.buffer.getLong();
n.zxid = response.buffer.getLong();
n.electionEpoch = response.buffer.getLong();
n.state = ackstate;
n.sid = response.sid;
if(!backCompatibility){
n.peerEpoch = response.buffer.getLong();
} else {
if(LOG.isInfoEnabled()){
LOG.info("Backward compatibility mode, server id=" + n.sid);
}
n.peerEpoch = ZxidUtils.getEpochFromZxid(n.zxid);
}
/*
* Version added in 3.4.6
*/
n.version = (response.buffer.remaining() >= 4) ?
response.buffer.getInt() : 0x0;
/*
* Print notification info
*/
if(LOG.isInfoEnabled()){
printNotification(n);
}
/*
* If this server is looking, then send proposed leader
*/
if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
recvqueue.offer(n);
/*
* Send a notification back if the peer that sent this
* message is also looking and its logical clock is
* lagging behind.
*/
if((ackstate == QuorumPeer.ServerState.LOOKING)
&& (n.electionEpoch < logicalclock)){
Vote v = getVote();
ToSend notmsg = new ToSend(ToSend.mType.notification,
v.getId(),
v.getZxid(),
logicalclock,
self.getPeerState(),
response.sid,
v.getPeerEpoch());
sendqueue.offer(notmsg);
}
} else {
/*
* If this server is not looking, but the one that sent the ack
* is looking, then send back what it believes to be the leader.
*/
Vote current = self.getCurrentVote();
if(ackstate == QuorumPeer.ServerState.LOOKING){
if(LOG.isDebugEnabled()){
LOG.debug("Sending new notification. My id = " +
self.getId() + " recipient=" +
response.sid + " zxid=0x" +
Long.toHexString(current.getZxid()) +
" leader=" + current.getId());
}
ToSend notmsg;
if(n.version > 0x0) {
notmsg = new ToSend(
ToSend.mType.notification,
current.getId(),
current.getZxid(),
current.getElectionEpoch(),
self.getPeerState(),
response.sid,
current.getPeerEpoch());
} else {
Vote bcVote = self.getBCVote();
notmsg = new ToSend(
ToSend.mType.notification,
bcVote.getId(),
bcVote.getZxid(),
bcVote.getElectionEpoch(),
self.getPeerState(),
response.sid,
bcVote.getPeerEpoch());
}
sendqueue.offer(notmsg);
}
}
}
} catch (InterruptedException e) {
System.out.println("Interrupted Exception while waiting for new message" +
e.toString());
}
}
LOG.info("WorkerReceiver is down");
}
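The field order that WorkerReceiver reads (state, leader, zxid, electionEpoch, then peerEpoch and an optional version) can be exercised with a small codec sketch. Several things here are assumptions: `NotificationCodecSketch` and `Decoded` are hypothetical names, the 40-byte encoder is only inferred from the decoding order above, and the legacy branch assumes the epoch sits in the high 32 bits of the zxid.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of the notification layout read by WorkerReceiver. */
final class NotificationCodecSketch {
    static final class Decoded {
        int state;
        long leader;
        long zxid;
        long electionEpoch;
        long peerEpoch;
        int version;
    }

    /** Assumed layout: int state, four longs, int version (4 + 32 + 4 = 40 bytes). */
    static ByteBuffer encode(int state, long leader, long zxid,
                             long electionEpoch, long peerEpoch, int version) {
        ByteBuffer b = ByteBuffer.allocate(40);
        b.putInt(state).putLong(leader).putLong(zxid)
         .putLong(electionEpoch).putLong(peerEpoch).putInt(version);
        b.clear(); // rewind; limit stays at capacity
        return b;
    }

    /**
     * Returns null for messages shorter than 28 bytes. Malformed lengths between
     * 29 and 35 are not guarded against and would throw BufferUnderflowException.
     */
    static Decoded decode(ByteBuffer buffer) {
        if (buffer.capacity() < 28) {
            return null;
        }
        boolean legacy = buffer.capacity() == 28; // old format without peerEpoch
        buffer.clear();
        Decoded d = new Decoded();
        d.state = buffer.getInt();
        d.leader = buffer.getLong();
        d.zxid = buffer.getLong();
        d.electionEpoch = buffer.getLong();
        // Assumption: in the legacy format the epoch is the high 32 bits of the zxid.
        d.peerEpoch = legacy ? (d.zxid >> 32) : buffer.getLong();
        d.version = buffer.remaining() >= 4 ? buffer.getInt() : 0;
        return d;
    }
}
```

Decoding a 28-byte message therefore still yields a usable peerEpoch, which is how the receiver stays compatible with senders that predate the extra field.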
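zk election uses a great many threads and queues. The fiddly part is working out which thread services which queue, but overall the code is not hard to read. A simplified flow chart follows: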