A quick walkthrough of the ZooKeeper startup flow

This post walks through the startup flow of ZooKeeper 3.4.6: the startup script, the main work done by QuorumPeerMain, how ServerCnxnFactory operates, and the data-recovery process.

While waiting for the test folks to wrap up, I took the chance to sort through ZooKeeper's startup flow.

Version: ZooKeeper 3.4.6

Inside the startup script (zkServer.sh):

 nohup "$JAVA" "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \    -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &

Expanded by the shell, it turns into a real mouthful:

java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/local/luping/zookeeper-3.4.6/bin/../build/classes:/usr/local/luping/zookeeper-3.4.6/bin/../build/lib/*.jar:/usr/local/luping/zookeeper-3.4.6/bin/../lib/slf4j-log4j12-1.6.1.jar:/usr/local/luping/zookeeper-3.4.6/bin/../lib/slf4j-api-1.6.1.jar:/usr/local/luping/zookeeper-3.4.6/bin/../lib/netty-3.7.0.Final.jar:/usr/local/luping/zookeeper-3.4.6/bin/../lib/log4j-1.2.16.jar:/usr/local/luping/zookeeper-3.4.6/bin/../lib/jline-0.9.94.jar:/usr/local/luping/zookeeper-3.4.6/bin/../zookeeper-3.4.6.jar:/usr/local/luping/zookeeper-3.4.6/bin/../src/java/lib/*.jar:/usr/local/luping/zookeeper-3.4.6/bin/../conf:.:/data/java/jdk1.7.0_15/jre/lib/rt.jar:/data/java/jdk1.7.0_15/lib/dt.jar:/data/java/jdk1.7.0_15/lib/tools.jar -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain
/usr/local/luping/zookeeper-3.4.6/bin/../conf/zoo.cfg

So the entry point is QuorumPeerMain. The key lines of its main():

QuorumPeerMain main = new QuorumPeerMain();
main.initializeAndRun(args);

initializeAndRun() decides whether ZooKeeper starts in standalone mode or in cluster (quorum) mode:

    protected void initializeAndRun(String[] args)
        throws ConfigException, IOException
    {
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            config.parse(args[0]);
        }

        // Start and schedule the purge task
        DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
                .getDataDir(), config.getDataLogDir(), config
                .getSnapRetainCount(), config.getPurgeInterval());
        purgeMgr.start();

        if (args.length == 1 && config.servers.size() > 0) {
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                    + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            ZooKeeperServerMain.main(args);
        }
    }
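The standalone-vs-quorum decision comes down to whether zoo.cfg contains server.N entries: with them present, config.servers.size() > 0 and the runFromConfig() branch is taken; without them, execution falls through to ZooKeeperServerMain. For reference, a minimal quorum config might look like this (hostnames and paths are placeholders):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/data/zookeeper
    clientPort=2181
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888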

In cluster mode we go through runFromConfig(), which first registers the log4j MBeans:

ManagedUtil.registerLog4jMBeans();

Then the connection factory is created; the default is NIO-based, i.e. NIOServerCnxnFactory:

ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
cnxnFactory.configure(config.getClientPortAddress(),
                      config.getMaxClientCnxns());

createFactory() resolves the implementation class from a system property, falling back to NIOServerCnxnFactory:

    static public ServerCnxnFactory createFactory() throws IOException {
        String serverCnxnFactoryName =
            System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
        if (serverCnxnFactoryName == null) {
            serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
        }
        try {
            return (ServerCnxnFactory) Class.forName(serverCnxnFactoryName)
                                                .newInstance();
        } catch (Exception e) {
            IOException ioe = new IOException("Couldn't instantiate "
                    + serverCnxnFactoryName);
            ioe.initCause(e);
            throw ioe;
        }
    }
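ZOOKEEPER_SERVER_CNXN_FACTORY here is the zookeeper.serverCnxnFactory system property, so the NIO factory can be swapped for the Netty-based one that also ships with 3.4.6. A quick sketch (normally you would pass this as a -D flag on the command line rather than call System.setProperty):

    import org.apache.zookeeper.server.ServerCnxnFactory;

    public class FactorySwapDemo {
        public static void main(String[] args) throws Exception {
            // Equivalent to -Dzookeeper.serverCnxnFactory=... on the command line.
            System.setProperty("zookeeper.serverCnxnFactory",
                    "org.apache.zookeeper.server.NettyServerCnxnFactory");
            ServerCnxnFactory factory = ServerCnxnFactory.createFactory();
            System.out.println(factory.getClass().getName()); // NettyServerCnxnFactory
        }
    }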

Next, a QuorumPeer is instantiated, configured from the parsed settings, and finally started:

    ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
          cnxnFactory.configure(config.getClientPortAddress(),
                                config.getMaxClientCnxns());

          quorumPeer = new QuorumPeer();
          quorumPeer.setClientPortAddress(config.getClientPortAddress());
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      new File(config.getDataLogDir()),
                      new File(config.getDataDir())));
          quorumPeer.setQuorumPeers(config.getServers());
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier());
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());

          quorumPeer.start();
          quorumPeer.join();

ServerCnxnFactory.configure() sets up the ServerSocketChannel (the selector it registers with is a field of the factory, created via Selector.open()):

  public void configure(InetSocketAddress addr, int maxcc) throws IOException {
        configureSaslLogin();

        thread = new Thread(this, "NIOServerCxn.Factory:" + addr);
        thread.setDaemon(true);
        maxClientCnxns = maxcc;
        this.ss = ServerSocketChannel.open();
        ss.socket().setReuseAddress(true);
        LOG.info("binding to port " + addr);
        ss.socket().bind(addr);
        ss.configureBlocking(false);
        ss.register(selector, SelectionKey.OP_ACCEPT);
    }

Now QuorumPeer.start():

    public synchronized void start() {
        loadDataBase();
        cnxnFactory.start();        
        startLeaderElection();
        super.start();
    }

1. Load the on-disk data: loadDataBase();

2. Start the NIO thread: cnxnFactory.start();

3. Start leader election: startLeaderElection().

First, loading data from disk in loadDataBase():

    public long loadDataBase() throws IOException {
        PlayBackListener listener=new PlayBackListener(){
            public void onTxnLoaded(TxnHeader hdr,Record txn){
                Request r = new Request(null, 0, hdr.getCxid(),hdr.getType(),
                        null, null);
                r.txn = txn;
                r.hdr = hdr;
                r.zxid = hdr.getZxid();
                addCommittedProposal(r);
            }
        };

        long zxid = snapLog.restore(dataTree,sessionsWithTimeouts,listener);
        initialized = true;
        return zxid;
    }
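loadDataBase() hands back the highest zxid recovered. Worth remembering when reading these values: a zxid packs the leader epoch into its high 32 bits and a per-epoch counter into the low 32 bits (ZooKeeper's own ZxidUtils performs exactly this split). A tiny demo:

    public class ZxidDemo {
        public static void main(String[] args) {
            long zxid = 0x100000005L;          // example: epoch 1, 5th txn in that epoch
            long epoch = zxid >> 32L;          // high 32 bits -> 1
            long counter = zxid & 0xffffffffL; // low 32 bits  -> 5
            System.out.printf("epoch=%d counter=%d%n", epoch, counter);
        }
    }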

State is recovered from zk's snapshot and transaction log via snapLog:

 long zxid = snapLog.restore(dataTree,sessionsWithTimeouts,listener);

For each transaction replayed from the log, the PlayBackListener's onTxnLoaded() callback fires and records it as a committed proposal via addCommittedProposal():

  public long restore(DataTree dt, Map<Long, Integer> sessions, 
            PlayBackListener listener) throws IOException {
        snapLog.deserialize(dt, sessions);
        FileTxnLog txnLog = new FileTxnLog(dataDir);
        TxnIterator itr = txnLog.read(dt.lastProcessedZxid+1);
        long highestZxid = dt.lastProcessedZxid;
        TxnHeader hdr;
        try {
            while (true) {
                // iterator points to 
                // the first valid txn when initialized
                hdr = itr.getHeader();
                if (hdr == null) {
                    //empty logs 
                    return dt.lastProcessedZxid;
                }
                if (hdr.getZxid() < highestZxid && highestZxid != 0) {
                    LOG.error("{}(higestZxid) > {}(next log) for type {}",
                            new Object[] { highestZxid, hdr.getZxid(),
                                    hdr.getType() });
                } else {
                    highestZxid = hdr.getZxid();
                }
                try {
                    processTransaction(hdr,dt,sessions, itr.getTxn());
                } catch(KeeperException.NoNodeException e) {
                   throw new IOException("Failed to process transaction type: " +
                         hdr.getType() + " error: " + e.getMessage(), e);
                }
                listener.onTxnLoaded(hdr, itr.getTxn());
                if (!itr.next()) 
                    break;
            }
        } finally {
            if (itr != null) {
                itr.close();
            }
        }
        return highestZxid;
    }
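The addCommittedProposal() callback above feeds ZKDatabase's in-memory committedLog, a bounded window of recent proposals that lets the leader send a slightly-lagging follower a DIFF instead of a full snapshot. A toy sketch of the idea (names are mine; the real ZKDatabase keeps full Proposal objects, guarded by a read/write lock):

    import java.util.ArrayDeque;
    import java.util.Deque;

    class CommittedLog {
        private static final int MAX = 500;  // ZKDatabase caps this via commitLogCount
        private final Deque<Long> zxids = new ArrayDeque<Long>();
        private long minZxid, maxZxid;

        synchronized void add(long zxid) {
            if (zxids.size() == MAX) {
                zxids.removeFirst();         // evict the oldest proposal
            }
            zxids.addLast(zxid);
            minZxid = zxids.peekFirst();     // oldest zxid still in the window
            maxZxid = zxid;                  // newest committed zxid
        }
    }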

Next, NIOServerCnxnFactory's start():

    public void start() {
        // ensure thread is started once and only once
        if (thread.getState() == Thread.State.NEW) {
            thread.start();
        }
    }

The thread here actually runs the NIOServerCnxnFactory itself, since the factory implements Runnable:

thread = new Thread(this, "NIOServerCxn.Factory:" + addr);
thread.setDaemon(true);

which lands us in the run() loop:

    public void run() {
        while (!ss.socket().isClosed()) {
            try {
                selector.select(1000);
                Set<SelectionKey> selected;
                synchronized (this) {
                    selected = selector.selectedKeys();
                }
                ArrayList<SelectionKey> selectedList = new ArrayList<SelectionKey>(
                        selected);
                Collections.shuffle(selectedList);
                for (SelectionKey k : selectedList) {
                    if ((k.readyOps() & SelectionKey.OP_ACCEPT) != 0) {
                        SocketChannel sc = ((ServerSocketChannel) k
                                .channel()).accept();
                        InetAddress ia = sc.socket().getInetAddress();
                        int cnxncount = getClientCnxnCount(ia);
                        if (maxClientCnxns > 0 && cnxncount >= maxClientCnxns){
                            LOG.warn("Too many connections from " + ia
                                     + " - max is " + maxClientCnxns );
                            sc.close();
                        } else {
                            LOG.info("Accepted socket connection from "
                                     + sc.socket().getRemoteSocketAddress());
                            sc.configureBlocking(false);
                            SelectionKey sk = sc.register(selector,
                                    SelectionKey.OP_READ);
                            NIOServerCnxn cnxn = createConnection(sc, sk);
                            sk.attach(cnxn);
                            addCnxn(cnxn);
                        }
                    } else if ((k.readyOps() & (SelectionKey.OP_READ | SelectionKey.OP_WRITE)) != 0) {
                        NIOServerCnxn c = (NIOServerCnxn) k.attachment();
                        c.doIO(k);
                    } else {
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Unexpected ops in select "
                                      + k.readyOps());
                        }
                    }
                }
                selected.clear();
            } catch (RuntimeException e) {
                LOG.warn("Ignoring unexpected runtime exception", e);
            } catch (Exception e) {
                LOG.warn("Ignoring exception", e);
            }
        }
        closeAll();
        LOG.info("NIOServerCnxn factory exited run method");
    }

The loop listens for I/O, blocking on the selector for up to 1 second per iteration; the ready keys are then shuffled, presumably so no single connection is always serviced first:

selector.select(1000);
Set<SelectionKey> selected;
synchronized (this) {
    selected = selector.selectedKeys();
}
ArrayList<SelectionKey> selectedList = new ArrayList<SelectionKey>(
        selected);
Collections.shuffle(selectedList);

For an accept event, the connection is accepted; if the client's IP has not yet reached maxClientCnxns, the new channel is registered with the selector for SelectionKey.OP_READ, a NIOServerCnxn is created via createConnection(), attached to the key, and stored in the connection set:

        LOG.info("Accepted socket connection from "
                                     + sc.socket().getRemoteSocketAddress());
                            sc.configureBlocking(false);
                            SelectionKey sk = sc.register(selector,
                                    SelectionKey.OP_READ);
                            NIOServerCnxn cnxn = createConnection(sc, sk);
                            sk.attach(cnxn);
                            addCnxn(cnxn);
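To see the accept-then-register pattern in isolation, here is a self-contained toy server (my own sketch, not ZooKeeper code) built on the same Selector mechanics:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    public class MiniSelectorServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel ss = ServerSocketChannel.open();
            ss.socket().setReuseAddress(true);
            ss.socket().bind(new InetSocketAddress(2181));
            ss.configureBlocking(false);
            ss.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buf = ByteBuffer.allocate(1024);
            while (true) {
                selector.select(1000);                 // wake up at least once a second
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey k = it.next();
                    it.remove();
                    if (k.isAcceptable()) {
                        SocketChannel sc = ((ServerSocketChannel) k.channel()).accept();
                        if (sc == null) continue;      // nothing actually pending
                        sc.configureBlocking(false);
                        sc.register(selector, SelectionKey.OP_READ);
                    } else if (k.isReadable()) {
                        SocketChannel sc = (SocketChannel) k.channel();
                        buf.clear();
                        if (sc.read(buf) < 0) {        // peer closed its end
                            k.cancel();
                            sc.close();
                        }
                    }
                }
            }
        }
    }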

OP_READ and OP_WRITE (read/write) events are handed off to the connection's doIO():

if ((k.readyOps() & (SelectionKey.OP_READ | SelectionKey.OP_WRITE)) != 0) {
    NIOServerCnxn c = (NIOServerCnxn) k.attachment();
    c.doIO(k);
}
    /**
     * Handles read/write IO on connection.
     */
    void doIO(SelectionKey k) throws InterruptedException {
        try {
            if (isSocketOpen() == false) {
                LOG.warn("trying to do i/o on a null socket for session:0x"
                         + Long.toHexString(sessionId));

                return;
            }
            if (k.isReadable()) {
                int rc = sock.read(incomingBuffer);
                if (rc < 0) {
                    throw new EndOfStreamException(
                            "Unable to read additional data from client sessionid 0x"
                            + Long.toHexString(sessionId)
                            + ", likely client has closed socket");
                }
                if (incomingBuffer.remaining() == 0) {
                    boolean isPayload;
                    if (incomingBuffer == lenBuffer) { // start of next request
                        incomingBuffer.flip();
                        isPayload = readLength(k);
                        incomingBuffer.clear();
                    } else {
                        // continuation
                        isPayload = true;
                    }
                    if (isPayload) { // not the case for 4letterword
                        readPayload();
                    }
                    else {
                        // four letter words take care
                        // need not do anything else
                        return;
                    }
                }
            }
            if (k.isWritable()) {
                // ZooLog.logTraceMessage(LOG,
                // ZooLog.CLIENT_DATA_PACKET_TRACE_MASK
                // "outgoingBuffers.size() = " +
                // outgoingBuffers.size());
                if (outgoingBuffers.size() > 0) {
                    // ZooLog.logTraceMessage(LOG,
                    // ZooLog.CLIENT_DATA_PACKET_TRACE_MASK,
                    // "sk " + k + " is valid: " +
                    // k.isValid());

                    /*
                     * This is going to reset the buffer position to 0 and the
                     * limit to the size of the buffer, so that we can fill it
                     * with data from the non-direct buffers that we need to
                     * send.
                     */
                    ByteBuffer directBuffer = factory.directBuffer;
                    directBuffer.clear();

                    for (ByteBuffer b : outgoingBuffers) {
                        if (directBuffer.remaining() < b.remaining()) {
                            /*
                             * When we call put later, if the directBuffer is too
                             * small to hold everything, nothing will be copied,
                             * so we've got to slice the buffer if it's too big.
                             */
                            b = (ByteBuffer) b.slice().limit(
                                    directBuffer.remaining());
                        }
                        /*
                         * put() is going to modify the positions of both
                         * buffers, but we don't want to change the position of
                         * the source buffers (we'll do that after the send, if
                         * needed), so we save and reset the position after the
                         * copy
                         */
                        int p = b.position();
                        directBuffer.put(b);
                        b.position(p);
                        if (directBuffer.remaining() == 0) {
                            break;
                        }
                    }
                    /*
                     * Do the flip: limit becomes position, position gets set to
                     * 0. This sets us up for the write.
                     */
                    directBuffer.flip();

                    int sent = sock.write(directBuffer);
                    ByteBuffer bb;

                    // Remove the buffers that we have sent
                    while (outgoingBuffers.size() > 0) {
                        bb = outgoingBuffers.peek();
                        if (bb == ServerCnxnFactory.closeConn) {
                            throw new CloseRequestException("close requested");
                        }
                        int left = bb.remaining() - sent;
                        if (left > 0) {
                            /*
                             * We only partially sent this buffer, so we update
                             * the position and exit the loop.
                             */
                            bb.position(bb.position() + sent);
                            break;
                        }
                        packetSent();
                        /* We've sent the whole buffer, so drop the buffer */
                        sent -= bb.remaining();
                        outgoingBuffers.remove();
                    }
                    // ZooLog.logTraceMessage(LOG,
                    // ZooLog.CLIENT_DATA_PACKET_TRACE_MASK, "after send,
                    // outgoingBuffers.size() = " + outgoingBuffers.size());
                }

                synchronized(this.factory){
                    if (outgoingBuffers.size() == 0) {
                        if (!initialized
                                && (sk.interestOps() & SelectionKey.OP_READ) == 0) {
                            throw new CloseRequestException("responded to info probe");
                        }
                        sk.interestOps(sk.interestOps()
                                & (~SelectionKey.OP_WRITE));
                    } else {
                        sk.interestOps(sk.interestOps()
                                | SelectionKey.OP_WRITE);
                    }
                }
            }
        } catch (CancelledKeyException e) {
            LOG.warn("Exception causing close of session 0x"
                    + Long.toHexString(sessionId)
                    + " due to " + e);
            if (LOG.isDebugEnabled()) {
                LOG.debug("CancelledKeyException stack trace", e);
            }
            close();
        } catch (CloseRequestException e) {
            // expecting close to log session closure
            close();
        } catch (EndOfStreamException e) {
            LOG.warn("caught end of stream exception",e); // tell user why

            // expecting close to log session closure
            close();
        } catch (IOException e) {
            LOG.warn("Exception causing close of session 0x"
                    + Long.toHexString(sessionId)
                    + " due to " + e);
            if (LOG.isDebugEnabled()) {
                LOG.debug("IOException stack trace", e);
            }
            close();
        }
    }
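The read half of doIO() implements simple length-prefixed framing: lenBuffer is a fixed 4-byte buffer, and once it fills, readLength() pulls out the payload length (or detects a four-letter admin command such as ruok) and sizes incomingBuffer accordingly. A stripped-down sketch of that framing, with my own names and without the four-letter-word special case:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    class FrameReader {
        private final ByteBuffer lenBuf = ByteBuffer.allocate(4);
        private ByteBuffer payload;                     // allocated once the length is known

        /** Returns a complete frame, or null if more bytes are still needed. */
        ByteBuffer read(SocketChannel ch) throws IOException {
            if (payload == null) {
                if (ch.read(lenBuf) < 0) throw new IOException("peer closed socket");
                if (lenBuf.hasRemaining()) return null; // length prefix incomplete
                lenBuf.flip();
                payload = ByteBuffer.allocate(lenBuf.getInt()); // big-endian, like ZooKeeper
                lenBuf.clear();
            }
            if (ch.read(payload) < 0) throw new IOException("peer closed socket");
            if (payload.hasRemaining()) return null;    // payload incomplete, try again later
            ByteBuffer frame = payload;
            payload = null;                             // ready for the next frame
            frame.flip();
            return frame;
        }
    }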
