PartitionStateMachine is the state machine that the controller leader uses to maintain partition states. A partition's state is defined by PartitionState, which has four subclasses representing the four possible states (a minimal sketch of their definitions follows the list):
NonExistentPartition: the partition has never been created, or it was created and subsequently deleted.
NewPartition: the partition enters this state right after creation; its AR set has already been assigned, but no leader replica or ISR has been designated yet.
OfflinePartition: a leader replica was successfully elected for the partition, but the leader has since gone down.
OnlinePartition: the partition has successfully elected a leader and can serve requests.
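A minimal sketch of the state definitions, modeled on the Kafka 0.10.x source (each state carries a byte identifier):

sealed trait PartitionState { def state: Byte }
case object NewPartition extends PartitionState { val state: Byte = 0 }
case object OnlinePartition extends PartitionState { val state: Byte = 1 }
case object OfflinePartition extends PartitionState { val state: Byte = 2 }
case object NonExistentPartition extends PartitionState { val state: Byte = 3 }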
The operations performed during each state transition are listed below (a compact transition table, expressed in code, follows this list):
NonExistentPartition->NewPartition
Load the partition's AR set from ZooKeeper into the partitionReplicaAssignment collection of ControllerContext.
NewPartition->OnlinePartition
First the leader replica and the ISR are written to ZooKeeper: the first live replica in the partition's AR set is elected leader, and all live replicas of the partition form the ISR. Then a LeaderAndIsrRequest is sent to every live replica, directing these replicas to perform the leader/follower role switch, and an UpdateMetadataRequest is sent to every live broker to update its MetadataCache.
OnlinePartition/OfflinePartition->OnlinePartition
A new leader replica and ISR are selected for the partition and the result is written to ZooKeeper. Then a LeaderAndIsrRequest is sent to every live replica, directing these replicas to perform the leader/follower role switch, and an UpdateMetadataRequest is sent to every live broker to update its MetadataCache.
NewPartition/OnlinePartition/OfflinePartition->OfflinePartition
Only the state transition itself is performed; there are no other operations.
OfflinePartition->NonExistentPartition
Only the state transition itself is performed; there are no other operations.
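These legal transitions are enforced by assertValidPreviousStates inside handleStateChange (shown later). As an illustrative summary only (Kafka 0.10.x checks the predecessors with explicit assertions rather than a lookup table):

// Illustrative only: derived from the assertValidPreviousStates calls in handleStateChange below.
val validPreviousStates: Map[PartitionState, Set[PartitionState]] = Map(
  NewPartition         -> Set(NonExistentPartition),
  OnlinePartition      -> Set(NewPartition, OnlinePartition, OfflinePartition),
  OfflinePartition     -> Set(NewPartition, OnlinePartition, OfflinePartition),
  NonExistentPartition -> Set(OfflinePartition)
)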
class PartitionStateMachine(controller: KafkaController) extends Logging {
  private val controllerContext = controller.controllerContext // context information maintained by KafkaController
  private val controllerId = controller.config.brokerId
  private val zkUtils = controllerContext.zkUtils // ZooKeeper client used to interact with ZK
  private val partitionState: mutable.Map[TopicAndPartition, PartitionState] = mutable.Map.empty // records the current PartitionState of each partition
  private val brokerRequestBatch = new ControllerBrokerRequestBatch(controller) // used to batch requests to be sent to the specified brokers
  private val hasStarted = new AtomicBoolean(false)
  private val noOpPartitionLeaderSelector = new NoOpLeaderSelector(controllerContext) // default leader selector, which implements PartitionLeaderSelector. NoOpLeaderSelector performs no real leader election; it simply returns the current leader replica, the ISR, and the AR set (see the sketch after this class).
  private val topicChangeListener = new TopicChangeListener() // ZooKeeper listener watching for topic changes
  private val deleteTopicsListener = new DeleteTopicsListener() // ZooKeeper listener watching for topic deletions
  private val partitionModificationsListeners: mutable.Map[String, PartitionModificationsListener] = mutable.Map.empty
}
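The selector referenced above is essentially a pass-through. A sketch consistent with that description, based on the Kafka 0.10.x source:

class NoOpLeaderSelector(controllerContext: ControllerContext) extends PartitionLeaderSelector with Logging {
  // Returns the current leader/ISR unchanged, together with the partition's full AR set;
  // no election actually takes place.
  def selectLeader(topicAndPartition: TopicAndPartition,
                   currentLeaderAndIsr: LeaderAndIsr): (LeaderAndIsr, Seq[Int]) = {
    warn("I should never have been asked to perform leader election, returning the current LeaderAndIsr and replica assignment.")
    (currentLeaderAndIsr, controllerContext.partitionReplicaAssignment(topicAndPartition))
  }
}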
When PartitionStateMachine starts, it initializes partitionState and calls triggerOnlinePartitionStateChange() to move partitions in the NewPartition or OfflinePartition state to the OnlinePartition state.
def startup() {
  // initialize partition state
  initializePartitionState()
  // set started flag
  hasStarted.set(true)
  // try to move partitions to online state
  triggerOnlinePartitionStateChange()
  info("Started partition state machine with initial state -> " + partitionState.toString())
}
private def initializePartitionState() {
  // iterate over all partitions in the cluster
  for ((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) {
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      case Some(currentLeaderIsrAndEpoch) => // leader replica and ISR information exists
        // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
        controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader) match {
          // the broker hosting the leader replica is alive: initialize to OnlinePartition
          case true => // leader is alive
            partitionState.put(topicPartition, OnlinePartition)
          // the broker hosting the leader replica is down: initialize to OfflinePartition
          case false =>
            partitionState.put(topicPartition, OfflinePartition)
        }
      // no leader replica or ISR information
      case None =>
        partitionState.put(topicPartition, NewPartition)
    }
  }
}
PartitionStateMachine.handleStateChange() is the core method for managing partition state; it drives all PartitionState transitions.
private def handleStateChange(topic: String, partition: Int, targetState: PartitionState,
                              leaderSelector: PartitionLeaderSelector,
                              callbacks: Callbacks) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  // check that this PartitionStateMachine has been started; only the controller leader's
  // PartitionStateMachine object is started, so throw an exception otherwise
  if (!hasStarted.get)
    throw new StateChangeFailedException(("Controller %d epoch %d initiated state change for partition %s to %s failed because " +
      "the partition state machine has not started")
      .format(controllerId, controller.epoch, topicAndPartition, targetState))
  // look up the partition's state in the partitionState map, initializing it if absent
  val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition)
  try {
    // before the transition begins, verify against targetState that the partition's
    // current state is a legal predecessor
    targetState match {
      case NewPartition => // set the partition state to NewPartition
        // pre: partition did not exist before this
        assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
        partitionState.put(topicAndPartition, NewPartition)
        val assignedReplicas = controllerContext.partitionReplicaAssignment(topicAndPartition).mkString(",")
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s with assigned replicas %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState,
            assignedReplicas))
        // post: partition has been assigned replicas
      case OnlinePartition =>
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
        partitionState(topicAndPartition) match {
          case NewPartition => // initialize the leader and ISR for the partition
            // initialize leader and isr path for new partition
            initializeLeaderAndIsrForPartition(topicAndPartition)
          case OfflinePartition => // elect a new leader replica
            electLeaderForPartition(topic, partition, leaderSelector)
          case OnlinePartition => // invoked when the leader needs to be re-elected
            electLeaderForPartition(topic, partition, leaderSelector)
          case _ => // should never come here since illegal previous states are checked above
        }
        // set the partition state to OnlinePartition
        partitionState.put(topicAndPartition, OnlinePartition)
        val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
        // post: partition has a leader
      case OfflinePartition => // set the partition state to OfflinePartition
        // pre: partition should be in New or Online state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
        // should be called when the leader for a partition is no longer alive
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, OfflinePartition)
        // post: partition has no alive leader
      case NonExistentPartition => // set the partition state to NonExistentPartition
        // pre: partition should be in Offline state
        assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, NonExistentPartition)
        // post: partition state is deleted from all brokers and zookeeper
    }
  } catch {
    case t: Throwable =>
      stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
  }
}
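handleStateChange relies on assertValidPreviousStates to reject illegal transitions. A minimal sketch consistent with how it is called above, based on the Kafka 0.10.x source:

private def assertValidPreviousStates(topicAndPartition: TopicAndPartition,
                                      fromStates: Seq[PartitionState],
                                      targetState: PartitionState) {
  // throw if the partition's current state is not one of the allowed predecessors
  if (!fromStates.contains(partitionState(topicAndPartition)))
    throw new IllegalStateException("Partition %s should be in the %s states before moving to %s state"
      .format(topicAndPartition, fromStates.mkString(","), targetState) +
      ". Instead it is in %s state".format(partitionState(topicAndPartition)))
}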
The concrete operations for each transition are shown below.
NewPartition->OnlinePartition
/**
 * Invoked on the NewPartition->OnlinePartition state change. When a partition is in the New state, it does not have
 * a leader and isr path in zookeeper. Once the partition moves to the OnlinePartition state, its leader and isr
 * path gets initialized and it never goes back to the NewPartition state. From here, it can only go to the
 * OfflinePartition state.
 * @param topicAndPartition The topic/partition whose leader and isr path is to be initialized
 */
private def initializeLeaderAndIsrForPartition(topicAndPartition: TopicAndPartition) {
  // fetch the partition's AR set
  val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
  // filter the AR set down to the replicas that are currently alive
  val liveAssignedReplicas = replicaAssignment.filter(r => controllerContext.liveBrokerIds.contains(r))
  liveAssignedReplicas.size match {
    case 0 => // no live replicas: throw an exception
      val failMsg = ("encountered error during state change of partition %s from New to Online, assigned replicas are [%s], " +
        "live brokers are [%s]. No assigned replica is alive.")
        .format(topicAndPartition, replicaAssignment.mkString(","), controllerContext.liveBrokerIds)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg)
    case _ =>
      debug("Live assigned replicas for partition %s are: [%s]".format(topicAndPartition, liveAssignedReplicas))
      // make the first replica in the list of assigned replicas, the leader
      // i.e., elect the first replica of the live AR set as leader
      val leader = liveAssignedReplicas.head
      // create the LeaderIsrAndControllerEpoch object; its ISR is the live AR set,
      // and leaderEpoch and zkVersion are both initialized to 0
      val leaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(new LeaderAndIsr(leader, liveAssignedReplicas.toList),
        controller.epoch)
      debug("Initializing leader and isr for partition %s to %s".format(topicAndPartition, leaderIsrAndControllerEpoch))
      try {
        // serialize the LeaderIsrAndControllerEpoch information to JSON and store it in ZK
        zkUtils.createPersistentPath(
          getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
          zkUtils.leaderAndIsrZkData(leaderIsrAndControllerEpoch.leaderAndIsr, controller.epoch))
        // NOTE: the above write can fail only if the current controller lost its zk session and the new controller
        // took over and initialized this partition. This can happen if the current controller went into a long
        // GC pause
        // update the partitionLeadershipInfo record
        controllerContext.partitionLeadershipInfo.put(topicAndPartition, leaderIsrAndControllerEpoch)
        // queue a LeaderAndIsrRequest to be sent later
        brokerRequestBatch.addLeaderAndIsrRequestForBrokers(liveAssignedReplicas, topicAndPartition.topic,
          topicAndPartition.partition, leaderIsrAndControllerEpoch, replicaAssignment)
      } catch {
        case e: ZkNodeExistsException =>
          // read the controller epoch
          val leaderIsrAndEpoch = ReplicationUtils.getLeaderIsrAndEpochForPartition(zkUtils, topicAndPartition.topic,
            topicAndPartition.partition).get
          val failMsg = ("encountered error while changing partition %s's state from New to Online since LeaderAndIsr path already " +
            "exists with value %s and controller epoch %d")
            .format(topicAndPartition, leaderIsrAndEpoch.leaderAndIsr.toString(), leaderIsrAndEpoch.controllerEpoch)
          stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
          throw new StateChangeFailedException(failMsg)
      }
  }
}
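For illustration, here is the shape of the znode this method creates. The topic name "test", partition 0, and broker IDs below are hypothetical; the payload layout follows the version-1 leader-and-isr JSON format that zkUtils.leaderAndIsrZkData emits:

// Path returned by getTopicPartitionLeaderAndIsrPath("test", 0):
//   /brokers/topics/test/partitions/0/state
// Payload for a live AR of [1, 2, 3]: broker 1 becomes leader and the ISR is [1, 2, 3]
// (controller epoch assumed to be 1; leader_epoch and zkVersion start at 0):
//   {"controller_epoch":1,"leader":1,"version":1,"leader_epoch":0,"isr":[1,2,3]}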
When a partition transitions from OfflinePartition or OnlinePartition to OnlinePartition, the electLeaderForPartition method is invoked:
/**
 * Invoked on the OfflinePartition,OnlinePartition->OnlinePartition state change.
 * It invokes the leader election API to elect a leader for the input offline partition
 * @param topic The topic of the offline partition
 * @param partition The offline partition
 * @param leaderSelector Specific leader selector (e.g., offline/reassigned/etc.)
 */
def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  // handle leader election for the partitions whose leader is no longer alive
  stateChangeLogger.trace("Controller %d epoch %d started leader election for partition %s"
    .format(controllerId, controller.epoch, topicAndPartition))
  try {
    var zookeeperPathUpdateSucceeded: Boolean = false
    var newLeaderAndIsr: LeaderAndIsr = null
    var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
    while (!zookeeperPathUpdateSucceeded) {
      // read the current leader replica, ISR, zkVersion, etc. from ZooKeeper; throw if absent
      val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition)
      val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
      val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
      // compare the controller epoch recorded in ZK for this partition with the current controller's epoch
      if (controllerEpoch > controller.epoch) {
        val failMsg = ("aborted leader election for partition [%s,%d] since the LeaderAndIsr path was " +
          "already written by another controller. This probably means that the current controller %d went through " +
          "a soft failure and another controller was elected with epoch %d.")
          .format(topic, partition, controllerId, controllerEpoch)
        stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
        throw new StateChangeFailedException(failMsg)
      }
      // elect new leader or throw exception
      // use the given PartitionLeaderSelector and the current LeaderAndIsr to elect a new leader and ISR
      val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr)
      // write the newly elected leader and ISR to ZK
      val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partition,
        leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
      newLeaderAndIsr = leaderAndIsr
      newLeaderAndIsr.zkVersion = newVersion
      zookeeperPathUpdateSucceeded = updateSucceeded
      replicasForThisPartition = replicas
    }
    val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
    // update the leader cache, i.e. the partitionLeadershipInfo record
    controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
    stateChangeLogger.trace("Controller %d epoch %d elected leader %d for Offline partition %s"
      .format(controllerId, controller.epoch, newLeaderAndIsr.leader, topicAndPartition))
    val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
    // queue a LeaderAndIsrRequest to be sent later
    brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
      newLeaderIsrAndControllerEpoch, replicas)
  } catch {
    case lenne: LeaderElectionNotNeededException => // swallow
    case nroe: NoReplicaOnlineException => throw nroe
    case sce: Throwable =>
      val failMsg = "encountered error while electing leader for partition %s due to: %s.".format(topicAndPartition, sce.getMessage)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg, sce)
  }
  debug("After leader election, leader cache is updated to %s".format(controllerContext.partitionLeadershipInfo.map(l => (l._1, l._2))))
}
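The while loop above is optimistic concurrency control on the leader-and-isr znode: ReplicationUtils.updateLeaderAndIsr performs a conditional write that succeeds only if the znode's version still equals the zkVersion read earlier, so a concurrent writer forces a re-read and another attempt. A minimal standalone sketch of the pattern, where readState, compute, and conditionalWrite are hypothetical stand-ins for getLeaderIsrAndEpochOrThrowException, leaderSelector.selectLeader, and ReplicationUtils.updateLeaderAndIsr:

// Hypothetical sketch of the read/compute/conditional-write retry loop used above.
def updateWithRetry[T](
    readState: () => (T, Int),                   // reads (value, zkVersion)
    compute: T => T,                             // computes the new value from the current one
    conditionalWrite: (T, Int) => (Boolean, Int) // version-conditional write, returns (succeeded, newVersion)
): (T, Int) = {
  var result: Option[(T, Int)] = None
  while (result.isEmpty) {
    val (current, version) = readState()                   // re-read the latest value and version on every attempt
    val next = compute(current)
    val (ok, newVersion) = conditionalWrite(next, version) // succeeds only if the version is unchanged
    if (ok) result = Some((next, newVersion))              // on a version conflict, loop and retry
  }
  result.get
}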
The triggerOnlinePartitionStateChange method iterates over all partitions in the partitionState map and transitions those in the OfflinePartition or NewPartition state to OnlinePartition; once the transition succeeds, the partition can serve client requests.
/**
 * This API invokes the OnlinePartition state change on all partitions in either the NewPartition or OfflinePartition
 * state. This is called on a successful controller election and on broker changes
 */
def triggerOnlinePartitionStateChange() {
  try {
    // check that the three collections inside ControllerBrokerRequestBatch are empty
    brokerRequestBatch.newBatch()
    // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
    // that belong to topics to be deleted
    for ((topicAndPartition, partitionState) <- partitionState
         if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
      if (partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        // call handleStateChange to transition the partition to OnlinePartition
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
          (new CallbackBuilder).build)
    }
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
  } catch {
    case e: Throwable => error("Error while moving some partitions to the online state", e)
    // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
  }
}
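As the scaladoc above notes, this method runs after a successful controller election and on broker changes. A hypothetical sketch of what those call sites could look like inside KafkaController (the surrounding logic is elided; recall that startup() itself ends by calling triggerOnlinePartitionStateChange()):

// Hypothetical sketch; KafkaController's real methods do considerably more.
def onControllerFailover() {
  // ... register ZK listeners, initialize ControllerContext ...
  partitionStateMachine.startup() // startup() also triggers the Online transition
}
def onBrokerStartup(newBrokers: Seq[Int]) {
  // ... send UpdateMetadataRequest, bring replicas on the new brokers online ...
  // newly live brokers may make NewPartition/OfflinePartition partitions electable again
  partitionStateMachine.triggerOnlinePartitionStateChange()
}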