DelayedOperationPurgatory Mechanism (7): The DelayedFetch Implementation

This article takes a deep look at how Kafka's DelayedFetch mechanism works, covering its key fields, the execution logic and completion conditions of tryComplete, and how onComplete builds the FetchResponse and returns it to the client.

The main fields of DelayedFetch are shown below:

class DelayedFetch(
                   // how long this delayed operation may stay pending before it expires
                   delayMs: Long,
                   // per-partition state for every partition in the FetchRequest, used mainly
                   // to decide whether this DelayedFetch is ready to complete
                   fetchMetadata: FetchMetadata,
                   replicaManager: ReplicaManager,
                   // invoked from onComplete when the operation is satisfied or expires; it builds
                   // the FetchResponse and appends it to the corresponding responseQueue in RequestChannel
                   responseCallback: Map[TopicAndPartition, FetchResponsePartitionData] => Unit)
  extends DelayedOperation(delayMs) {}
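
For reference, here is a simplified sketch of FetchMetadata and FetchPartitionStatus, paraphrased from ReplicaManager.scala of this Kafka line (companion members omitted); tryComplete below relies on nearly every field:

case class FetchMetadata(fetchMinBytes: Int,          // minimum bytes before the fetch may complete (Case D)
                         fetchOnlyLeader: Boolean,    // whether only the leader replica may serve the read
                         fetchOnlyCommitted: Boolean, // true for consumers: read only committed data, up to the HW
                         isFromFollower: Boolean,     // whether the request came from a follower replica
                         fetchPartitionStatus: Map[TopicAndPartition, FetchPartitionStatus])

case class FetchPartitionStatus(startOffsetMetadata: LogOffsetMetadata, // where the previous read ended
                                fetchInfo: PartitionFetchInfo)          // requested offset and max fetchSize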

DelayedFetch.tryComplete checks whether the DelayedFetch is satisfied and, when it is, calls forceComplete to complete the operation. Meeting any one of the following four conditions is enough:

  /**
   * The operation can be completed if:
   *
   * Case A: This broker is no longer the leader for some partitions it tries to fetch
   * Case B: This broker does not know of some partitions it tries to fetch
   * Case C: The fetch offset locates not on the last segment of the log
   * Case D: The accumulated bytes from all the fetching partitions exceeds the minimum bytes
   *
   * Upon completion, should return whatever data is available for each valid partition
   */
  override def tryComplete() : Boolean = {
    var accumulatedSize = 0
    // walk the status recorded for every partition in fetchMetadata
    fetchMetadata.fetchPartitionStatus.foreach {
      case (topicAndPartition, fetchStatus) =>
        // the position where the previous read of this partition's log ended
        val fetchOffset = fetchStatus.startOffsetMetadata
        try {
          if (fetchOffset != LogOffsetMetadata.UnknownOffsetMetadata) {
            // look up the partition's leader replica; throws if this broker is not the leader
            val replica = replicaManager.getLeaderReplicaIfLocal(topicAndPartition.topic, topicAndPartition.partition)
            // choose the max readable offset based on who sent the FetchRequest: for consumers
            // endOffset is the HW, for follower replicas it is the LEO
            val endOffset =
              if (fetchMetadata.fetchOnlyCommitted)
                replica.highWatermark
              else
                replica.logEndOffset

            // Go directly to the check for Case D if the message offsets are the same. If the log segment
            // has just rolled, then the high watermark offset will remain the same but be on the old segment,
            // which would incorrectly be seen as an instance of Case C.
            // check whether endOffset has moved since the last read. If it has not, the data that was
            // insufficient before is still insufficient, so the operation stays unsatisfied; if it has,
            // continue with the checks below
            if (endOffset.messageOffset != fetchOffset.messageOffset) {
              // Case C: the offset to read from is no longer on the active segment
              if (endOffset.onOlderSegment(fetchOffset)) { // probably the log was truncated
                // Case C, this can happen when the new fetch operation is on a truncated leader
                debug("Satisfying fetch %s since it is fetching later segments of partition %s.".format(fetchMetadata, topicAndPartition))
                return forceComplete()
              } else if (fetchOffset.onOlderSegment(endOffset)) { // fetchOffset precedes endOffset, but a new active segment has rolled: fetchOffset sits on an older segment while endOffset is on the new one
                // Case C, this can happen when the fetch operation is falling behind the current segment
                // or the partition has just rolled a new segment
                debug("Satisfying fetch %s immediately since it is fetching older segments.".format(fetchMetadata))
                return forceComplete()
              } else if (fetchOffset.messageOffset < endOffset.messageOffset) {
                // we need take the partition fetch size as upper bound when accumulating the bytes
                // fetchOffset and endOffset are on the same active segment and endOffset has moved forward, so accumulate the readable bytes
                accumulatedSize += math.min(endOffset.positionDiff(fetchOffset), fetchStatus.fetchInfo.fetchSize)
              }
            }
          }
        } catch {
          case utpe: UnknownTopicOrPartitionException => // Case B: this broker cannot find a replica of the partition to read
            debug("Broker no longer know of %s, satisfy %s immediately".format(topicAndPartition, fetchMetadata))
            return forceComplete()
          case nle: NotLeaderForPartitionException =>  // Case A: partition leadership has moved to another broker
            debug("Broker is no longer the leader of %s, satisfy %s immediately".format(topicAndPartition, fetchMetadata))
            return forceComplete()
        }
    }

    // Case D: the accumulated bytes reach the minimum-bytes threshold
    if (accumulatedSize >= fetchMetadata.fetchMinBytes)
      forceComplete()
    else
      false
  }
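
Both Case C branches compare log segments rather than raw offsets, and Case D measures a byte distance. A condensed sketch of the two LogOffsetMetadata helpers involved, paraphrased from the source (validity checks omitted):

case class LogOffsetMetadata(messageOffset: Long,            // logical message offset
                             segmentBaseOffset: Long,        // base offset of the segment holding it
                             relativePositionInSegment: Int) // physical byte position within that segment
{
  // true if this offset sits on an older segment than `that`; this is the test behind both Case C branches
  def onOlderSegment(that: LogOffsetMetadata): Boolean =
    this.segmentBaseOffset < that.segmentBaseOffset

  // byte distance from `that` to this offset when both sit on the same segment; this feeds accumulatedSize
  def positionDiff(that: LogOffsetMetadata): Int =
    this.relativePositionInSegment - that.relativePositionInSegment
}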

DelayedFetch.onComplete is shown below:

  override def onComplete() {
      // re-read the data from the local log
    val logReadResults = replicaManager.readFromLocalLog(fetchMetadata.fetchOnlyLeader,
      fetchMetadata.fetchOnlyCommitted,
      fetchMetadata.fetchPartitionStatus.mapValues(status => status.fetchInfo))
    // wrap the read results
    val fetchPartitionData = logReadResults.mapValues(result =>
      FetchResponsePartitionData(result.errorCode, result.hw, result.info.messageSet))
    // invoke the response callback
    responseCallback(fetchPartitionData)
  }
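
The responseCallback wired into DelayedFetch is the sendResponseCallback that KafkaApis.handleFetchRequest creates for each FetchRequest. It down-converts messages for old clients where necessary, records outbound-byte metrics, and hands the FetchResponse to the network layer: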

// the callback for sending a fetch response
def sendResponseCallback(responsePartitionData: Map[TopicAndPartition, FetchResponsePartitionData]) {

  val convertedPartitionData =
    // Need to down-convert message when consumer only takes magic value 0.
    if (fetchRequest.versionId <= 1) {
      responsePartitionData.map { case (tp, data) =>

        // We only do down-conversion when:
        // 1. The message format version configured for the topic is using magic value > 0, and
        // 2. The message set contains message whose magic > 0
        // This is to reduce the message format conversion as much as possible. The conversion will only occur
        // when new message format is used for the topic and we see an old request.
        // Please note that if the message format is changed from a higher version back to lower version this
        // test might break because some messages in new message format can be delivered to consumers before 0.10.0.0
        // without format down conversion.
        val convertedData = if (replicaManager.getMessageFormatVersion(tp).exists(_ > Message.MagicValue_V0) &&
          !data.messages.isMagicValueInAllWrapperMessages(Message.MagicValue_V0)) {
          trace(s"Down converting message to V0 for fetch request from ${fetchRequest.clientId}")
          new FetchResponsePartitionData(data.error, data.hw, data.messages.asInstanceOf[FileMessageSet].toMessageFormat(Message.MagicValue_V0))
        } else data

        tp -> convertedData
      }
    } else responsePartitionData

  val mergedPartitionData = convertedPartitionData ++ unauthorizedPartitionData

  mergedPartitionData.foreach { case (topicAndPartition, data) =>
    if (data.error != Errors.NONE.code)
      debug(s"Fetch request with correlation id ${fetchRequest.correlationId} from client ${fetchRequest.clientId} " +
        s"on partition $topicAndPartition failed due to ${Errors.forCode(data.error).exceptionName}")
    // record the bytes out metrics only when the response is being sent
    BrokerTopicStats.getBrokerTopicStats(topicAndPartition.topic).bytesOutRate.mark(data.messages.sizeInBytes)
    BrokerTopicStats.getBrokerAllTopicsStats().bytesOutRate.mark(data.messages.sizeInBytes)
  }
  // define the fetchResponseCallback helper
  def fetchResponseCallback(delayTimeMs: Int) {
    trace(s"Sending fetch response to client ${fetchRequest.clientId} of " +
      s"${convertedPartitionData.values.map(_.messages.sizeInBytes).sum} bytes")
      // build the FetchResponse object
    val response = FetchResponse(fetchRequest.correlationId, mergedPartitionData, fetchRequest.versionId, delayTimeMs)
    // enqueue a SendAction Response wrapping the FetchResponse above onto the corresponding responseQueue in RequestChannel
    requestChannel.sendResponse(new RequestChannel.Response(request, new FetchResponseSend(request.connectionId, response)))
  }


  // When this callback is triggered, the remote API call has completed
  request.apiRemoteCompleteTimeMs = SystemTime.milliseconds

  // Do not throttle replication traffic
  if (fetchRequest.isFromFollower) {
      // send the FetchResponse right away via fetchResponseCallback, with zero throttle delay
    fetchResponseCallback(0)
  } else {
      // this path ultimately invokes fetchResponseCallback as well, possibly after a quota-imposed delay
    quotaManagers(ApiKeys.FETCH.id).recordAndMaybeThrottle(fetchRequest.clientId,
                                                           FetchResponse.responseSize(mergedPartitionData.groupBy(_._1.topic),
                                                                                      fetchRequest.versionId),
                                                           fetchResponseCallback)
  }
}
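
On the client side, two consumer settings feed this mechanism: fetch.min.bytes becomes fetchMetadata.fetchMinBytes (the Case D threshold), and fetch.max.wait.ms becomes the DelayedFetch delayMs. A minimal consumer sketch; the broker address and group id are placeholders:

import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker address
props.put(ConsumerConfig.GROUP_ID_CONFIG, "delayed-fetch-demo")      // placeholder group id
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")
// becomes fetchMinBytes on the broker: Case D completes once this many bytes accumulate
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "65536")
// becomes the DelayedFetch delayMs: the longest the broker will park this fetch
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500")

val consumer = new KafkaConsumer[String, String](props)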

The DelayedFetch workflow is as follows:
1. A follower replica or a consumer sends a FetchRequest to pull messages from some partitions.
2. The FetchRequest passes through the network and API layers to ReplicaManager, which reads the data from the log subsystem, checks whether the ISR set, HW, and so on need updating, and then tries to complete any DelayedProduce operations in delayedProducePurgatory whose conditions are now satisfied.
3. The log subsystem returns the messages it read together with related metadata, such as the offsets reached.
4. ReplicaManager wraps the FetchRequest in a DelayedFetch object and hands it to delayedFetchPurgatory for management (see the sketch after this list).
5. delayedFetchPurgatory uses a SystemTimer to track whether each DelayedFetch has expired.
6. When a producer sends a ProduceRequest that appends messages, the broker also checks whether any watching DelayedFetch is now ready to complete.
7. When a DelayedFetch executes, its callback builds a FetchResponse and appends it to RequestChannel.
8. The network layer returns the FetchResponse to the client.
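
Steps 4 through 6 correspond to just a few lines inside ReplicaManager. A condensed sketch, with method bodies and surrounding logic omitted (names follow the source of this Kafka line):

// Steps 4-5: wrap the pending fetch and hand it to the purgatory. If tryComplete succeeds
// immediately the response goes out at once; otherwise the operation is watched under one
// key per partition and timed out by the purgatory's SystemTimer.
val delayedFetch = new DelayedFetch(timeout, fetchMetadata, this, responseCallback)
val delayedFetchKeys = fetchPartitionStatus.keys.map(new TopicPartitionOperationKey(_)).toSeq
delayedFetchPurgatory.tryCompleteElseWatch(delayedFetch, delayedFetchKeys)

// Step 6: after a produce appends messages to a partition, the append path pokes the purgatory
// so that every DelayedFetch watching that partition re-runs tryComplete.
delayedFetchPurgatory.checkAndComplete(new TopicPartitionOperationKey(topicAndPartition))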
