A First Look at Apache Hudi (11) (Integration with Spark): Hudi's Markers Mechanism

Original article · last modified 2023-07-23 14:24:37

License: CC 4.0 BY-SA

Tags:

#spark #bigdata #hudi

First published 2023-07-23 10:24:06

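This article explains why, when Hudi is used through Spark DataSource V2, the Compaction operation includes a deleteMarker step. The explanation goes through the Spark task execution flow, including the run method of DataWritingSparkTask and the handling of WriterCommitMessage. To improve write efficiency while still guarding against duplicate data, Hudi writes data directly into the target directory and creates marker files at the same time. After the job finishes, invalid data files are cleaned up based on the marker files, and the marker directory is finally deleted.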

Background

In the earlier article on Hudi's Compaction operation, we saw that completeTableService performs a deleteMarker step. Why is that step there?

Analysis

Why marker files exist

This goes back to Spark DataSource V2. With DataSource V2, Hudi's file writes are driven mainly by the V2TableWriteExec class:

  sparkContext.runJob(
    rdd,
    (context: TaskContext, iter: Iterator[InternalRow]) =>
      DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator),
    rdd.partitions.indices,
    (index, result: DataWritingSparkTaskResult) => {
      val commitMessage = result.writerCommitMessage
      messages(index) = commitMessage
      totalNumRowsAccumulator.add(result.numRows)
      batchWrite.onDataWriterCommit(commitMessage)
    }
  )

The DataWritingSparkTask.run method looks like this:

      while (iter.hasNext) {
        // Count is here.
        count += 1
        dataWriter.write(iter.next())
      }

      val msg = if (useCommitCoordinator) {
        val coordinator = SparkEnv.get.outputCommitCoordinator
        val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
        if (commitAuthorized) {
          logInfo(s"Commit authorized for partition $partId (task $taskId, attempt $attemptId, " +
            s"stage $stageId.$stageAttempt)")
          dataWriter.commit()
        } else {
          val message = s"Commit denied for partition $partId (task $taskId, attempt $attemptId, " +
            s"stage $stageId.$stageAttempt)"
          logInfo(message)
          // throwing CommitDeniedException will trigger the catch block for abort
          throw new CommitDeniedException(message, stageId, partId, attemptId)
        }

      } else {
        logInfo(s"Writer for partition ${context.partitionId()} is committing.")
        dataWriter.commit()
      }

As mentioned in earlier articles, the core is the following three-step sequence:

dataWriter.write
dataWriter.commit/abort
dataWriter.close
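
To make the ordering of these three steps concrete, here is a minimal sketch in plain Java. It is not Spark code: `SketchWriter`, `RecordingWriter`, and `runTask` are hypothetical names invented for this illustration, and the sketch only assumes that abort runs when write or commit fails and that close always runs last.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class WriterLifecycleSketch {
    // Hypothetical stand-in for a data writer; this is NOT Spark's DataWriter interface.
    interface SketchWriter<T> {
        void write(T record);
        String commit();
        void abort();
        void close();
    }

    // Runs write -> commit (abort on failure) -> close and returns the commit message.
    static <T> String runTask(SketchWriter<T> writer, Iterator<T> rows) {
        try {
            while (rows.hasNext()) {
                writer.write(rows.next());
            }
            return writer.commit();
        } catch (RuntimeException e) {
            writer.abort();   // a failed attempt gets the chance to clean up
            throw e;
        } finally {
            writer.close();   // always runs, after commit or abort
        }
    }

    // A writer that records every call so the order can be inspected.
    static final class RecordingWriter implements SketchWriter<String> {
        final List<String> calls = new ArrayList<>();

        @Override public void write(String record) {
            if (record.isEmpty()) {
                throw new IllegalArgumentException("empty record");
            }
            calls.add("write");
        }
        @Override public String commit() { calls.add("commit"); return "committed"; }
        @Override public void abort() { calls.add("abort"); }
        @Override public void close() { calls.add("close"); }
    }

    public static void main(String[] args) {
        RecordingWriter ok = new RecordingWriter();
        runTask(ok, List.of("a", "b").iterator());
        System.out.println(ok.calls);   // [write, write, commit, close]

        RecordingWriter bad = new RecordingWriter();
        try {
            runTask(bad, List.of("a", "").iterator());
        } catch (IllegalArgumentException expected) {
            System.out.println(bad.calls);   // [write, abort, close]
        }
    }
}
```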

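This brings us to the dataWriter variable. In Spark's native classes, dataWriter is a SingleDirectoryDataWriter or a DynamicPartitionDataWriter.
The constructors of both classes take a committer of type FileCommitProtocol, and this committer plays an important role in the write/commit/close operations above:
during task.write, a temporary directory is created first and the data is written there;
during task.commit, the files in the temporary directory are then moved into the directory that is the real write target.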
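
The stage-then-move idea can be sketched with plain `java.nio.file` calls. This is only an illustration under assumptions: the class name, the `_temporary/attempt_0` layout, and the file name are made up here, and the exact staging layout in Spark depends on the committer in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TempDirCommitSketch {
    // "write" phase: the attempt writes into its own staging directory.
    static Path writeToStaging(Path stagingDir, String fileName, String content) {
        try {
            Files.createDirectories(stagingDir);
            return Files.write(stagingDir.resolve(fileName), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "commit" phase: the staged file is moved into the destination directory.
    static Path commit(Path staged, Path destDir) {
        try {
            Files.createDirectories(destDir);
            return Files.move(staged, destDir.resolve(staged.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the file is invisible in the destination before commit and visible after.
    static boolean demo() {
        try {
            Path base = Files.createTempDirectory("commit-sketch");
            Path dest = base.resolve("out");
            Path staged = writeToStaging(
                    base.resolve("_temporary").resolve("attempt_0"), "part-0.parquet", "rows");
            boolean hiddenBeforeCommit = !Files.exists(dest.resolve("part-0.parquet"));
            Path finalFile = commit(staged, dest);
            return hiddenBeforeCommit && Files.exists(finalFile) && !Files.exists(staged);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The cost of this scheme is the extra move at commit time, which is cheap on HDFS but can turn into a full copy on object stores.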
In Hudi, by contrast, the dataWriter is a HoodieBulkInsertDataInternalWriter:

  // In the constructor:
  this.bulkInsertWriterHelper = new BulkInsertDataInternalWriterHelper(hoodieTable,
      writeConfig, instantTime, taskPartitionId, taskId, 0, structType, populateMetaFields, arePartitionRecordsSorted);

  @Override
  public void write(InternalRow record) throws IOException {
    bulkInsertWriterHelper.write(record);
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    return new HoodieWriterCommitMessage(bulkInsertWriterHelper.getWriteStatuses());
  }

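The actual writing is done by BulkInsertDataInternalWriterHelper, which writes straight into the real target directory rather than into a temporary one.
Why do it this way, and what are the advantages and disadvantages?
Advantage: data is written directly into the destination directory, so no second copy is needed and writes are more efficient.
Disadvantage: when Spark runs speculative tasks, several attempts of the same task can write the same data into different files, leaving duplicate and therefore incorrect data.
This is why Hudi introduced the markers mechanism.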
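
The duplication problem can be shown with a toy model. The sketch below uses an in-memory map as a stand-in for the table directory, and the per-attempt file names are hypothetical; it only illustrates that two attempts writing directly to the destination double the visible rows.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DirectWriteDuplicateSketch {
    // In-memory stand-in for the table directory: file name -> rows in that file.
    static Map<String, List<String>> writeDirectly(List<String> rows, int attempts) {
        Map<String, List<String>> tableDir = new TreeMap<>();
        for (int attempt = 0; attempt < attempts; attempt++) {
            // Each attempt uses its own (hypothetical) file name, so nothing is overwritten.
            tableDir.put("f1_attempt" + attempt + ".parquet", rows);
        }
        return tableDir;
    }

    // What a reader would see if every file in the directory were treated as valid.
    static long visibleRowCount(Map<String, List<String>> tableDir) {
        return tableDir.values().stream().mapToLong(List::size).sum();
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r1", "r2", "r3");
        System.out.println(visibleRowCount(writeDirectly(rows, 1))); // 3
        System.out.println(visibleRowCount(writeDirectly(rows, 2))); // 6: every row appears twice
    }
}
```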

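When marker files are created

While the real data file is being written, a marker file is created under the .hoodie/.temp/instantTime directory, for example .hoodie/.temp/202307237055/f1.parquet.marker.CREATE.
The marker file is written in the constructor of HoodieRowCreateHandle: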

  HoodiePartitionMetadata partitionMetadata =
      new HoodiePartitionMetadata(
          fs,
          instantTime,
          new Path(writeConfig.getBasePath()),
          FSUtils.getPartitionPath(writeConfig.getBasePath(), partitionPath),
          table.getPartitionMetafileFormat());
  partitionMetadata.trySave(taskPartitionId);

  createMarkerFile(partitionPath, fileName, instantTime, table, writeConfig);

This HoodieRowCreateHandle is invoked from the BulkInsertDataInternalWriterHelper.write method.
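
As a rough illustration of the naming scheme, the sketch below builds a marker path from its parts. It is not Hudi's implementation: the class, method, and enum are hypothetical, the base path `/tmp/tbl` is made up, and placing the partition path between the instant directory and the file name is an assumption about the layout for partitioned tables.

```java
final class MarkerPathSketch {
    // Hypothetical mirror of the I/O type suffix seen in marker file names.
    enum IoType { CREATE, MERGE, APPEND }

    // Builds <basePath>/.hoodie/.temp/<instantTime>/[<partitionPath>/]<dataFileName>.marker.<type>
    static String markerPath(String basePath, String instantTime, String partitionPath,
                             String dataFileName, IoType ioType) {
        String markerDir = basePath + "/.hoodie/.temp/" + instantTime;
        // Assumption: for a partitioned table the marker mirrors the partition path.
        String dir = partitionPath.isEmpty() ? markerDir : markerDir + "/" + partitionPath;
        return dir + "/" + dataFileName + ".marker." + ioType;
    }

    public static void main(String[] args) {
        // Prints /tmp/tbl/.hoodie/.temp/202307237055/f1.parquet.marker.CREATE
        System.out.println(markerPath("/tmp/tbl", "202307237055", "", "f1.parquet", IoType.CREATE));
    }
}
```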

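When invalid data files are cleaned up

Because marker files exist, invalid data files have to be cleaned up once writing is done (this happens after the job has finished running). The cleanup is triggered from the batchWrite.commit call in V2TableWriteExec, that is, HoodieDataSourceInternalBatchWrite.commit: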

@Override
  public void commit(WriterCommitMessage[] messages) {
    List<HoodieWriteStat> writeStatList = Arrays.stream(messages).map(m -> (HoodieWriterCommitMessage) m)
        .flatMap(m -> m.getWriteStatuses().stream().map(HoodieInternalWriteStatus::getStat)).collect(Collectors.toList());
    dataSourceInternalWriterHelper.commit(writeStatList);
  }

The call chain is as follows:

HoodieDataSourceInternalBatchWrite.commit
      ||
      \/
dataSourceInternalWriterHelper.commit
      ||
      \/
SparkRDDWriteClient.commitStats
      ||
      \/
SparkRDDWriteClient.commit
      ||
      \/
SparkRDDWriteClient.finalizeWrite
      ||
      \/
HoodieTable.finalizeWrite
      ||
      \/
HoodieTable.reconcileAgainstMarkers
      ||
      \/
HoodieTable.getInvalidDataPaths
      ||
      \/
markers.createdAndMergedDataPaths

The reconcileAgainstMarkers method deletes invalid data files based on the marker files.
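
The reconciliation step can be pictured as a set difference: every data path that has a marker, minus the data paths reported in the committed write stats. The sketch below is a simplified illustration, not Hudi's code; the class and method names and the sample file names are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class MarkerReconcileSketch {
    // Strips the ".marker.<TYPE>" suffix to recover the data file path a marker stands for.
    static String dataPathOf(String markerName) {
        int idx = markerName.lastIndexOf(".marker.");
        if (idx < 0) {
            throw new IllegalArgumentException("not a marker file: " + markerName);
        }
        return markerName.substring(0, idx);
    }

    // Invalid files = files that some attempt started writing (marker exists)
    // but that are not among the files accepted at commit time.
    static Set<String> invalidDataPaths(List<String> markerNames, List<String> committedPaths) {
        Set<String> invalid = markerNames.stream()
                .map(MarkerReconcileSketch::dataPathOf)
                .collect(Collectors.toCollection(TreeSet::new));
        invalid.removeAll(committedPaths);
        return invalid;
    }

    public static void main(String[] args) {
        List<String> markers = List.of(
                "p1/f1_attempt0.parquet.marker.CREATE",
                "p1/f1_attempt1.parquet.marker.CREATE");
        List<String> committed = List.of("p1/f1_attempt1.parquet");
        System.out.println(invalidDataPaths(markers, committed)); // [p1/f1_attempt0.parquet]
    }
}
```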
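One point to note:
Although the executors may have written several files containing the same data, only one of them is accepted by the driver. Comparing the data files recorded by the marker files with the files the driver finally accepted (marker paths minus accepted paths) therefore identifies the abandoned files that must be deleted. The code that asks the driver whether a commit is accepted lives in DataWritingSparkTask: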

  // useCommitCoordinator is true by default
  val msg = if (useCommitCoordinator) {
    val coordinator = SparkEnv.get.outputCommitCoordinator
    val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
    if (commitAuthorized) {
      logInfo(s"Commit authorized for partition $partId (task $taskId, attempt $attemptId, " +
        s"stage $stageId.$stageAttempt)")
      dataWriter.commit()
    // ... (remaining branches are the same as in the full DataWritingSparkTask.run excerpt earlier)
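
The arbitration behind canCommit can be pictured as "the first attempt to ask for a partition wins". The sketch below is a deliberately simplified model with hypothetical names, not Spark's OutputCommitCoordinator: it ignores stages and does not release the authorization when the winning attempt later fails.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class CommitCoordinatorSketch {
    // partition id -> attempt id that has been authorized to commit
    private final ConcurrentMap<Integer, Integer> authorized = new ConcurrentHashMap<>();

    // The first attempt asking for a partition is authorized; later attempts are denied.
    boolean canCommit(int partitionId, int attemptId) {
        Integer winner = authorized.putIfAbsent(partitionId, attemptId);
        return winner == null || winner == attemptId;
    }

    public static void main(String[] args) {
        CommitCoordinatorSketch coordinator = new CommitCoordinatorSketch();
        System.out.println(coordinator.canCommit(0, 0)); // true: first attempt for partition 0
        System.out.println(coordinator.canCommit(0, 1)); // false: speculative attempt is denied
        System.out.println(coordinator.canCommit(1, 1)); // true: a different partition
    }
}
```

In the Hudi case the denied attempt has already written its data file into the table directory, which is exactly the leftover that the markers later expose.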

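When the marker directory is cleaned up

Once a job has completed, the set of files that were really written is known, so the marker directory no longer serves much purpose and should be removed.
Several call chains clean up markers; for example, SparkRDDWriteClient.commitStats includes one: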

SparkRDDWriteClient.commitStats
      ||
      \/
SparkRDDWriteClient.postCommit
      ||
      \/
WriteMarkers.quietDeleteMarkerDir

quietDeleteMarkerDir deletes the marker directory directly.
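
A "quiet" directory delete can be sketched with the JDK alone. This is not Hudi's implementation: the class and method names are hypothetical, it works on the local file system only, and it assumes that "quiet" means failures are reported through the return value instead of an exception.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class QuietDeleteSketch {
    // Recursively deletes dir; returns false instead of throwing if anything goes wrong.
    static boolean quietDeleteDir(Path dir) {
        if (!Files.exists(dir)) {
            return true; // nothing to clean up
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            return true;
        } catch (IOException | UncheckedIOException e) {
            return false; // swallowed on purpose: cleanup must not fail the commit
        }
    }

    // Creates a small marker-like directory tree, deletes it, and reports whether it is gone.
    static boolean demo() {
        try {
            Path base = Files.createTempDirectory("marker-sketch");
            Path markerDir = base.resolve(".temp").resolve("202307237055");
            Files.createDirectories(markerDir);
            Files.createFile(markerDir.resolve("f1.parquet.marker.CREATE"));
            return quietDeleteDir(markerDir) && !Files.exists(markerDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```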

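For more on Hudi markers, see the article "Apache Hudi内核之文件标记机制深入解析" (an in-depth analysis of the file marker mechanism in the Apache Hudi kernel).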