RocksDB Compaction Flow

Original article · first published 2025-08-09 09:22:44 · last modified 2025-08-09 09:25:33 · CC 4.0 BY-SA

Tags: #database-architecture #c++ #database-development

Part of the column "RocksDB Source Code Analysis"

Introduction

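When it comes to compaction, a few questions naturally come to mind:

- Each column family has its own LSM tree. For a given cfd, when does it need compaction?
- If more than one level needs compaction at that moment, how is the level chosen?
- Once the start level is chosen, how are the participating SSTs picked?
- How are KV pairs read from the chosen SSTs, merge-sorted, and written into new SSTs?

This article follows these questions in order.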

Recall the earlier analysis of the flush flow: in DBImpl::FlushMemTableToOutputFile, once the flush has run (FlushJob::Run), DBImpl::InstallSuperVersionAndScheduleWork is called to update the super version.

// db/db_impl/db_impl_compaction_flush.cc

Status DBImpl::FlushMemTableToOutputFile(
    ColumnFamilyData* cfd, const MutableCFOptions& mutable_cf_options,
    bool* made_progress, JobContext* job_context, FlushReason flush_reason,
    SuperVersionContext* superversion_context,
    std::vector<SequenceNumber>& snapshot_seqs,
    SequenceNumber earliest_write_conflict_snapshot,
    SnapshotChecker* snapshot_checker, LogBuffer* log_buffer,
    Env::Priority thread_pri) {
  mutex_.AssertHeld();  // the mutex is still held

  // ...
  // create the flush job
  FlushJob flush_job(
      dbname_, cfd, immutable_db_options_, mutable_cf_options, ...);

  //...
  if (s.ok()) {
    // run the flush job
    s = flush_job.Run(&logs_with_prep_tracker_, &file_meta,
                      &switched_to_mempurge, &skip_set_bg_error,
                      &error_handler_);
    need_cancel = false;
  }

  // ...
  if (s.ok()) {
    // install the new super version and schedule any pending compaction
    InstallSuperVersionAndScheduleWork(cfd, superversion_context,
                                       mutable_cf_options);
    // ...
  }

  // ...
  return s;
}

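Besides updating the super version, this function has another important job: triggering and scheduling any compaction that may now be needed. As flushes proceed, the number of SSTs in L0 may reach the compaction threshold, which is controlled by ColumnFamilyOptions::level0_file_num_compaction_trigger. The function is called not only after a flush but also after a compaction, so a finished compaction can schedule a new one. One reason is that a compaction may push the next level over its size target, so that level must be compacted in turn; this is known as cascading compaction. In short, when a flush or a compaction pushes a level over its limit, the resulting new compaction can originate from DBImpl::InstallSuperVersionAndScheduleWork. Note the wording "can": a compaction may also be triggered manually.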
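To make the cascading effect concrete, here is a minimal sketch under strong simplifying assumptions: each level is reduced to a byte count and a byte target, and one compaction moves exactly the excess bytes down one level. Real compactions move whole files and merge overlapping key ranges, so ToyCascade and its parameters are purely illustrative names, not RocksDB APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model (hypothetical, not RocksDB code): bytes[i] is the size of level i,
// targets[i] is its size target. Returns how many "compactions" were needed.
int ToyCascade(std::vector<std::uint64_t>& bytes,
               const std::vector<std::uint64_t>& targets) {
  assert(bytes.size() == targets.size());
  int compactions = 0;
  // The last level is never a source level, so stop one short of it.
  for (std::size_t i = 0; i + 1 < bytes.size(); ++i) {
    if (bytes[i] > targets[i]) {
      const std::uint64_t excess = bytes[i] - targets[i];
      bytes[i] -= excess;
      bytes[i + 1] += excess;  // may push the next level over its own target
      ++compactions;
    }
  }
  return compactions;
}
```

With sizes {300, 2550, 0} and targets {256, 2560, 25600}, the first move raises the second level to 2594, above its target of 2560, so a second move follows immediately. That follow-up is the cascade.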

Note again that in this flow the function is called from the thread pool: the super version is updated on a flush or compaction thread, not on the main thread.

Function call sequence

DBImpl::InstallSuperVersionAndScheduleWork

// db/db_impl/db_impl_compaction_flush.cc

void DBImpl::InstallSuperVersionAndScheduleWork(
    ColumnFamilyData* cfd, SuperVersionContext* sv_context,
    const MutableCFOptions& mutable_cf_options) {
  // the mutex must be held
  mutex_.AssertHeld();

  // Update max_total_in_memory_state_
  size_t old_memtable_size = 0;
  auto* old_sv = cfd->GetSuperVersion();
  if (old_sv) {
    old_memtable_size = old_sv->mutable_cf_options.write_buffer_size *
                        old_sv->mutable_cf_options.max_write_buffer_number;
  }

  // this branch is unlikely to step in
  if (UNLIKELY(sv_context->new_superversion == nullptr)) {
    sv_context->NewSuperVersion();
  }
  cfd->InstallSuperVersion(sv_context, mutable_cf_options);

  // There may be a small data race here. The snapshot tricking bottommost
  // compaction may already be released here. But assuming there will always be
  // newer snapshot created and released frequently, the compaction will be
  // triggered soon anyway.
  // (author's note: I do not understand this part)
  // ...

  // Whenever we install new SuperVersion, we might need to issue new flushes or
  // compactions.
  SchedulePendingCompaction(cfd);
  MaybeScheduleFlushOrCompaction();

  // Update max_total_in_memory_state_
  max_total_in_memory_state_ = max_total_in_memory_state_ - old_memtable_size +
                               mutable_cf_options.write_buffer_size *
                                   mutable_cf_options.max_write_buffer_number;
}

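The logic of this code is:

1. Assert that the mutex is held.
2. Call ColumnFamilyData::InstallSuperVersion to install a new super version for the given cfd.
3. Call DBImpl::SchedulePendingCompaction to check whether the given cfd needs compaction; if so, add it to DBImpl::compaction_queue_.
4. Call DBImpl::MaybeScheduleFlushOrCompaction to schedule the new compaction request, which closes the loop. This call may also schedule a new flush.

A reminder: the functions that follow all run while holding DBImpl::mutex_. How the super version is updated will be covered in a separate article.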

Deciding whether compaction is needed

DBImpl::SchedulePendingCompaction

// db/db_impl/db_impl_compaction_flush.cc
 
void DBImpl::SchedulePendingCompaction(ColumnFamilyData* cfd) {
  mutex_.AssertHeld();
  if (reject_new_background_jobs_) {
    return;
  }
  if (!cfd->queued_for_compaction() && cfd->NeedsCompaction()) {
    AddToCompactionQueue(cfd);
    ++unscheduled_compactions_;
  }
}

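This function does three things:

1. It calls ColumnFamilyData::queued_for_compaction to check whether the column family is already in DBImpl::compaction_queue_, which avoids adding it twice.
2. It calls ColumnFamilyData::NeedsCompaction to check whether the column family needs compaction.
3. If the column family is not yet queued and does need compaction, it calls DBImpl::AddToCompactionQueue to add it to DBImpl::compaction_queue_ and increments DBImpl::unscheduled_compactions_.

Judging by its name, this function looks like the counterpart of DBImpl::SchedulePendingFlush, but it is not. DBImpl::compaction_queue_ holds column families, whereas DBImpl::flush_queue_ holds FlushRequest objects, and one FlushRequest can cover several column families.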
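The enqueue-at-most-once behaviour can be sketched in isolation. The following is a toy model, assuming a column family can be reduced to two flags; ToyColumnFamily and ToyScheduler are hypothetical stand-ins and do not reproduce the real ColumnFamilyData or DBImpl classes.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for ColumnFamilyData: only the two facts the
// scheduler looks at are modeled.
struct ToyColumnFamily {
  bool needs_compaction = false;       // would come from the picker's score check
  bool queued_for_compaction = false;  // true while the CF sits in the queue
};

// Hypothetical stand-in for the scheduling state kept in DBImpl.
struct ToyScheduler {
  std::deque<ToyColumnFamily*> compaction_queue;
  int unscheduled_compactions = 0;
  bool reject_new_background_jobs = false;

  // Same shape as SchedulePendingCompaction: a column family is enqueued
  // only if it is not already queued and it needs compaction.
  // Returns true if the column family was newly enqueued.
  bool SchedulePendingCompaction(ToyColumnFamily* cf) {
    assert(cf != nullptr);
    if (reject_new_background_jobs) return false;
    if (cf->queued_for_compaction || !cf->needs_compaction) return false;
    cf->queued_for_compaction = true;
    compaction_queue.push_back(cf);
    ++unscheduled_compactions;
    return true;
  }
};
```

Calling SchedulePendingCompaction twice on the same column family enqueues it once; the second call is a no-op because the queued flag is already set.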

Whether a column family needs compaction is decided by ColumnFamilyData::NeedsCompaction.

ColumnFamilyData::NeedsCompaction

// db/column_family.cc

bool ColumnFamilyData::NeedsCompaction() const {
  return !mutable_cf_options_.disable_auto_compactions &&
         compaction_picker_->NeedsCompaction(current_->storage_info());
}

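This actually calls CompactionPicker::NeedsCompaction through a pointer, and that is a pure virtual function. RocksDB ships with three compaction styles (level, universal, and FIFO), so the function has three implementations in RocksDB. We start with the default, the level style.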

LevelCompactionPicker::NeedsCompaction

// db/compaction/compaction_picker_level.cc

bool LevelCompactionPicker::NeedsCompaction(
    const VersionStorageInfo* vstorage) const {
  if (!vstorage->ExpiredTtlFiles().empty()) {
    return true;
  }
  if (!vstorage->FilesMarkedForPeriodicCompaction().empty()) {
    return true;
  }
  if (!vstorage->BottommostFilesMarkedForCompaction().empty()) {
    return true;
  }
  if (!vstorage->FilesMarkedForCompaction().empty()) {
    return true;
  }
  if (!vstorage->FilesMarkedForForcedBlobGC().empty()) {
    return true;
  }
  for (int i = 0; i <= vstorage->MaxInputLevel(); i++) {
    if (vstorage->CompactionScore(i) >= 1) {
      return true;
    }
  }
  return false;
}

Ignore the if statements for now and focus on the for loop, which calls VersionStorageInfo::CompactionScore.

VersionStorageInfo::CompactionScore

// db/version_set.h

class VersionStorageInfo {
  // ...

  // Return idx'th highest score
  double CompactionScore(int idx) const { return compaction_score_[idx]; }
  
  // ...
};

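As the comment says, the for loop walks the scores from highest to lowest, and compaction is needed as soon as the largest score is at least 1. So VersionStorageInfo::compaction_score_ is an array sorted in descending order. It is filled in by VersionStorageInfo::ComputeCompactionScore.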

VersionStorageInfo::ComputeCompactionScore

// db/version_set.cc

void VersionStorageInfo::ComputeCompactionScore(
    const ImmutableOptions& immutable_options,
    const MutableCFOptions& mutable_cf_options) {
  double total_downcompact_bytes = 0.0;
  const double kScoreScale = 10.0;
  int max_output_level = MaxOutputLevel(immutable_options.allow_ingest_behind);
  for (int level = 0; level <= MaxInputLevel(); level++) {
    double score;
    if (level == 0) {
      int num_sorted_runs = 0;
      uint64_t total_size = 0;
      for (auto* f : files_[level]) {
        total_downcompact_bytes += static_cast<double>(f->fd.GetFileSize());
        if (!f->being_compacted) {
          total_size += f->compensated_file_size;
          num_sorted_runs++;
        }
      }

      if (compaction_style_ == kCompactionStyleUniversal) {
        // ...
      }

      if (compaction_style_ == kCompactionStyleFIFO) {
        // ...
      } else {
        score = static_cast<double>(num_sorted_runs) /
                mutable_cf_options.level0_file_num_compaction_trigger;
        if (compaction_style_ == kCompactionStyleLevel && num_levels() > 1) {
          if (immutable_options.level_compaction_dynamic_level_bytes) {
            if (total_size >= mutable_cf_options.max_bytes_for_level_base) {
              // When calculating estimated_compaction_needed_bytes, we assume
              // L0 is qualified as pending compactions. We will need to make
              // sure that it qualifies for compaction.
              // It might be guaranteed by logic below anyway, but we are
              // explicit here to make sure we don't stop writes with no
              // compaction scheduled.
              score = std::max(score, 1.01);
            }
            if (total_size > level_max_bytes_[base_level_]) {
              // In this case, we compare L0 size with actual LBase size and
              // make sure score is more than 1.0 (10.0 after scaled) if L0 is
              // larger than LBase. Since LBase score = LBase size /
              // (target size + total_downcompact_bytes) where
              // total_downcompact_bytes = total_size > LBase size,
              // LBase score is lower than 10.0. So L0->LBase is prioritized
              // over LBase -> LBase+1.
              uint64_t base_level_size = 0;
              for (auto f : files_[base_level_]) {
                base_level_size += f->compensated_file_size;
              }
              score = std::max(score, static_cast<double>(total_size) /
                                          static_cast<double>(std::max(
                                              base_level_size,
                                              level_max_bytes_[base_level_])));
            }
            if (score > 1.0) {
              score *= kScoreScale;
            }
          } else {
            score = std::max(score,
                             static_cast<double>(total_size) /
                                 mutable_cf_options.max_bytes_for_level_base);
          }
        }
      }
    } else {  // level > 0
      // Compute the ratio of current size to size limit.
      uint64_t level_bytes_no_compacting = 0;
      uint64_t level_total_bytes = 0;
      for (auto f : files_[level]) {
        level_total_bytes += f->fd.GetFileSize();
        if (!f->being_compacted) {
          level_bytes_no_compacting += f->compensated_file_size;
        }
      }
      if (!immutable_options.level_compaction_dynamic_level_bytes) {
        score = static_cast<double>(level_bytes_no_compacting) /
                MaxBytesForLevel(level);
      } else {
        if (level_bytes_no_compacting < MaxBytesForLevel(level)) {
          score = static_cast<double>(level_bytes_no_compacting) /
                  MaxBytesForLevel(level);
        } else {
          // If there are a large mount of data being compacted down to the
          // current level soon, we would de-prioritize compaction from
          // a level where the incoming data would be a large ratio. We do
          // it by dividing level size not by target level size, but
          // the target size and the incoming compaction bytes.
          score = static_cast<double>(level_bytes_no_compacting) /
                  (MaxBytesForLevel(level) + total_downcompact_bytes) *
                  kScoreScale;
        }
        // Drain unnecessary levels, but with lower priority compared to
        // when L0 is eligible. Only non-empty levels can be unnecessary.
        // If there is no unnecessary levels, lowest_unnecessary_level_ = -1.
        if (level_bytes_no_compacting > 0 &&
            level <= lowest_unnecessary_level_) {
          score = std::max(
              score, kScoreScale *
                         (1.001 + 0.001 * (lowest_unnecessary_level_ - level)));
        }
      }
      if (level <= lowest_unnecessary_level_) {
        total_downcompact_bytes += level_total_bytes;
      } else if (level_total_bytes > MaxBytesForLevel(level)) {
        total_downcompact_bytes +=
            static_cast<double>(level_total_bytes - MaxBytesForLevel(level));
      }
    }
    compaction_level_[level] = level;
    compaction_score_[level] = score;
  }

  // sort all the levels based on their score. Higher scores get listed
  // first. Use bubble sort because the number of entries are small.
  for (int i = 0; i < num_levels() - 2; i++) {
    for (int j = i + 1; j < num_levels() - 1; j++) {
      if (compaction_score_[i] < compaction_score_[j]) {
        double score = compaction_score_[i];
        int level = compaction_level_[i];
        compaction_score_[i] = compaction_score_[j];
        compaction_level_[i] = compaction_level_[j];
        compaction_score_[j] = score;
        compaction_level_[j] = level;
      }
    }
  }
  
  // ...
}
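To make the scoring rule easier to follow, here is a minimal sketch of the simplest path only: level style with level_compaction_dynamic_level_bytes disabled, no files currently being compacted, and plain byte counts instead of compensated file sizes. ComputeToyScores, ToyNeedsCompaction, and ToyLevelScore are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyLevelScore {
  int level;
  double score;
};

// level_bytes[i] and level_targets[i] describe level i + 1 (L1, L2, ...).
// The last level of the tree is never an input level and is left out.
std::vector<ToyLevelScore> ComputeToyScores(
    int l0_file_count, std::uint64_t l0_bytes, int l0_trigger,
    std::uint64_t max_bytes_for_level_base,
    const std::vector<std::uint64_t>& level_bytes,
    const std::vector<std::uint64_t>& level_targets) {
  assert(l0_trigger > 0 && max_bytes_for_level_base > 0);
  assert(level_bytes.size() == level_targets.size());
  std::vector<ToyLevelScore> scores;
  // L0: the larger of the file-count ratio and the size ratio.
  const double l0_score = std::max(
      static_cast<double>(l0_file_count) / l0_trigger,
      static_cast<double>(l0_bytes) /
          static_cast<double>(max_bytes_for_level_base));
  scores.push_back({0, l0_score});
  // L1 and below: current size divided by the size target.
  for (std::size_t i = 0; i < level_bytes.size(); ++i) {
    assert(level_targets[i] > 0);
    scores.push_back({static_cast<int>(i) + 1,
                      static_cast<double>(level_bytes[i]) /
                          static_cast<double>(level_targets[i])});
  }
  // Highest score first, keeping each score paired with its level.
  std::stable_sort(scores.begin(), scores.end(),
                   [](const ToyLevelScore& a, const ToyLevelScore& b) {
                     return a.score > b.score;
                   });
  return scores;
}

// With a descending array, checking the first entry is enough.
bool ToyNeedsCompaction(const std::vector<ToyLevelScore>& sorted) {
  return !sorted.empty() && sorted.front().score >= 1.0;
}
```

For example, with 2 files in L0 and a trigger of 4, L0 scores 0.5, while an L1 of 384 bytes against a target of 256 scores 1.5. After sorting, L1 comes first and the column family is reported as needing compaction.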

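At this point the reader probably has a question: when is this score-updating function called?

From the scoring rules we can infer that a flush or a compaction must update the scores, since the number of SSTs in each level changes. That is indeed the case. Recall the earlier analysis of the flush flow: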

// db/flush_job.cc

Status FlushJob::Run(LogsWithPrepTracker* prep_tracker, FileMetaData* file_meta,
                     bool* switched_to_mempurge, bool* skipped_since_bg_error,
                     ErrorHandler* error_handler) {

  // ...
  if (mempurge_s.ok()) {
    base_->Unref();
    s = Status::OK();
  } else {
    // write the immutable memtable (imm) to disk
    s = WriteLevel0Table();
  }
  // ...
  if (!s.ok()) {
    cfd_->imm()->RollbackMemtableFlush(
        mems_, /*rollback_succeeding_memtables=*/!db_options_.atomic_flush);
  } else if (write_manifest_) {
    if (/*...*/) {
    // ...
    } else {
      // the normal branch
      TEST_SYNC_POINT("FlushJob::InstallResults");
      // Replace immutable memtable with the generated Table

      // install the new version and recompute the scores (explained in the compaction part)
      s = cfd_->imm()->TryInstallMemtableFlushResults(
              cfd_, mutable_cf_options_, mems_, prep_tracker, versions_, db_mutex_,
              meta_.fd.GetNumber(), &job_context_->memtables_to_free, db_directory_,
              log_buffer_, &committed_flush_jobs_info_,
              !(mempurge_s.ok()));
    }
  }
  // ...
}

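We left a teaser back then: after flushing imm, the flush calls MemTableList::TryInstallMemtableFlushResults, which both updates the version and recomputes the scores. The call chain is as follows (it is not analyzed in detail here):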

FlushJob::Run
  MemTableList::TryInstallMemtableFlushResults
    VersionSet::LogAndApply
      VersionSet::LogAndApply (overload)
        VersionSet::ProcessManifestWrites
          VersionSet::AppendVersion
            VersionStorageInfo::ComputeCompactionScore

The scores are recomputed whenever a new version is appended:

// db/version_set.cc

void VersionSet::AppendVersion(ColumnFamilyData* column_family_data,
                               Version* v) {
  // compute new compaction score
  v->storage_info()->ComputeCompactionScore(
      *column_family_data->ioptions(),
      *column_family_data->GetLatestMutableCFOptions());

  // ...
}

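This sets a whole cycle in motion. For a fresh database, each flush writes imm into $L_0$, first updates the $L_0$ score (only this level exists at that point), and then runs DBImpl::InstallSuperVersionAndScheduleWork, which uses the score to check whether compaction is needed. The $L_0$ score rises with every flush. Eventually one call to InstallSuperVersionAndScheduleWork finds that $L_0$ needs compaction, and the first compaction is born.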
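As a small worked example, the sketch below counts flushes until the $L_0$ score first reaches 1. It assumes that every flush adds exactly one $L_0$ file, that no compaction removes files in the meantime, and that only the file-count term of the score matters; FlushesUntilFirstCompaction is a hypothetical helper, not a RocksDB function.

```cpp
#include <cassert>

// Returns the 1-based flush number at which the L0 score
// (file count / trigger) first reaches 1.0 in this toy model.
int FlushesUntilFirstCompaction(int level0_file_num_compaction_trigger) {
  assert(level0_file_num_compaction_trigger > 0);
  int l0_files = 0;
  while (true) {
    ++l0_files;  // one flush adds one L0 file
    const double score =
        static_cast<double>(l0_files) / level0_file_num_compaction_trigger;
    if (score >= 1.0) {
      return l0_files;  // this flush is the one that schedules a compaction
    }
  }
}
```

With a trigger of 4, the scores after each flush are 0.25, 0.5, 0.75, and 1.0, so the fourth flush is the one that leads to the first compaction.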

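In addition, the compaction flow also updates the scores (shown later), so when it runs DBImpl::InstallSuperVersionAndScheduleWork a new compaction may result.

Summary

RocksDB uses scores to decide whether a cfd needs compaction. A quick recap of the call flow so far: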

DBImpl::InstallSuperVersionAndScheduleWork
  DBImpl::SchedulePendingCompaction
    ColumnFamilyData::queued_for_compaction
    ColumnFamilyData::NeedsCompaction
      CompactionPicker::NeedsCompaction ===> LevelCompactionPicker::NeedsCompaction
        VersionStorageInfo::CompactionScore

Having seen how RocksDB decides whether a cfd needs compaction, we return to DBImpl::InstallSuperVersionAndScheduleWork. The next step is to schedule the compaction task:

// db/db_impl/db_impl_compaction_flush.cc

void DBImpl::InstallSuperVersionAndScheduleWork(
    ColumnFamilyData* cfd, SuperVersionContext* sv_context,
    const MutableCFOptions& mutable_cf_options) {
  // the mutex must be held
  mutex_.AssertHeld();
  // ...

  SchedulePendingCompaction(cfd);
  MaybeScheduleFlushOrCompaction();

  // ...
}

That step is DBImpl::MaybeScheduleFlushOrCompaction. We analyzed its flush part earlier; now we look at the compaction part.

DBImpl::MaybeScheduleFlushOrCompaction

// db/db_impl/db_impl_compaction_flush.cc

void DBImpl::MaybeScheduleFlushOrCompaction() {
  mutex_.AssertHeld();
  // ...
  // "take" tasks from the compaction queue and run them
  while (bg_compaction_scheduled_ + bg_bottom_compaction_scheduled_ <
             bg_job_limits.max_compactions &&
         unscheduled_compactions_ > 0) {
    CompactionArg* ca = new CompactionArg;
    ca->db = this;
    ca->compaction_pri_ = Env::Priority::LOW;
    ca->prepicked_compaction = nullptr;
    bg_compaction_scheduled_++;
    unscheduled_compactions_--;
    env_->Schedule(&DBImpl::BGWorkCompaction, ca, Env::Priority::LOW, this,
                   &DBImpl::UnscheduleCompactionCallback);
  }
}

As with flush, DBImpl::compaction_queue_ is not popped here. Everything after this point runs in the thread pool.
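The loop condition is easy to check in isolation. Below is a toy version that only tracks the counters; ToyBgState and ScheduleCompactions are hypothetical names, and the call into the thread pool is replaced by a counter increment.

```cpp
#include <cassert>

// Hypothetical subset of the counters DBImpl keeps under its mutex.
struct ToyBgState {
  int bg_compaction_scheduled = 0;
  int bg_bottom_compaction_scheduled = 0;
  int unscheduled_compactions = 0;
};

// Hands out background tasks while there is both a free slot and pending
// work. Returns how many tasks were submitted in this call.
int ScheduleCompactions(ToyBgState& st, int max_compactions) {
  assert(max_compactions >= 0);
  int submitted = 0;
  while (st.bg_compaction_scheduled + st.bg_bottom_compaction_scheduled <
             max_compactions &&
         st.unscheduled_compactions > 0) {
    ++st.bg_compaction_scheduled;
    --st.unscheduled_compactions;
    ++submitted;  // the real code hands a task to the thread pool here
  }
  return submitted;
}
```

With 5 pending requests and a limit of 2, only 2 tasks are submitted and 3 requests stay pending until a running compaction finishes and calls the scheduler again.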

DBImpl::BGWorkCompaction

// db/db_impl/db_impl_compaction_flush.cc

void DBImpl::BGWorkCompaction(void* arg) {
  CompactionArg ca = *(reinterpret_cast<CompactionArg*>(arg));  // copy-constructs ca
  delete reinterpret_cast<CompactionArg*>(arg);
  IOSTATS_SET_THREAD_POOL_ID(Env::Priority::LOW);
  TEST_SYNC_POINT("DBImpl::BGWorkCompaction");
  auto prepicked_compaction =
      static_cast<PrepickedCompaction*>(ca.prepicked_compaction);
  static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction(
      prepicked_compaction, Env::Priority::LOW);
  delete prepicked_compaction;
}

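A reminder: from this function on, everything runs in the thread pool.

DBImpl::BGWorkCompaction is short: it copies the argument, frees the CompactionArg* ca that was allocated with new above, reads the copied fields (prepicked_compaction is nullptr), and then calls DBImpl::BackgroundCallCompaction.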
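The copy-then-delete handoff through a void* argument is a common pattern for C-style callbacks. Here is a minimal sketch of it; ToyArg, ToyWorkEntry, and ToySchedule are hypothetical, and the "scheduler" simply calls the function synchronously instead of using a thread pool.

```cpp
#include <cassert>

// Hypothetical argument block, heap-allocated by the scheduling side.
struct ToyArg {
  int db_id;
  int priority;
};

using ToyCallback = int (*)(void*);

// Entry point in the style of BGWorkCompaction: copy the argument to the
// stack, free the heap allocation right away, then work on the copy.
int ToyWorkEntry(void* arg) {
  assert(arg != nullptr);
  ToyArg copy = *static_cast<ToyArg*>(arg);
  delete static_cast<ToyArg*>(arg);  // ownership ends here
  return copy.db_id * 10 + copy.priority;
}

// Stand-in for a thread pool: runs the callback immediately.
int ToySchedule(ToyCallback fn, void* arg) { return fn(arg); }
```

A call such as ToySchedule(&ToyWorkEntry, new ToyArg{7, 1}) transfers ownership of the allocation to the callback. Freeing it at the top of the entry point means no later early return can leak it.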

DBImpl::BackgroundCallCompaction

// db/db_impl/db_impl_compaction_flush.cc

void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction,
                                      Env::Priority bg_thread_pri) {
  // prepicked_compaction is nullptr in this flow
  // bg_thread_pri is Env::Priority::LOW
  
  // ...
  
  {
    // the mutex must be held
    InstrumentedMutexLock l(&mutex_);

    num_running_compactions_++;

    std::unique_ptr<std::list<uint64_t>::iterator>
        pending_outputs_inserted_elem(new std::list<uint64_t>::iterator(
            CaptureCurrentFileNumberInPendingOutputs()));

    assert((bg_thread_pri == Env::Priority::BOTTOM &&
            bg_bottom_compaction_scheduled_) ||
           (bg_thread_pri == Env::Priority::LOW && bg_compaction_scheduled_));
    Status s = BackgroundCompaction(&made_progress, &job_context, &log_buffer,
                                    prepicked_compaction, bg_thread_pri);
    TEST_SYNC_POINT("BackgroundCallCompaction:1");
    if (s.IsBusy()) {
      bg_cv_.SignalAll();  // In case a waiter can proceed despite the error
      mutex_.Unlock();
      immutable_db_options_.clock->SleepForMicroseconds(
          10000);  // prevent hot loop
      mutex_.Lock();
    } else if (!s.ok() && !s.IsShutdownInProgress() &&
               !s.IsManualCompactionPaused() && !s.IsColumnFamilyDropped()) {
      // Wait a little bit before retrying background compaction in
      // case this is an environmental problem and we do not want to
      // chew up resources for failed compactions for the duration of
      // the problem.
      
      // BackgroundCompaction may return Status::CompactionTooLarge,
      // which takes this branch
      // ...
    } else if (s.IsManualCompactionPaused()) {
      // ...
    }

    ReleaseFileNumberFromPendingOutputs(pending_outputs_inserted_elem);

    // If compaction failed, we want to delete all temporary files that we
    // might have created (they might not be all recorded in job_context in
    // case of a failure). Thus, we force full scan in FindObsoleteFiles()
    FindObsoleteFiles(&job_context, !s.ok() && !s.IsShutdownInProgress() &&
                                        !s.IsManualCompactionPaused() &&
                                        !s.IsColumnFamilyDropped() &&
                                        !s.IsBusy());
    TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:FoundObsoleteFiles");

    // delete unnecessary files if any, this is done outside the mutex
    if (job_context.HaveSomethingToClean() ||
        job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
      // ...
    }

    assert(num_running_compactions_ > 0);
    num_running_compactions_--;

    if (bg_thread_pri == Env::Priority::LOW) {
      // this flow takes this branch
      bg_compaction_scheduled_--;
    } else {
      assert(bg_thread_pri == Env::Priority::BOTTOM);
      bg_bottom_compaction_scheduled_--;
    }

    // See if there's more work to be done
    MaybeScheduleFlushOrCompaction();

    // prepicked_compaction is nullptr in this flow
    if (prepicked_compaction != nullptr &&
        prepicked_compaction->task_token != nullptr) {
      // ...
    }

    // ...
  }
}

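DBImpl::BackgroundCompaction is then called with the following arguments:

| Parameter | Argument | Value |
| --- | --- | --- |
| `made_progress` | `&made_progress` | `false` |
| `prepicked_compaction` | `prepicked_compaction` | `nullptr` |
| `thread_pri` | `bg_thread_pri` | `Env::Priority::LOW` |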

DBImpl::BackgroundCompaction

// db/db_impl/db_impl_compaction_flush.cc

Status DBImpl::BackgroundCompaction(bool* made_progress,
                                    JobContext* job_context,
                                    LogBuffer* log_buffer,
                                    PrepickedCompaction* prepicked_compaction,
                                    Env::Priority thread_pri) {
  // prepicked_compaction is nullptr

  // nullptr in this flow
  ManualCompactionState* manual_compaction =
      prepicked_compaction == nullptr
          ? nullptr
          : prepicked_compaction->manual_compaction_state;
  *made_progress = false;
  mutex_.AssertHeld();
  TEST_SYNC_POINT("DBImpl::BackgroundCompaction:Start");

  const ReadOptions read_options(Env::IOActivity::kCompaction);
  const WriteOptions write_options(Env::IOActivity::kCompaction);

  // false in this flow
  bool is_manual = (manual_compaction != nullptr); 
  std::unique_ptr<Compaction> c;
  if (prepicked_compaction != nullptr &&
      prepicked_compaction->compaction != nullptr) {
    c.reset(prepicked_compaction->compaction);
  }
  // false in this flow
  bool is_prepicked = is_manual || c; 

  // (manual_compaction->in_progress == false);
  bool trivial_move_disallowed =
      is_manual && manual_compaction->disallow_trivial_move;

  // because prepicked_compaction is nullptr:
  // manual_compaction=nullptr
  // is_manual=false
  // c=nullptr
  // is_prepicked=false
  // trivial_move_disallowed=false

  CompactionJobStats compaction_job_stats;
  Status status;
  
  // ...

  TEST_SYNC_POINT("DBImpl::BackgroundCompaction:InProgress");

  // token used to throttle compactions
  std::unique_ptr<TaskLimiterToken> task_token;

  bool sfm_reserved_compact_space = false;
  if (is_manual) {
    // false in this flow, skipped ...
  } else if (!is_prepicked && !compaction_queue_.empty()) {
    if (HasExclusiveManualCompaction()) {
      // Can't compact right now, but try again later
      TEST_SYNC_POINT("DBImpl::BackgroundCompaction()::Conflict");

      // Stay in the compaction queue.
      unscheduled_compactions_++;

      return Status::OK();
    }

    // take the column family that needs compaction out of compaction_queue_
    auto cfd = PickCompactionFromQueue(&task_token, log_buffer);

    // ...
    
    auto* mutable_cf_options = cfd->GetLatestMutableCFOptions();
    if (!mutable_cf_options->disable_auto_compactions && !cfd->IsDropped()) {
      // ...
  
      // pick the files to compact and assign the result to c
      c.reset(cfd->PickCompaction(*mutable_cf_options, mutable_db_options_,
                                  log_buffer));

      TEST_SYNC_POINT("DBImpl::BackgroundCompaction():AfterPickCompaction");

      if (c != nullptr) {
        bool enough_room = EnoughRoomForCompaction(
            cfd, *(c->inputs()), &sfm_reserved_compact_space, log_buffer);

        if (!enough_room) {
          // Then don't do the compaction
          // ...
          // Don't need to sleep here, because BackgroundCallCompaction
          // will sleep if !s.ok()
          status = Status::CompactionTooLarge();
        } else {
          // update statistics
          // ...

          // once these files are picked, the level scores change;
          // check whether this column family still needs compaction,
          // and if so add it to compaction_queue_
          // and call MaybeScheduleFlushOrCompaction
      
          // ...
        }
      }
    }
  }

  IOStatus io_s;
  bool compaction_released = false;
  
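  // there are several kinds of compaction:
  // - deletion compaction: the inputs can simply be deleted
  // - trivial move compaction: the input files are moved to the next level as is
  // - forwarded compaction: handed over to the bottom-priority pool
  // - regular compaction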
  if (!c) {
    // Nothing to do
    ROCKS_LOG_BUFFER(log_buffer, "Compaction nothing to do");
  } else if (c->deletion_compaction()) {
    // ...
  } else if (!trivial_move_disallowed && c->IsTrivialMove()) {
    // ...
  } else if (!is_prepicked && c->output_level() > 0 && /*...*/) {
    // Forward compactions involving last level to the bottom pool if it exists,
    // such that compactions unlikely to contribute to write stalls can be
    // delayed or deprioritized.
    // ...
  } else {
    // ...

    // create the CompactionJob
    CompactionJob compaction_job(
        job_context->job_id, c.get(), immutable_db_options_,
        mutable_db_options_, file_options_for_compaction_, versions_.get(),
        &shutting_down_, log_buffer, directories_.GetDbDir(),
        GetDataDir(c->column_family_data(), c->output_path_id()),
        GetDataDir(c->column_family_data(), 0), stats_, &mutex_,
        &error_handler_, snapshot_seqs, earliest_write_conflict_snapshot,
        snapshot_checker, job_context, table_cache_, &event_logger_,
        c->mutable_cf_options()->paranoid_file_checks,
        c->mutable_cf_options()->report_bg_io_stats, dbname_,
        &compaction_job_stats, thread_pri, io_tracer_,
        is_manual ? manual_compaction->canceled
                  : kManualCompactionCanceledFalse_,
        db_id_, db_session_id_, c->column_family_data()->GetFullHistoryTsLow(),
        c->trim_ts(), &blob_callback_, &bg_compaction_scheduled_,
        &bg_bottom_compaction_scheduled_);
  
    // unlike flush, compaction needs to call CompactionJob::Prepare
    compaction_job.Prepare();

    NotifyOnCompactionBegin(c->column_family_data(), c.get(), status,
                            compaction_job_stats, job_context->job_id);
  
    // for compaction, the mutex is released right here
    mutex_.Unlock();
    TEST_SYNC_POINT_CALLBACK(
        "DBImpl::BackgroundCompaction:NonTrivial:BeforeRun", nullptr);
    // Should handle error?
  
    // run the compaction job
    compaction_job.Run().PermitUncheckedError();
    TEST_SYNC_POINT("DBImpl::BackgroundCompaction:NonTrivial:AfterRun");

    // reacquire the mutex
    mutex_.Lock();

    status =
        compaction_job.Install(*c->mutable_cf_options(), &compaction_released);
    io_s = compaction_job.io_status();
    if (status.ok()) {
      InstallSuperVersionAndScheduleWork(c->column_family_data(),
                                         &job_context->superversion_contexts[0],
                                         *c->mutable_cf_options());
    }
    *made_progress = true;
    TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:AfterCompaction",
                             c->column_family_data());
  }

  if (status.ok() && !io_s.ok()) {
    status = io_s;
  } else {
    io_s.PermitUncheckedError();
  }

  if (c != nullptr) {
    if (!compaction_released) {
      c->ReleaseCompactionFiles(status);
    } else {
      // ...
    }

    *made_progress = true;

    // Need to make sure SstFileManager does its bookkeeping
    // DBImpl::FlushMemTableToOutputFile has a similar step
    auto sfm = static_cast<SstFileManagerImpl*>(
        immutable_db_options_.sst_file_manager.get());
    if (sfm && sfm_reserved_compact_space) {
      sfm->OnCompactionCompletion(c.get());
    }

    NotifyOnCompactionCompleted(c->column_family_data(), c.get(), status,
                                compaction_job_stats, job_context->job_id);
  }

  if (status.ok() || status.IsCompactionTooLarge() ||
      status.IsManualCompactionPaused()) {
    // a compaction without problems takes this branch
    // Done
  } else if (status.IsColumnFamilyDropped() || status.IsShutdownInProgress()) {
    // Ignore compaction errors found during shutting down
  } else {
    // other errors take this branch
  }
  // this will unref its input_version and column_family_data
  c.reset();

  // ...
  TEST_SYNC_POINT("DBImpl::BackgroundCompaction:Finish");
  return status;
}

The core logic of this function is:

1. Take a column family that needs compaction from DBImpl::compaction_queue_.
2. Call ColumnFamilyData::PickCompaction to choose the files to compact.
3. Build a CompactionJob and run it.
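One detail of the excerpt above deserves a closer look: the mutex is released while the job runs and reacquired before the result is installed. The sketch below shows that pattern with standard library types only; ToyDb and RunJob are hypothetical, and the "job" is a trivial computation standing in for the long-running merge.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct ToyDb {
  std::mutex mu;
  std::vector<int> installed;  // shared state, protected by mu

  // The caller must already hold `lock` on mu, mirroring AssertHeld().
  void RunJob(std::unique_lock<std::mutex>& lock, int input) {
    assert(lock.owns_lock());
    lock.unlock();                  // long-running work happens without the mutex
    const int output = input * 2;   // placeholder for the actual merge work
    lock.lock();                    // reacquire before touching shared state
    installed.push_back(output);    // "install" the result under the mutex
  }
};
```

The caller enters and leaves RunJob holding the lock, so code before and after the call can keep assuming the mutex is held. Other threads can make progress in between.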

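Here is the key point: so far the flow has only chosen the column family that needs compaction. How are the level and the SST files determined? ColumnFamilyData::PickCompaction handles that. Note that the whole SST picking process runs under the mutex.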

Choosing the level and the SST files

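The official wiki page "Choose Level Compaction Files" summarizes in detail how compaction chooses the level and the SST files. Reading it first is recommended.

Next we follow this process in the source code.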

ColumnFamilyData::PickCompaction

// db/column_family.cc

Compaction* ColumnFamilyData::PickCompaction(
    const MutableCFOptions& mutable_options,
    const MutableDBOptions& mutable_db_options, LogBuffer* log_buffer) {
  auto* result = compaction_picker_->PickCompaction(
      GetName(), mutable_options, mutable_db_options, current_->storage_info(),
      log_buffer);
  if (result != nullptr) {
    result->FinalizeInputInfo(current_);
  }
  return result;
}

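As shown, each column family has its own CompactionPicker, which handles all compaction picking for that column family. CompactionPicker::PickCompaction is a pure virtual function, for the simple reason that there is more than one compaction style. We are analyzing the level style, so we look at its implementation: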

LevelCompactionPicker::PickCompaction

// db/compaction/compaction_picker_level.cc

Compaction* LevelCompactionPicker::PickCompaction(
    const std::string& cf_name, const MutableCFOptions& mutable_cf_options,
    const MutableDBOptions& mutable_db_options, VersionStorageInfo* vstorage,
    LogBuffer* log_buffer) {
  LevelCompactionBuilder builder(cf_name, vstorage, this, log_buffer,
                                 mutable_cf_options, ioptions_,
                                 mutable_db_options);
  return builder.PickCompaction();
}

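Each call constructs a new LevelCompactionBuilder and then calls LevelCompactionBuilder::PickCompaction. Why build a new builder every time? Because the vstorage passed in differs from call to call. And since the this pointer is passed in, LevelCompactionBuilder can also reach the picker held in ColumnFamilyData::compaction_picker_.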

LevelCompactionBuilder::PickCompaction

// db/compaction/compaction_picker_level.cc

Compaction* LevelCompactionBuilder::PickCompaction() {
  // Pick up the first file to start compaction. It may have been extended
  // to a clean cut.
  SetupInitialFiles();
  if (start_level_inputs_.empty()) {
    return nullptr;
  }
  assert(start_level_ >= 0 && output_level_ >= 0);

  // If it is a L0 -> base level compaction, we need to set up other L0
  // files if needed.
  if (!SetupOtherL0FilesIfNeeded()) {
    return nullptr;
  }

  // Pick files in the output level and expand more files in the start level
  // if needed.
  if (!SetupOtherInputsIfNeeded()) {
    return nullptr;
  }

  // Form a compaction object containing the files we picked.
  Compaction* c = GetCompaction();

  TEST_SYNC_POINT_CALLBACK("LevelCompactionPicker::PickCompaction:Return", c);

  return c;
}

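As the comments say, the logic of this function is clear. Suppose the compaction starts from level $L_i$ and outputs to level $L_o$:

1. Call LevelCompactionBuilder::SetupInitialFiles to choose the SSTs in $L_i$; they are stored in start_level_inputs_.
2. If start_level_inputs_ is empty, return nullptr, meaning no compaction can be formed.
3. If $L_i$ is $L_0$, call LevelCompactionBuilder::SetupOtherL0FilesIfNeeded to pull in other SSTs, because SSTs in $L_0$ overlap.
4. Call LevelCompactionBuilder::SetupOtherInputsIfNeeded to choose the SSTs in $L_o$, expanding the start-level inputs if needed.
5. Call LevelCompactionBuilder::GetCompaction to build a Compaction containing these SSTs.

In fact, at this point we do not yet know which levels $L_i$ and $L_o$ are. Choosing the SSTs depends on choosing the levels, and both are done by LevelCompactionBuilder::SetupInitialFiles.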

LevelCompactionBuilder::SetupInitialFiles

// db/compaction/compaction_picker_level.cc

void LevelCompactionBuilder::SetupInitialFiles() {
  // Find the compactions by size on all levels.
  bool skipped_l0_to_base = false;
  for (int i = 0; i < compaction_picker_->NumberLevels() - 1; i++) {
    start_level_score_ = vstorage_->CompactionScore(i);
    start_level_ = vstorage_->CompactionScoreLevel(i);
    assert(i == 0 || start_level_score_ <= vstorage_->CompactionScore(i - 1));
    if (start_level_score_ >= 1) {
      if (skipped_l0_to_base && start_level_ == vstorage_->base_level()) {
        // If L0->base_level compaction is pending, don't schedule further
        // compaction from base level. Otherwise L0->base_level compaction
        // may starve.
        continue;
      }
      output_level_ =
          (start_level_ == 0) ? vstorage_->base_level() : start_level_ + 1;
      bool picked_file_to_compact = PickFileToCompact();
      TEST_SYNC_POINT_CALLBACK("PostPickFileToCompact",
                               &picked_file_to_compact);
      if (picked_file_to_compact) {
        // found the compaction!
        if (start_level_ == 0) {
          // L0 score = `num L0 files` / `level0_file_num_compaction_trigger`
          compaction_reason_ = CompactionReason::kLevelL0FilesNum;
        } else {
          // L1+ score = `Level files size` / `MaxBytesForLevel`
          compaction_reason_ = CompactionReason::kLevelMaxLevelSize;
        }
        break;
      } else {
        // didn't find the compaction, clear the inputs
        start_level_inputs_.clear();
        if (start_level_ == 0) {
          skipped_l0_to_base = true;
          // L0->base_level may be blocked due to ongoing L0->base_level
          // compactions. It may also be blocked by an ongoing compaction from
          // base_level downwards.
          //
          // In these cases, to reduce L0 file count and thus reduce likelihood
          // of write stalls, we can attempt compacting a span of files within
          // L0.
          if (PickIntraL0Compaction()) {
            output_level_ = 0;
            compaction_reason_ = CompactionReason::kLevelL0FilesNum;
            break;
          }
        }
      }
    } else {
      // Compaction scores are sorted in descending order, no further scores
      // will be >= 1.
      break;
    }
  }
  if (!start_level_inputs_.empty()) {
    return;
  }

  // if we didn't find a compaction, check if there are any files marked for
  // compaction
  // other cases are ignored for now
}

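The overall logic of this code is:

Iterate over the levels in descending order of score
- The input level $L_i$ is start_level_ and its score is start_level_score_. If start_level_score_ is less than 1, break immediately
- Determine the output level $L_o$ (output_level_): if $L_i$ is $L_0$ , then $L_o$ is base_level(); otherwise it is the next level $L_{i+1}$
- Call LevelCompactionBuilder::PickFileToCompact to pick the input files start_level_inputs_:
  - If it returns false, no SST could be picked. When the input level is $L_0$ , call LevelCompactionBuilder::PickIntraL0Compaction to attempt a compaction within $L_0$
  - If it returns true, SSTs were picked and the job is done, break

So compaction picks the level with the highest score (at least 1) from which input files can actually be chosen. This answers the question raised in the introduction: which level to choose when more than one level needs compaction.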
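
The level choice described above can be sketched in isolation. This is a simplified illustration, not RocksDB code: `LevelScore`, `LevelChoice`, `ChooseLevels` and the `can_pick` callback are hypothetical names, `can_pick` stands in for PickFileToCompact, and the intra-L0 fallback and the `skipped_l0_to_base` rule are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-level scores kept by VersionStorageInfo.
struct LevelScore {
  int level;
  double score;
};

// Result of the choice; start_level == -1 means "no compaction found".
struct LevelChoice {
  int start_level;
  int output_level;
};

// Visits levels in descending score order and returns the first level whose
// score is at least 1 and for which can_pick(level) succeeds.
template <typename CanPickFn>
LevelChoice ChooseLevels(std::vector<LevelScore> scores, int base_level,
                         CanPickFn can_pick) {
  std::stable_sort(scores.begin(), scores.end(),
                   [](const LevelScore& a, const LevelScore& b) {
                     return a.score > b.score;
                   });
  for (const LevelScore& s : scores) {
    if (s.score < 1.0) {
      break;  // scores are sorted, so no later level can qualify
    }
    if (!can_pick(s.level)) {
      continue;  // e.g. every candidate file is already being compacted
    }
    LevelChoice choice;
    choice.start_level = s.level;
    // L0 outputs to the base level, any other level to the next one.
    choice.output_level = (s.level == 0) ? base_level : s.level + 1;
    return choice;
  }
  LevelChoice none;
  none.start_level = -1;
  none.output_level = -1;
  return none;
}
```

With scores {L0: 0.5, L1: 1.2, L2: 3.0}, this sketch picks L2 with output L3; if files cannot be picked from L2, it falls through to L1 with output L2.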

LevelCompactionBuilder::PickFileToCompact

// db/compaction/compaction_picker_level.cc

bool LevelCompactionBuilder::PickFileToCompact() {
  // level 0 files are overlapping. So we cannot pick more
  // than one concurrent compactions at this level. This
  // could be made better by looking at key-ranges that are
  // being compacted at level 0.
  if (start_level_ == 0 &&
      !compaction_picker_->level0_compactions_in_progress()->empty()) {
    TEST_SYNC_POINT("LevelCompactionPicker::PickCompactionBySize:0");
    return false;
  }

  start_level_inputs_.clear();
  start_level_inputs_.level = start_level_;

  assert(start_level_ >= 0);

  if (TryPickL0TrivialMove()) {
    return true;
  }

  const std::vector<FileMetaData*>& level_files =
      vstorage_->LevelFiles(start_level_);

  // Pick the file with the highest score in this level that is not already
  // being compacted.
  const std::vector<int>& file_scores =
      vstorage_->FilesByCompactionPri(start_level_);

  unsigned int cmp_idx;
  for (cmp_idx = vstorage_->NextCompactionIndex(start_level_);
       cmp_idx < file_scores.size(); cmp_idx++) {
    int index = file_scores[cmp_idx];
    auto* f = level_files[index];

    // do not pick a file to compact if it is being compacted
    // from n-1 level.
    if (f->being_compacted) {
      if (ioptions_.compaction_pri == kRoundRobin) {
        // TODO(zichen): this file may be involved in one compaction from
        // an upper level, cannot advance the cursor for round-robin policy.
        // Currently, we do not pick any file to compact in this case. We
        // should fix this later to ensure a compaction is picked but the
        // cursor shall not be advanced.
        return false;
      }
      continue;
    }

    start_level_inputs_.files.push_back(f);
    if (!compaction_picker_->ExpandInputsToCleanCut(cf_name_, vstorage_,
                                                    &start_level_inputs_) ||
        compaction_picker_->FilesRangeOverlapWithCompaction(
            {start_level_inputs_}, output_level_,
            Compaction::EvaluatePenultimateLevel(
                vstorage_, ioptions_, start_level_, output_level_))) {
      // A locked (pending compaction) input-level file was pulled in due to
      // user-key overlap.
      start_level_inputs_.clear();

      if (ioptions_.compaction_pri == kRoundRobin) {
        return false;
      }
      continue;
    }

    // Now that input level is fully expanded, we check whether any output
    // files are locked due to pending compaction.
    //
    // Note we rely on ExpandInputsToCleanCut() to tell us whether any output-
    // level files are locked, not just the extra ones pulled in for user-key
    // overlap.
    InternalKey smallest, largest;
    compaction_picker_->GetRange(start_level_inputs_, &smallest, &largest);
    CompactionInputFiles output_level_inputs;
    output_level_inputs.level = output_level_;
    vstorage_->GetOverlappingInputs(output_level_, &smallest, &largest,
                                    &output_level_inputs.files);
    if (output_level_inputs.empty()) {
      if (start_level_ > 0 &&
          TryExtendNonL0TrivialMove(index,
                                    ioptions_.compaction_pri ==
                                        kRoundRobin /* only_expand_right */)) {
        break;
      }
    } else {
      if (!compaction_picker_->ExpandInputsToCleanCut(cf_name_, vstorage_,
                                                      &output_level_inputs)) {
        start_level_inputs_.clear();
        if (ioptions_.compaction_pri == kRoundRobin) {
          return false;
        }
        continue;
      }
    }

    base_index_ = index;
    break;
  }

  // store where to start the iteration in the next call to PickCompaction
  if (ioptions_.compaction_pri != kRoundRobin) {
    vstorage_->SetNextCompactionIndex(start_level_, cmp_idx);
  }
  return start_level_inputs_.size() > 0;
}

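The logic of this code is:

If $L_i$ is $L_0$ and an $L_0$ compaction (for example $L_0 \to \text{base\_level}$ ) is already in progress, stop picking. $L_0$ compactions cannot run concurrently (this point is open to question)
If TryPickL0TrivialMove succeeds, return true right away
In $L_i$ , examine the SSTs in descending order of compaction priority; the for loop only needs to pick one:
- If the SST is being compacted, continue with the next one (under kRoundRobin, return false instead)
- Put it into start_level_inputs_
- Call CompactionPicker::ExpandInputsToCleanCut to perform a clean cut on it, pulling more adjacent SSTs from $L_i$ into start_level_inputs_. If any of those SSTs is being compacted, clear start_level_inputs_ and continue
- Call CompactionPicker::FilesRangeOverlapWithCompaction to check start_level_inputs_ after the clean cut. If its range overlaps a compaction in progress, clear start_level_inputs_ and continue
- Find all SSTs in $L_o$ that overlap start_level_inputs_ and put them into output_level_inputs
- Call CompactionPicker::ExpandInputsToCleanCut on these SSTs, pulling more adjacent SSTs from $L_o$ into output_level_inputs. If any of those SSTs is being compacted, clear start_level_inputs_ and continue
- Record the index of the SST in base_index_ and break
If start_level_inputs_ contains any SST, return true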
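
The clean cut step in the list above can be illustrated with a toy model. This is a rough sketch, not the RocksDB implementation: `SstFile` and `ExpandToCleanCut` are hypothetical names. It assumes the files of a level are sorted and non-overlapping except that neighbours may share a boundary user key, and it widens the picked index range until no neighbour shares a boundary user key with it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified file metadata (user keys only).
struct SstFile {
  std::string smallest_user_key;
  std::string largest_user_key;
  bool being_compacted;
};

// Widens the inclusive index range [*first, *last] while a neighbouring file
// shares a boundary user key with the range. Returns false if any file in the
// widened range is already being compacted.
bool ExpandToCleanCut(const std::vector<SstFile>& level_files,
                      std::size_t* first, std::size_t* last) {
  assert(*first <= *last && *last < level_files.size());
  while (*first > 0 && level_files[*first - 1].largest_user_key ==
                           level_files[*first].smallest_user_key) {
    --*first;  // left neighbour ends on the key our range starts with
  }
  while (*last + 1 < level_files.size() &&
         level_files[*last + 1].smallest_user_key ==
             level_files[*last].largest_user_key) {
    ++*last;  // right neighbour starts on the key our range ends with
  }
  for (std::size_t i = *first; i <= *last; ++i) {
    if (level_files[i].being_compacted) {
      return false;
    }
  }
  return true;
}
```

For files [a,c] [c,f] [f,k] [m,p], starting from the second file alone, the range grows to cover the first three files, because versions of user keys c and f may live on both sides of those boundaries.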

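A word on what a clean cut is: neighbouring files in one level may share a user key at their boundary, so the picked set is widened until no file outside it shares a boundary user key with a file inside it. Now back to LevelCompactionBuilder::PickCompaction. Note that output_level_inputs is only a local variable; in other words, the SSTs of $L_o$ are not actually selected here.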

// db/compaction/compaction_picker_level.cc

Compaction* LevelCompactionBuilder::PickCompaction() {
  SetupInitialFiles();
  if (start_level_inputs_.empty()) {
    return nullptr;
  }
  assert(start_level_ >= 0 && output_level_ >= 0);

  // If it is a L0 -> base level compaction, we need to set up other L0
  // files if needed.
  if (!SetupOtherL0FilesIfNeeded()) {
    return nullptr;
  }
  
  // ...

  return c;
}

We have now determined $L_i$ and picked SSTs from it, so the next step is LevelCompactionBuilder::SetupOtherL0FilesIfNeeded.

LevelCompactionBuilder::SetupOtherL0FilesIfNeeded

// db/compaction/compaction_picker_level.cc

bool LevelCompactionBuilder::SetupOtherL0FilesIfNeeded() {
  if (start_level_ == 0 && output_level_ != 0 && !is_l0_trivial_move_) {
    return compaction_picker_->GetOverlappingL0Files(
        vstorage_, &start_level_inputs_, output_level_, &parent_index_);
  }
  return true;
}

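This function calls CompactionPicker::GetOverlappingL0Files to expand start_level_inputs_. Along the way it checks whether the expanded set conflicts with SSTs that are being compacted, and returns false if it does.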
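
The expansion over $L_0$ can be pictured as growing a key range until it stops changing. The following is a simplified sketch under that assumption, not the RocksDB code: `L0File` and `ExpandOverlappingL0` are hypothetical names, keys are plain strings, and the in-progress check is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical L0 file described only by its key range.
struct L0File {
  std::string smallest;
  std::string largest;
};

// Returns the indices of all L0 files that overlap the seed file directly or
// transitively. Each newly added file may widen the range, so the scan is
// repeated until nothing new is added.
std::vector<std::size_t> ExpandOverlappingL0(
    const std::vector<L0File>& l0_files, std::size_t seed) {
  std::string lo = l0_files[seed].smallest;
  std::string hi = l0_files[seed].largest;
  std::vector<bool> picked(l0_files.size(), false);
  picked[seed] = true;
  bool grew = true;
  while (grew) {
    grew = false;
    for (std::size_t i = 0; i < l0_files.size(); ++i) {
      if (picked[i]) {
        continue;
      }
      const L0File& f = l0_files[i];
      if (f.largest < lo || f.smallest > hi) {
        continue;  // no overlap with the current range
      }
      picked[i] = true;
      if (f.smallest < lo) lo = f.smallest;
      if (f.largest > hi) hi = f.largest;
      grew = true;
    }
  }
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < picked.size(); ++i) {
    if (picked[i]) {
      result.push_back(i);
    }
  }
  return result;
}
```

For L0 files [a,d] [c,h] [g,k] [x,z], seeding with the first file pulls in the second (overlap on c..d) and then the third (overlap on g..h), while [x,z] stays out.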

Back to LevelCompactionBuilder::PickCompaction once more:

Compaction* LevelCompactionBuilder::PickCompaction() {
  // ...

  // If it is a L0 -> base level compaction, we need to set up other L0
  // files if needed.
  if (!SetupOtherL0FilesIfNeeded()) {
    return nullptr;
  }

  // Pick files in the output level and expand more files in the start level
  // if needed.
  if (!SetupOtherInputsIfNeeded()) {
    return nullptr;
  }
  
  // ...

  return c;
}

The next step is LevelCompactionBuilder::SetupOtherInputsIfNeeded.

LevelCompactionBuilder::SetupOtherInputsIfNeeded

Not analyzed for now.

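At this point $L_i$ and $L_o$ are settled. The inputs from $L_i$ , start_level_inputs_, are placed into compaction_inputs_, followed by the inputs from $L_o$ , output_level_inputs_, so every SST taking part in this compaction is now in compaction_inputs_. Finally, LevelCompactionBuilder::PickCompaction calls LevelCompactionBuilder::GetCompaction to move these resources into a Compaction and returns it.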

// db/compaction/compaction_picker_level.cc

Compaction* LevelCompactionPicker::PickCompaction(
    const std::string& cf_name, const MutableCFOptions& mutable_cf_options,
    const MutableDBOptions& mutable_db_options, VersionStorageInfo* vstorage,
    LogBuffer* log_buffer) {
  LevelCompactionBuilder builder(cf_name, vstorage, this, log_buffer,
                                 mutable_cf_options, ioptions_,
                                 mutable_db_options);
  return builder.PickCompaction();
}

LevelCompactionPicker::PickCompaction returns this Compaction pointer to its caller, ColumnFamilyData::PickCompaction:

// db/column_family.cc

Compaction* ColumnFamilyData::PickCompaction(
    const MutableCFOptions& mutable_options,
    const MutableDBOptions& mutable_db_options, LogBuffer* log_buffer) {
  auto* result = compaction_picker_->PickCompaction(
      GetName(), mutable_options, mutable_db_options, current_->storage_info(),
      log_buffer);
  if (result != nullptr) {
    result->FinalizeInputInfo(current_);
  }
  return result;
}

which finally returns it to DBImpl::BackgroundCompaction:

// db/db_impl/db_impl_compaction_flush.cc

Status DBImpl::BackgroundCompaction(bool* made_progress,
                                    JobContext* job_context,
                                    LogBuffer* log_buffer,
                                    PrepickedCompaction* prepicked_compaction,
                                    Env::Priority thread_pri) {
  // ...

  auto cfd = PickCompactionFromQueue(&task_token, log_buffer);
  // ...
  if (!mutable_cf_options->disable_auto_compactions && !cfd->IsDropped()) {
    // ...
    c.reset(cfd->PickCompaction(*mutable_cf_options, mutable_db_options_,
                                log_buffer));
  }
  // ...
}

Summary

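Because choosing the level and picking SSTs both happen while holding the lock, the same SST cannot be picked by two compaction jobs at once, which gives a first guarantee that the view of SSTs stays consistent. For the level, the one with the highest score (which must be at least 1) is chosen; for the SSTs, the highest-priority SST is picked first and then expanded by a clean cut. Here is the call flow so far: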

DBImpl::InstallSuperVersionAndScheduleWork
  DBImpl::SchedulePendingCompaction # decide whether the cfd needs compaction
    ColumnFamilyData::queued_for_compaction
    ColumnFamilyData::NeedsCompaction
      CompactionPicker::NeedsCompaction ===> LevelCompactionPicker::NeedsCompaction
        VersionStorageInfo::CompactionScore

  DBImpl::MaybeScheduleFlushOrCompaction
    DBImpl::BGWorkCompaction
      DBImpl::BackgroundCallCompaction
        DBImpl::BackgroundCompaction
          ColumnFamilyData::PickCompaction ===> LevelCompactionPicker::PickCompaction
            LevelCompactionBuilder::PickCompaction
              LevelCompactionBuilder::SetupInitialFiles # choose the levels and the initial files
              LevelCompactionBuilder::SetupOtherL0FilesIfNeeded
              LevelCompactionBuilder::SetupOtherInputsIfNeeded
              LevelCompactionBuilder::GetCompaction # package the selected SSTs into a Compaction object
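
The locking argument from the summary can be shown with a toy picker. This is only a sketch under the assumption that a file is flagged as being compacted inside the same critical section in which it is picked; `ToyFilePicker`, `PickOne` and `Release` are hypothetical names, not RocksDB APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Toy model: one flag per file, guarded by a single mutex.
class ToyFilePicker {
 public:
  explicit ToyFilePicker(std::size_t num_files)
      : being_compacted_(num_files, false) {}

  // Picks the first free file and marks it while still holding the lock.
  // Returns its index, or -1 if every file is already taken.
  int PickOne() {
    std::lock_guard<std::mutex> lock(mutex_);
    for (std::size_t i = 0; i < being_compacted_.size(); ++i) {
      if (!being_compacted_[i]) {
        being_compacted_[i] = true;
        return static_cast<int>(i);
      }
    }
    return -1;
  }

  // Clears the flag once the compaction that owned the file has finished.
  void Release(int index) {
    std::lock_guard<std::mutex> lock(mutex_);
    being_compacted_[static_cast<std::size_t>(index)] = false;
  }

 private:
  std::mutex mutex_;
  std::vector<bool> being_compacted_;
};
```

Since checking and setting the flag happen under one lock, two callers can never receive the same index, which mirrors why one SST cannot end up in two compaction jobs.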

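Next, we return to DBImpl::BackgroundCompaction. Once the CompactionJob has been created, its Prepare method is called. This differs from flush: FlushJob has no method of that name. The reason is that RocksDB can divide one compaction into several sub-tasks and run them concurrently on multiple threads, and that division is exactly what CompactionJob::Prepare does.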
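
To make the idea of splitting concrete, here is a minimal sketch of cutting a key space into disjoint ranges, one per sub-task. It is an illustration only and not what CompactionJob::Prepare actually does: `KeyRange` and `SplitByBoundaries` are hypothetical names, and the boundary keys are assumed to be sorted and distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical key range for one sub-task; a missing bound means "unbounded".
struct KeyRange {
  bool has_start;
  std::string start;  // inclusive, valid only when has_start is true
  bool has_end;
  std::string end;    // exclusive, valid only when has_end is true
};

// Turns n sorted boundary keys into n + 1 disjoint ranges that together
// cover the whole key space.
std::vector<KeyRange> SplitByBoundaries(
    const std::vector<std::string>& boundaries) {
  std::vector<KeyRange> ranges;
  for (std::size_t i = 0; i <= boundaries.size(); ++i) {
    KeyRange r;
    r.has_start = (i > 0);
    r.start = r.has_start ? boundaries[i - 1] : std::string();
    r.has_end = (i < boundaries.size());
    r.end = r.has_end ? boundaries[i] : std::string();
    ranges.push_back(r);
  }
  return ranges;
}
```

With boundaries g and p this yields three ranges: everything before g, [g, p), and everything from p onward. Because the ranges do not overlap, each one could be merged by a separate thread without coordination on keys.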