引言
谈到compaction,自然而然有几个我们很想知道答案的问题:
- 每个column family有属于自己的lsm,那么对于一个cfd而言,它什么时候需要compaction?
- 如果此时不止一层需要compaction,该如何选定层?
- 选定起始层后,如何挑选参与的sst?
- 如何从选定的sst读取kv,合并排序后再写入新的sst?
本篇文章将顺着这个脉络依次讲解。
回顾前文的flush流程分析:在 DBImpl::FlushMemTableToOutputFile中,flush执行完毕后(FlushJob::Run)会调用 DBImpl::InstallSuperVersionAndScheduleWork更新super version
// db/db_impl/db_impl_compaction_flush.cc
Status DBImpl::FlushMemTableToOutputFile(
ColumnFamilyData* cfd, const MutableCFOptions& mutable_cf_options,
bool* made_progress, JobContext* job_context, FlushReason flush_reason,
SuperVersionContext* superversion_context,
std::vector<SequenceNumber>& snapshot_seqs,
SequenceNumber earliest_write_conflict_snapshot,
SnapshotChecker* snapshot_checker, LogBuffer* log_buffer,
Env::Priority thread_pri) {
mutex_.AssertHeld(); // 仍然持有锁
// ...
// 生成flush任务
FlushJob flush_job(
dbname_, cfd, immutable_db_options_, mutable_cf_options, ...);
//...
if (s.ok()) {
// 运行flush任务
s = flush_job.Run(&logs_with_prep_tracker_, &file_meta,
&switched_to_mempurge, &skip_set_bg_error,
&error_handler_);
need_cancel = false;
}
// ...
if (s.ok()) {
// 更新super version,调度可能的compaction
InstallSuperVersionAndScheduleWork(cfd, superversion_context,
mutable_cf_options);
// ...
}
// ...
return s;
}
除了更新super version,它还有个重要作用便是触发、调度可能的compaction。毕竟随着flush的进行,L0的sst数目可能达到触发compaction的阈值,该阈值由 Options::level0_file_num_compaction_trigger控制。并且,此函数不只在flush后被调用,还会在compaction后被调用——一个compaction结束后又调度新的compaction。一个原因是,此次compaction可能让下一层的大小超过额度,还需要接着compaction,正所谓cascade compaction。由此可见,对于由flush或compaction使下层大小超额而产生的新compaction,源头都可以是DBImpl::InstallSuperVersionAndScheduleWork。注意此处用词“可以是”,因为compaction也能手动触发。
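为了更直观,这里给出一个最小的配置示例(仅为示意,路径与取值均为假设),串起本文涉及的几个compaction触发相关选项:
// 示例:与compaction触发相关的几个选项(取值仅为演示,非rocksdb源码)
#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // L0的sst数目达到该阈值后,L0的score >= 1,可触发L0 -> base level的compaction
  options.level0_file_num_compaction_trigger = 4;
  // base level(通常是L1)的目标大小,层大小超过目标后score > 1
  options.max_bytes_for_level_base = 256ull << 20;  // 256MB
  // 相邻两层目标大小的放大倍数
  options.max_bytes_for_level_multiplier = 10;
  // 置为true则关闭自动compaction,只能手动触发
  options.disable_auto_compactions = false;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/compaction_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}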
再次注意,这个函数是在线程池里被调用,即super version的更新是在flush或compaction线程里,而不是主线程。
函数调用顺序
DBImpl::InstallSuperVersionAndScheduleWork
// db/db_impl/db_impl_compaction_flush.cc
void DBImpl::InstallSuperVersionAndScheduleWork(
ColumnFamilyData* cfd, SuperVersionContext* sv_context,
const MutableCFOptions& mutable_cf_options) {
// 持有锁
mutex_.AssertHeld();
// Update max_total_in_memory_state_
size_t old_memtable_size = 0;
auto* old_sv = cfd->GetSuperVersion();
if (old_sv) {
old_memtable_size = old_sv->mutable_cf_options.write_buffer_size *
old_sv->mutable_cf_options.max_write_buffer_number;
}
// this branch is unlikely to step in
if (UNLIKELY(sv_context->new_superversion == nullptr)) {
sv_context->NewSuperVersion();
}
cfd->InstallSuperVersion(sv_context, mutable_cf_options);
// There may be a small data race here. The snapshot tricking bottommost
// compaction may already be released here. But assuming there will always be
// newer snapshot created and released frequently, the compaction will be
// triggered soon anyway.
// 上面注释的大意:这里可能存在轻微的data race——用于触发bottommost compaction的
// snapshot此时可能已被释放,但只要snapshot会被频繁地创建和释放,相应的compaction迟早还是会被触发
// ...
// Whenever we install new SuperVersion, we might need to issue new flushes or
// compactions.
SchedulePendingCompaction(cfd);
MaybeScheduleFlushOrCompaction();
// Update max_total_in_memory_state_
max_total_in_memory_state_ = max_total_in_memory_state_ - old_memtable_size +
mutable_cf_options.write_buffer_size *
mutable_cf_options.max_write_buffer_number;
}
这段代码的逻辑是:
- 检查是否持有锁
- 调用ColumnFamilyData::InstallSuperVersion给传入的cfd安装新的super version
- 调用DBImpl::SchedulePendingCompaction检查传入的cfd是否需要compaction,如果是,则将其加入DBImpl::compaction_queue_
- 调用DBImpl::MaybeScheduleFlushOrCompaction对新的compaction请求进行调度,由此形成了闭环!当然,这也可能会调度新的flush
再次提醒,接下来的函数都是在持有 DBImpl::mutex_的情况下完成的。关于super version如何更新,之后会单独开一篇文章来讲解。
判断compaction
DBImpl::SchedulePendingCompaction
// db/db_impl/db_impl_compaction_flush.cc
void DBImpl::SchedulePendingCompaction(ColumnFamilyData* cfd) {
mutex_.AssertHeld();
if (reject_new_background_jobs_) {
return;
}
if (!cfd->queued_for_compaction() && cfd->NeedsCompaction()) {
AddToCompactionQueue(cfd);
++unscheduled_compactions_;
}
}
该函数做了三件事:
- 调用ColumnFamilyData::queued_for_compaction检查传入的column family是否已经加入DBImpl::compaction_queue_,以此避免重复加入
- 调用ColumnFamilyData::NeedsCompaction检查此column family是否需要compaction
- 如果上述两个条件都满足,则调用DBImpl::AddToCompactionQueue将此column family加入DBImpl::compaction_queue_,并让DBImpl::unscheduled_compactions_加一
从命名上看,该函数似乎和 DBImpl::SchedulePendingFlush对等,其实不然。DBImpl::compaction_queue_存放的是column family,而 DBImpl::flush_queue_存放的是 FlushRequest,而一个 FlushRequest可以包含多个column family。
判断column family是否需要compaction,这通过 ColumnFamilyData::NeedsCompaction完成
ColumnFamilyData::NeedsCompaction
// db/column_family.cc
bool ColumnFamilyData::NeedsCompaction() const {
return !mutable_cf_options_.disable_auto_compactions &&
compaction_picker_->NeedsCompaction(current_->storage_info());
}
实际上通过指针调用了 CompactionPicker::NeedsCompaction,这又是一个纯虚函数。由于Rocksdb自带三种compaction策略,因此这个函数在rocksdb里有三种实现。我们先看默认的Level方式。
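顺带一提,采用哪种策略由选项决定,默认即为level风格。下面是一个最小的赋值示意(仅为演示,与本文流程无直接关系):
// 示意:compaction策略由Options::compaction_style指定
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  // 三种策略:kCompactionStyleLevel(默认)、kCompactionStyleUniversal、kCompactionStyleFIFO
  options.compaction_style = rocksdb::kCompactionStyleLevel;
  return 0;
}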
LevelCompactionPicker::NeedsCompaction
// db/compaction/compaction_picker_level.cc
bool LevelCompactionPicker::NeedsCompaction(
const VersionStorageInfo* vstorage) const {
if (!vstorage->ExpiredTtlFiles().empty()) {
return true;
}
if (!vstorage->FilesMarkedForPeriodicCompaction().empty()) {
return true;
}
if (!vstorage->BottommostFilesMarkedForCompaction().empty()) {
return true;
}
if (!vstorage->FilesMarkedForCompaction().empty()) {
return true;
}
if (!vstorage->FilesMarkedForForcedBlobGC().empty()) {
return true;
}
for (int i = 0; i <= vstorage->MaxInputLevel(); i++) {
if (vstorage->CompactionScore(i) >= 1) {
return true;
}
}
return false;
}
先暂时忽略其中的if,只关注for循环,查看VersionStorageInfo::CompactionScore
VersionStorageInfo::CompactionScore
// db/version_set.h
class VersionStorageInfo {
// ...
// Return idx'th highest score
double CompactionScore(int idx) const { return compaction_score_[idx]; }
// ...
}
正如注释所言,for循环是在从高到低查看分数,只要最大分数大于等于1,就认为需要compaction。所以,VersionStorageInfo::compaction_score_是个从大到小的有序数组。这个数组在VersionStorageInfo::ComputeCompactionScore中被修改
VersionStorageInfo::ComputeCompactionScore
// db/version_set.cc
void VersionStorageInfo::ComputeCompactionScore(
const ImmutableOptions& immutable_options,
const MutableCFOptions& mutable_cf_options) {
double total_downcompact_bytes = 0.0;
const double kScoreScale = 10.0;
int max_output_level = MaxOutputLevel(immutable_options.allow_ingest_behind);
for (int level = 0; level <= MaxInputLevel(); level++) {
double score;
if (level == 0) {
int num_sorted_runs = 0;
uint64_t total_size = 0;
for (auto* f : files_[level]) {
total_downcompact_bytes += static_cast<double>(f->fd.GetFileSize());
if (!f->being_compacted) {
total_size += f->compensated_file_size;
num_sorted_runs++;
}
}
if (compaction_style_ == kCompactionStyleUniversal) {
// ...
}
if (compaction_style_ == kCompactionStyleFIFO) {
// ...
} else {
score = static_cast<double>(num_sorted_runs) /
mutable_cf_options.level0_file_num_compaction_trigger;
if (compaction_style_ == kCompactionStyleLevel && num_levels() > 1) {
if (immutable_options.level_compaction_dynamic_level_bytes) {
if (total_size >= mutable_cf_options.max_bytes_for_level_base) {
// When calculating estimated_compaction_needed_bytes, we assume
// L0 is qualified as pending compactions. We will need to make
// sure that it qualifies for compaction.
// It might be guaranteed by logic below anyway, but we are
// explicit here to make sure we don't stop writes with no
// compaction scheduled.
score = std::max(score, 1.01);
}
if (total_size > level_max_bytes_[base_level_]) {
// In this case, we compare L0 size with actual LBase size and
// make sure score is more than 1.0 (10.0 after scaled) if L0 is
// larger than LBase. Since LBase score = LBase size /
// (target size + total_downcompact_bytes) where
// total_downcompact_bytes = total_size > LBase size,
// LBase score is lower than 10.0. So L0->LBase is prioritized
// over LBase -> LBase+1.
uint64_t base_level_size = 0;
for (auto f : files_[base_level_]) {
base_level_size += f->compensated_file_size;
}
score = std::max(score, static_cast<double>(total_size) /
static_cast<double>(std::max(
base_level_size,
level_max_bytes_[base_level_])));
}
if (score > 1.0) {
score *= kScoreScale;
}
} else {
score = std::max(score,
static_cast<double>(total_size) /
mutable_cf_options.max_bytes_for_level_base);
}
}
}
} else { // level > 0
// Compute the ratio of current size to size limit.
uint64_t level_bytes_no_compacting = 0;
uint64_t level_total_bytes = 0;
for (auto f : files_[level]) {
level_total_bytes += f->fd.GetFileSize();
if (!f->being_compacted) {
level_bytes_no_compacting += f->compensated_file_size;
}
}
if (!immutable_options.level_compaction_dynamic_level_bytes) {
score = static_cast<double>(level_bytes_no_compacting) /
MaxBytesForLevel(level);
} else {
if (level_bytes_no_compacting < MaxBytesForLevel(level)) {
score = static_cast<double>(level_bytes_no_compacting) /
MaxBytesForLevel(level);
} else {
// If there are a large mount of data being compacted down to the
// current level soon, we would de-prioritize compaction from
// a level where the incoming data would be a large ratio. We do
// it by dividing level size not by target level size, but
// the target size and the incoming compaction bytes.
score = static_cast<double>(level_bytes_no_compacting) /
(MaxBytesForLevel(level) + total_downcompact_bytes) *
kScoreScale;
}
// Drain unnecessary levels, but with lower priority compared to
// when L0 is eligible. Only non-empty levels can be unnecessary.
// If there is no unnecessary levels, lowest_unnecessary_level_ = -1.
if (level_bytes_no_compacting > 0 &&
level <= lowest_unnecessary_level_) {
score = std::max(
score, kScoreScale *
(1.001 + 0.001 * (lowest_unnecessary_level_ - level)));
}
}
if (level <= lowest_unnecessary_level_) {
total_downcompact_bytes += level_total_bytes;
} else if (level_total_bytes > MaxBytesForLevel(level)) {
total_downcompact_bytes +=
static_cast<double>(level_total_bytes - MaxBytesForLevel(level));
}
}
compaction_level_[level] = level;
compaction_score_[level] = score;
}
// sort all the levels based on their score. Higher scores get listed
// first. Use bubble sort because the number of entries are small.
for (int i = 0; i < num_levels() - 2; i++) {
for (int j = i + 1; j < num_levels() - 1; j++) {
if (compaction_score_[i] < compaction_score_[j]) {
double score = compaction_score_[i];
int level = compaction_level_[i];
compaction_score_[i] = compaction_score_[j];
compaction_level_[i] = compaction_level_[j];
compaction_score_[j] = score;
compaction_level_[j] = level;
}
}
}
// ...
}
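为了直观感受打分规则,下面用一段独立的小程序手工模拟打分(极度简化:忽略dynamic level bytes、compensated file size、正在被compact的文件等因素,配置与数值均为假设):
// 简化示意:按上述规则手工计算各层score(非rocksdb源码)
#include <cstdio>
#include <vector>

int main() {
  // 假设的配置
  const int level0_file_num_compaction_trigger = 4;
  const double max_bytes_for_level_base = 256.0 * (1 << 20);  // 256MB
  const double max_bytes_for_level_multiplier = 10.0;

  // 假设:L0有6个文件,L1有300MB数据,L2有1GB数据
  const int l0_file_count = 6;
  const std::vector<double> level_bytes = {0, 300.0 * (1 << 20),
                                           1024.0 * (1 << 20)};

  // L0:score = 文件数 / level0_file_num_compaction_trigger
  double l0_score =
      static_cast<double>(l0_file_count) / level0_file_num_compaction_trigger;
  std::printf("L0 score = %.2f\n", l0_score);  // 6/4 = 1.50,需要compaction

  // L1+:score = 该层大小 / 该层目标大小
  double target = max_bytes_for_level_base;
  for (size_t level = 1; level < level_bytes.size(); ++level) {
    double score = level_bytes[level] / target;
    std::printf("L%zu score = %.2f\n", level, score);
    target *= max_bytes_for_level_multiplier;
  }
  // L1: 300/256 ≈ 1.17(需要compaction);L2: 1024/2560 = 0.40(不需要)
  return 0;
}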
相信读者此时肯定有个疑问:什么时候调用这个更新分数的函数?
根据评分规则,很容易反推出flush或compaction之后肯定要重新计算分数,毕竟各层的sst数目和大小发生了变化。事实也的确如此,回忆之前的flush流程分析
// db/flush_job.cc
Status FlushJob::Run(LogsWithPrepTracker* prep_tracker, FileMetaData* file_meta,
bool* switched_to_mempurge, bool* skipped_since_bg_error,
ErrorHandler* error_handler) {
// ...
if (mempurge_s.ok()) {
base_->Unref();
s = Status::OK();
} else {
// 将imm刷入磁盘
s = WriteLevel0Table();
}
// ...
if (!s.ok()) {
cfd_->imm()->RollbackMemtableFlush(
mems_, /*rollback_succeeding_memtables=*/!db_options_.atomic_flush);
} else if (write_manifest_) {
if (/*...*/) {
// ...
} else {
// 正常的分支
TEST_SYNC_POINT("FlushJob::InstallResults");
// Replace immutable memtable with the generated Table
// 更新version,计算分数(compaction中讲解)
s = cfd_->imm()->TryInstallMemtableFlushResults(
cfd_, mutable_cf_options_, mems_, prep_tracker, versions_, db_mutex_,
meta_.fd.GetNumber(), &job_context_->memtables_to_free, db_directory_,
log_buffer_, &committed_flush_jobs_info_,
!(mempurge_s.ok()));
}
}
// ...
}
当时我们留了一个引子:flush刷完imm后会调用MemTableList::TryInstallMemtableFlushResults,这个函数除了更新version,还会重新计算分数。调用链如下(暂不对这条调用链进行详细分析)
FlushJob::Run
MemTableList::TryInstallMemtableFlushResults
VersionSet::LogAndApply
VersionSet::LogAndApply (重载)
VersionSet::ProcessManifestWrites
VersionSet::AppendVersion
VersionStorageInfo::ComputeCompactionScore
添加新的version时会重新计算score
// db/version_set.cc
void VersionSet::AppendVersion(ColumnFamilyData* column_family_data,
Version* v) {
// compute new compaction score
v->storage_info()->ComputeCompactionScore(
*column_family_data->ioptions(),
*column_family_data->GetLatestMutableCFOptions());
// ...
}
一套流程由此展开:对于一个新的数据库,每次flush将imm刷入 $L_0$ 后,先更新 $L_0$ 的分数(此时只有这一层),再执行DBImpl::InstallSuperVersionAndScheduleWork根据分数检查是否需要compaction。随着多次flush, $L_0$ 分数上涨。最终,在某一次执行InstallSuperVersionAndScheduleWork时,发现 $L_0$ 需要compaction,第一个compaction就诞生了!
此外,Compaction流程中也会更新分数(将在后文展示),它执行DBImpl::InstallSuperVersionAndScheduleWork时就可能产生新的compaction。我们先回顾目前的函数调用顺序
小结
rocksdb根据分数判断一个cfd是否需要compaction。简单回顾一下目前的函数流程
DBImpl::InstallSuperVersionAndScheduleWork
DBImpl::SchedulePendingCompaction
ColumnFamilyData::queued_for_compaction
ColumnFamilyData::NeedsCompaction
CompactionPicker::NeedsCompaction ===> LevelCompactionPicker::NeedsCompaction
VersionStorageInfo::CompactionScore
分析完rocksdb如何判断一个cfd是否需要compaction,我们回到 DBImpl::InstallSuperVersionAndScheduleWork,接下来该调度compaction任务了
// db/db_impl/db_impl_compaction_flush.cc
void DBImpl::InstallSuperVersionAndScheduleWork(
ColumnFamilyData* cfd, SuperVersionContext* sv_context,
const MutableCFOptions& mutable_cf_options) {
// 持有锁
mutex_.AssertHeld();
// ...
SchedulePendingCompaction(cfd);
MaybeScheduleFlushOrCompaction();
// ...
}
这便是 DBImpl::MaybeScheduleFlushOrCompaction。之前我们分析了其中的flush部分,现在来看compaction部分
DBImpl::MaybeScheduleFlushOrCompaction
// db/db_impl/db_impl_compaction_flush.cc
void DBImpl::MaybeScheduleFlushOrCompaction() {
mutex_.AssertHeld();
// ...
// 从compaction队列里"取"任务执行
while (bg_compaction_scheduled_ + bg_bottom_compaction_scheduled_ <
bg_job_limits.max_compactions &&
unscheduled_compactions_ > 0) {
CompactionArg* ca = new CompactionArg;
ca->db = this;
ca->compaction_pri_ = Env::Priority::LOW;
ca->prepicked_compaction = nullptr;
bg_compaction_scheduled_++;
unscheduled_compactions_--;
env_->Schedule(&DBImpl::BGWorkCompaction, ca, Env::Priority::LOW, this,
&DBImpl::UnscheduleCompactionCallback);
}
}
与flush类似,此处并不会对 DBImpl::compaction_queue_执行pop,真正的出队发生在后台的compaction线程中。之后的函数一定在线程池中执行。
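顺带看一眼这里用到的线程池接口:Env::Schedule会把任务投递到指定优先级的线程池执行。下面是一个脱离DBImpl的最小示意(DemoArg、DemoBGWork均为假设的演示代码),模仿BGWorkCompaction由回调自己释放堆上参数的做法:
// 示意:用Env::Schedule向LOW线程池投递任务(非rocksdb源码)
#include <cstdio>
#include "rocksdb/env.h"

struct DemoArg {
  int job_id;
};

// 与BGWorkCompaction类似:回调自己负责delete堆上分配的参数
static void DemoBGWork(void* arg) {
  auto* a = static_cast<DemoArg*>(arg);
  std::printf("demo job %d is running in the LOW thread pool\n", a->job_id);
  delete a;
}

int main() {
  rocksdb::Env* env = rocksdb::Env::Default();
  env->SetBackgroundThreads(2, rocksdb::Env::Priority::LOW);
  env->Schedule(&DemoBGWork, new DemoArg{1}, rocksdb::Env::Priority::LOW);
  env->SleepForMicroseconds(100 * 1000);  // 仅为演示:等待后台任务执行完
  return 0;
}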
DBImpl::BGWorkCompaction
// db/db_impl/db_impl_compaction_flush.cc
void DBImpl::BGWorkCompaction(void* arg) {
CompactionArg ca = *(reinterpret_cast<CompactionArg*>(arg)); // 这里存在拷贝构造
delete reinterpret_cast<CompactionArg*>(arg);
IOSTATS_SET_THREAD_POOL_ID(Env::Priority::LOW);
TEST_SYNC_POINT("DBImpl::BGWorkCompaction");
auto prepicked_compaction =
static_cast<PrepickedCompaction*>(ca.prepicked_compaction);
static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction(
prepicked_compaction, Env::Priority::LOW);
delete prepicked_compaction;
}
再次提醒,从这个函数开始,一定都在线程池中执行!
DBImpl::BGWorkCompaction很简短:先拷贝再delete掉前面 new出来的 CompactionArg,取出其中的参数(当前流程下prepicked_compaction是 nullptr),然后调用 DBImpl::BackgroundCallCompaction。
DBImpl::BackgroundCallCompaction
// db/db_impl/db_impl_compaction_flush.cc
void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction,
Env::Priority bg_thread_pri) {
// 当前prepicked_compaction是nullptr
// bg_thread_pri是Env::Priority::LOW
// ...
{
// 需要持有锁
InstrumentedMutexLock l(&mutex_);
num_running_compactions_++;
std::unique_ptr<std::list<uint64_t>::iterator>
pending_outputs_inserted_elem(new std::list<uint64_t>::iterator(
CaptureCurrentFileNumberInPendingOutputs()));
assert((bg_thread_pri == Env::Priority::BOTTOM &&
bg_bottom_compaction_scheduled_) ||
(bg_thread_pri == Env::Priority::LOW && bg_compaction_scheduled_));
Status s = BackgroundCompaction(&made_progress, &job_context, &log_buffer,
prepicked_compaction, bg_thread_pri);
TEST_SYNC_POINT("BackgroundCallCompaction:1");
if (s.IsBusy()) {
bg_cv_.SignalAll(); // In case a waiter can proceed despite the error
mutex_.Unlock();
immutable_db_options_.clock->SleepForMicroseconds(
10000); // prevent hot loop
mutex_.Lock();
} else if (!s.ok() && !s.IsShutdownInProgress() &&
!s.IsManualCompactionPaused() && !s.IsColumnFamilyDropped()) {
// Wait a little bit before retrying background compaction in
// case this is an environmental problem and we do not want to
// chew up resources for failed compactions for the duration of
// the problem.
// BackgroundCompaction里可能返回Status::CompactionTooLarge
// 走这条分支
// ...
} else if (s.IsManualCompactionPaused()) {
// ...
}
ReleaseFileNumberFromPendingOutputs(pending_outputs_inserted_elem);
// If compaction failed, we want to delete all temporary files that we
// might have created (they might not be all recorded in job_context in
// case of a failure). Thus, we force full scan in FindObsoleteFiles()
FindObsoleteFiles(&job_context, !s.ok() && !s.IsShutdownInProgress() &&
!s.IsManualCompactionPaused() &&
!s.IsColumnFamilyDropped() &&
!s.IsBusy());
TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:FoundObsoleteFiles");
// delete unnecessary files if any, this is done outside the mutex
if (job_context.HaveSomethingToClean() ||
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
// ...
}
assert(num_running_compactions_ > 0);
num_running_compactions_--;
if (bg_thread_pri == Env::Priority::LOW) {
// 当前流程走该分支
bg_compaction_scheduled_--;
} else {
assert(bg_thread_pri == Env::Priority::BOTTOM);
bg_bottom_compaction_scheduled_--;
}
// See if there's more work to be done
MaybeScheduleFlushOrCompaction();
// 当前流程下prepicked_compaction是nullptr
if (prepicked_compaction != nullptr &&
prepicked_compaction->task_token != nullptr) {
// ...
}
// ...
}
}
调用 DBImpl::BackgroundCompaction,参数情况如下
| 形参 | 实参 | 值 |
|---|---|---|
| made_progress | &made_progress | false |
| prepicked_compaction | prepicked_compaction | nullptr |
| thread_pri | bg_thread_pri | Env::Priority::LOW |
DBImpl::BackgroundCompaction
// db/db_impl/db_impl_compaction_flush.cc
Status DBImpl::BackgroundCompaction(bool* made_progress,
JobContext* job_context,
LogBuffer* log_buffer,
PrepickedCompaction* prepicked_compaction,
Env::Priority thread_pri) {
// prepicked_compaction是nullptr
// 当前流程下是nullptr
ManualCompactionState* manual_compaction =
prepicked_compaction == nullptr
? nullptr
: prepicked_compaction->manual_compaction_state;
*made_progress = false;
mutex_.AssertHeld();
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:Start");
const ReadOptions read_options(Env::IOActivity::kCompaction);
const WriteOptions write_options(Env::IOActivity::kCompaction);
// 当前流程下是false
bool is_manual = (manual_compaction != nullptr);
std::unique_ptr<Compaction> c;
if (prepicked_compaction != nullptr &&
prepicked_compaction->compaction != nullptr) {
c.reset(prepicked_compaction->compaction);
}
// 当前流程下是false
bool is_prepicked = is_manual || c;
// (manual_compaction->in_progress == false);
bool trivial_move_disallowed =
is_manual && manual_compaction->disallow_trivial_move;
// 因为prepicked_compaction是nullptr, 所以
// manual_compaction=nullptr
// is_manual=false
// c=nullptr
// is_prepicked=false
// trivial_move_disallowed=false
CompactionJobStats compaction_job_stats;
Status status;
// ...
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:InProgress");
// 用来控制compaction频率
std::unique_ptr<TaskLimiterToken> task_token;
bool sfm_reserved_compact_space = false;
if (is_manual) {
// 当前流程下是false略过...
} else if (!is_prepicked && !compaction_queue_.empty()) {
if (HasExclusiveManualCompaction()) {
// Can't compact right now, but try again later
TEST_SYNC_POINT("DBImpl::BackgroundCompaction()::Conflict");
// Stay in the compaction queue.
unscheduled_compactions_++;
return Status::OK();
}
// 从compaction_queue_取出需要compaction的column family
auto cfd = PickCompactionFromQueue(&task_token, log_buffer);
// ...
auto* mutable_cf_options = cfd->GetLatestMutableCFOptions();
if (!mutable_cf_options->disable_auto_compactions && !cfd->IsDropped()) {
// ...
// 选择参与compaction的文件,给指针c赋值
c.reset(cfd->PickCompaction(*mutable_cf_options, mutable_db_options_,
log_buffer));
TEST_SYNC_POINT("DBImpl::BackgroundCompaction():AfterPickCompaction");
if (c != nullptr) {
bool enough_room = EnoughRoomForCompaction(
cfd, *(c->inputs()), &sfm_reserved_compact_space, log_buffer);
if (!enough_room) {
// Then don't do the compaction
// ...
// Don't need to sleep here, because BackgroundCallCompaction
// will sleep if !s.ok()
status = Status::CompactionTooLarge();
} else {
// update statistics
// ...
// 当这些files被挑选后,层的分数就发生了变化
// 检查该column family是否还需要compaction
// 如果是,则加入compaction_queue_队列
// 并执行MaybeScheduleFlushOrCompaction
// ...
}
}
}
}
IOStatus io_s;
bool compaction_released = false;
// compaction分几种
// 可直接删除input的deletion compaction
// 直接将输入文件移动到下一层的trivial move compaction
// forward compaction
// 常规compaction
if (!c) {
// Nothing to do
ROCKS_LOG_BUFFER(log_buffer, "Compaction nothing to do");
} else if (c->deletion_compaction()) {
// ...
} else if (!trivial_move_disallowed && c->IsTrivialMove()) {
// ...
} else if (!is_prepicked && c->output_level() > 0 && /*...*/) {
// Forward compactions involving last level to the bottom pool if it exists,
// such that compactions unlikely to contribute to write stalls can be
// delayed or deprioritized.
// ...
} else {
// ...
// 创建CompactionJob
CompactionJob compaction_job(
job_context->job_id, c.get(), immutable_db_options_,
mutable_db_options_, file_options_for_compaction_, versions_.get(),
&shutting_down_, log_buffer, directories_.GetDbDir(),
GetDataDir(c->column_family_data(), c->output_path_id()),
GetDataDir(c->column_family_data(), 0), stats_, &mutex_,
&error_handler_, snapshot_seqs, earliest_write_conflict_snapshot,
snapshot_checker, job_context, table_cache_, &event_logger_,
c->mutable_cf_options()->paranoid_file_checks,
c->mutable_cf_options()->report_bg_io_stats, dbname_,
&compaction_job_stats, thread_pri, io_tracer_,
is_manual ? manual_compaction->canceled
: kManualCompactionCanceledFalse_,
db_id_, db_session_id_, c->column_family_data()->GetFullHistoryTsLow(),
c->trim_ts(), &blob_callback_, &bg_compaction_scheduled_,
&bg_bottom_compaction_scheduled_);
// 相较于flush,compaction需要调用CompactionJob::Prepare
compaction_job.Prepare();
NotifyOnCompactionBegin(c->column_family_data(), c.get(), status,
compaction_job_stats, job_context->job_id);
// 对于compaction而言, 此处就释放了锁!
mutex_.Unlock();
TEST_SYNC_POINT_CALLBACK(
"DBImpl::BackgroundCompaction:NonTrivial:BeforeRun", nullptr);
// Should handle error?
// 运行compaction job
compaction_job.Run().PermitUncheckedError();
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:NonTrivial:AfterRun");
// 再次获取锁
mutex_.Lock();
status =
compaction_job.Install(*c->mutable_cf_options(), &compaction_released);
io_s = compaction_job.io_status();
if (status.ok()) {
InstallSuperVersionAndScheduleWork(c->column_family_data(),
&job_context->superversion_contexts[0],
*c->mutable_cf_options());
}
*made_progress = true;
TEST_SYNC_POINT_CALLBACK("DBImpl::BackgroundCompaction:AfterCompaction",
c->column_family_data());
}
if (status.ok() && !io_s.ok()) {
status = io_s;
} else {
io_s.PermitUncheckedError();
}
if (c != nullptr) {
if (!compaction_released) {
c->ReleaseCompactionFiles(status);
} else {
// ...
}
*made_progress = true;
// Need to make sure SstFileManager does its bookkeeping
// DBImpl::FlushMemTableToOutputFile里有类似的流程
auto sfm = static_cast<SstFileManagerImpl*>(
immutable_db_options_.sst_file_manager.get());
if (sfm && sfm_reserved_compact_space) {
sfm->OnCompactionCompletion(c.get());
}
NotifyOnCompactionCompleted(c->column_family_data(), c.get(), status,
compaction_job_stats, job_context->job_id);
}
if (status.ok() || status.IsCompactionTooLarge() ||
status.IsManualCompactionPaused()) {
// compaction没问题走该分支
// Done
} else if (status.IsColumnFamilyDropped() || status.IsShutdownInProgress()) {
// Ignore compaction errors found during shutting down
} else {
// 其它问题走该分支
}
// this will unref its input_version and column_family_data
c.reset();
// ...
TEST_SYNC_POINT("DBImpl::BackgroundCompaction:Finish");
return status;
}
该函数的核心逻辑是
- 先从DBImpl::compaction_queue_取出需要compaction的column family
- 再调用ColumnFamilyData::PickCompaction选择需要compaction的文件
- 然后生成CompactionJob,并执行
重点来了,之前流程只是选择了需要compaction的column family,如何确定层和sst文件呢?ColumnFamilyData::PickCompaction负责这块。注意,整个挑选sst的过程被锁保护!
确定层和选择sst文件
官方wiki Choose Level Compaction Files 详细总结了compaction如何确定层以及如何选择sst文件。建议先阅读它。
接下来我们从源码来理解这个过程
ColumnFamilyData::PickCompaction
// db/column_family.cc
Compaction* ColumnFamilyData::PickCompaction(
const MutableCFOptions& mutable_options,
const MutableDBOptions& mutable_db_options, LogBuffer* log_buffer) {
auto* result = compaction_picker_->PickCompaction(
GetName(), mutable_options, mutable_db_options, current_->storage_info(),
log_buffer);
if (result != nullptr) {
result->FinalizeInputInfo(current_);
}
return result;
}
可见,每个column family有自己的 CompactionPicker,这个picker负责该column family的所有compaction挑选任务。CompactionPicker::PickCompaction是一个纯虚函数,道理也很简单,因为compaction策略不止一种。我们现在分析的是level策略,因此看level的实现
LevelCompactionPicker::PickCompaction
// db/compaction/compaction_picker_level.cc
Compaction* LevelCompactionPicker::PickCompaction(
const std::string& cf_name, const MutableCFOptions& mutable_cf_options,
const MutableDBOptions& mutable_db_options, VersionStorageInfo* vstorage,
LogBuffer* log_buffer) {
LevelCompactionBuilder builder(cf_name, vstorage, this, log_buffer,
mutable_cf_options, ioptions_,
mutable_db_options);
return builder.PickCompaction();
}
每调用一次,这个函数就会构造一个新的 LevelCompactionBuilder,然后调用 LevelCompactionBuilder::PickCompaction。为什么每次都要构造新的builder?因为每次传入的 vstorage信息不同!又因为此处传入了 this指针,所以 LevelCompactionBuilder中也能访问到该column family的 compaction_picker_。
LevelCompactionBuilder::PickCompaction
// db/compaction/compaction_picker_level.cc
Compaction* LevelCompactionBuilder::PickCompaction() {
// Pick up the first file to start compaction. It may have been extended
// to a clean cut.
SetupInitialFiles();
if (start_level_inputs_.empty()) {
return nullptr;
}
assert(start_level_ >= 0 && output_level_ >= 0);
// If it is a L0 -> base level compaction, we need to set up other L0
// files if needed.
if (!SetupOtherL0FilesIfNeeded()) {
return nullptr;
}
// Pick files in the output level and expand more files in the start level
// if needed.
if (!SetupOtherInputsIfNeeded()) {
return nullptr;
}
// Form a compaction object containing the files we picked.
Compaction* c = GetCompaction();
TEST_SYNC_POINT_CALLBACK("LevelCompactionPicker::PickCompaction:Return", c);
return c;
}
如注释所言,该函数逻辑非常清楚。假设此次compaction的发起层是 $L_i$,输出层是 $L_o$:
- 调用LevelCompactionBuilder::SetupInitialFiles选择 $L_i$ 中的sst,它们存于start_level_inputs_
- 如果start_level_inputs_为空,返回nullptr,表示无法产生一个compaction
- 如果 $L_i$ 是 $L_0$,则还需要选择其它的sst,因为 $L_0$ 中的sst相互重叠
- 调用LevelCompactionBuilder::SetupOtherInputsIfNeeded选择 $L_o$ 的sst
- 调用LevelCompactionBuilder::GetCompaction构建包含这些sst的Compaction
事实上,目前我们压根不知道发起层 $L_i$ 和输出层 $L_o$ 是哪两层!sst的选择依赖于层的选择,这些都由 LevelCompactionBuilder::SetupInitialFiles完成
LevelCompactionBuilder::SetupInitialFiles
// db/compaction/compaction_picker_level.cc
void LevelCompactionBuilder::SetupInitialFiles() {
// Find the compactions by size on all levels.
bool skipped_l0_to_base = false;
for (int i = 0; i < compaction_picker_->NumberLevels() - 1; i++) {
start_level_score_ = vstorage_->CompactionScore(i);
start_level_ = vstorage_->CompactionScoreLevel(i);
assert(i == 0 || start_level_score_ <= vstorage_->CompactionScore(i - 1));
if (start_level_score_ >= 1) {
if (skipped_l0_to_base && start_level_ == vstorage_->base_level()) {
// If L0->base_level compaction is pending, don't schedule further
// compaction from base level. Otherwise L0->base_level compaction
// may starve.
continue;
}
output_level_ =
(start_level_ == 0) ? vstorage_->base_level() : start_level_ + 1;
bool picked_file_to_compact = PickFileToCompact();
TEST_SYNC_POINT_CALLBACK("PostPickFileToCompact",
&picked_file_to_compact);
if (picked_file_to_compact) {
// found the compaction!
if (start_level_ == 0) {
// L0 score = `num L0 files` / `level0_file_num_compaction_trigger`
compaction_reason_ = CompactionReason::kLevelL0FilesNum;
} else {
// L1+ score = `Level files size` / `MaxBytesForLevel`
compaction_reason_ = CompactionReason::kLevelMaxLevelSize;
}
break;
} else {
// didn't find the compaction, clear the inputs
start_level_inputs_.clear();
if (start_level_ == 0) {
skipped_l0_to_base = true;
// L0->base_level may be blocked due to ongoing L0->base_level
// compactions. It may also be blocked by an ongoing compaction from
// base_level downwards.
//
// In these cases, to reduce L0 file count and thus reduce likelihood
// of write stalls, we can attempt compacting a span of files within
// L0.
if (PickIntraL0Compaction()) {
output_level_ = 0;
compaction_reason_ = CompactionReason::kLevelL0FilesNum;
break;
}
}
}
} else {
// Compaction scores are sorted in descending order, no further scores
// will be >= 1.
break;
}
}
if (!start_level_inputs_.empty()) {
return;
}
// if we didn't find a compaction, check if there are any files marked for
// compaction
// 暂时忽略其它情况
}
整段代码的逻辑是:
- 根据score由高到低遍历层
- 输入层 $L_i$ 即start_level_,其分数是start_level_score_。如果start_level_score_小于1,直接break
- 确定输出层 $L_o$ 即output_level_:如果 $L_i$ 是 $L_0$,则 $L_o$ 是base_level(),否则是下一层 $L_{i+1}$
- 调用LevelCompactionBuilder::PickFileToCompact挑选出输入文件start_level_inputs_:
  - 如果函数返回false,表明无法选出sst;当输入层是 $L_0$ 时,再调用LevelCompactionBuilder::PickIntraL0Compaction尝试 $L_0$ 内部的compaction
  - 如果函数返回true,表明能选出sst,那么任务就完成了,break

由此可见,compaction挑选分数最高(且大于等于1)的层,这回答了开篇“当有多层需要compaction时该如何选定层”的问题。
LevelCompactionBuilder::PickFileToCompact
// db/compaction/compaction_picker_level.cc
bool LevelCompactionBuilder::PickFileToCompact() {
// level 0 files are overlapping. So we cannot pick more
// than one concurrent compactions at this level. This
// could be made better by looking at key-ranges that are
// being compacted at level 0.
if (start_level_ == 0 &&
!compaction_picker_->level0_compactions_in_progress()->empty()) {
TEST_SYNC_POINT("LevelCompactionPicker::PickCompactionBySize:0");
return false;
}
start_level_inputs_.clear();
start_level_inputs_.level = start_level_;
assert(start_level_ >= 0);
if (TryPickL0TrivialMove()) {
return true;
}
const std::vector<FileMetaData*>& level_files =
vstorage_->LevelFiles(start_level_);
// Pick the file with the highest score in this level that is not already
// being compacted.
const std::vector<int>& file_scores =
vstorage_->FilesByCompactionPri(start_level_);
unsigned int cmp_idx;
for (cmp_idx = vstorage_->NextCompactionIndex(start_level_);
cmp_idx < file_scores.size(); cmp_idx++) {
int index = file_scores[cmp_idx];
auto* f = level_files[index];
// do not pick a file to compact if it is being compacted
// from n-1 level.
if (f->being_compacted) {
if (ioptions_.compaction_pri == kRoundRobin) {
// TODO(zichen): this file may be involved in one compaction from
// an upper level, cannot advance the cursor for round-robin policy.
// Currently, we do not pick any file to compact in this case. We
// should fix this later to ensure a compaction is picked but the
// cursor shall not be advanced.
return false;
}
continue;
}
start_level_inputs_.files.push_back(f);
if (!compaction_picker_->ExpandInputsToCleanCut(cf_name_, vstorage_,
&start_level_inputs_) ||
compaction_picker_->FilesRangeOverlapWithCompaction(
{start_level_inputs_}, output_level_,
Compaction::EvaluatePenultimateLevel(
vstorage_, ioptions_, start_level_, output_level_))) {
// A locked (pending compaction) input-level file was pulled in due to
// user-key overlap.
start_level_inputs_.clear();
if (ioptions_.compaction_pri == kRoundRobin) {
return false;
}
continue;
}
// Now that input level is fully expanded, we check whether any output
// files are locked due to pending compaction.
//
// Note we rely on ExpandInputsToCleanCut() to tell us whether any output-
// level files are locked, not just the extra ones pulled in for user-key
// overlap.
InternalKey smallest, largest;
compaction_picker_->GetRange(start_level_inputs_, &smallest, &largest);
CompactionInputFiles output_level_inputs;
output_level_inputs.level = output_level_;
vstorage_->GetOverlappingInputs(output_level_, &smallest, &largest,
&output_level_inputs.files);
if (output_level_inputs.empty()) {
if (start_level_ > 0 &&
TryExtendNonL0TrivialMove(index,
ioptions_.compaction_pri ==
kRoundRobin /* only_expand_right */)) {
break;
}
} else {
if (!compaction_picker_->ExpandInputsToCleanCut(cf_name_, vstorage_,
&output_level_inputs)) {
start_level_inputs_.clear();
if (ioptions_.compaction_pri == kRoundRobin) {
return false;
}
continue;
}
}
base_index_ = index;
break;
}
// store where to start the iteration in the next call to PickCompaction
if (ioptions_.compaction_pri != kRoundRobin) {
vstorage_->SetNextCompactionIndex(start_level_, cmp_idx);
}
return start_level_inputs_.size() > 0;
}
这段代码的逻辑为:
- 如果 $L_i$ 是 $L_0$,且目前已有 $L_0 \to \text{base\_level}$ 的compaction在进行,那么终止挑选。 $L_0$ 的compaction无法并发执行(存疑)
- 在 $L_i$ 中,根据compaction priority分数由高到低依次检查sst,for循环只需挑选出一个:
  - 如果该sst正在被compact,continue检查下一个
  - 将它放进start_level_inputs_
  - 调用CompactionPicker::ExpandInputsToCleanCut对该sst执行clean cut操作,从 $L_i$ 中选出更多相连的sst放进start_level_inputs_;如果这些相连的sst中有正在被compact的,清空start_level_inputs_,continue
  - 调用CompactionPicker::FilesRangeOverlapWithCompaction对clean cut后的start_level_inputs_做检查。如果不合格,则清空start_level_inputs_,continue
  - 从 $L_o$ 中找到所有和start_level_inputs_存在重叠的sst,把它们放进output_level_inputs
  - 调用CompactionPicker::ExpandInputsToCleanCut对这些sst进行clean cut;如果其中有正在被compact的sst,清空start_level_inputs_,continue
  - 将该sst的序号记录于base_index_,break
- 如果start_level_inputs_中有sst,返回true
这里有必要说明一下clean cut:简单来说,ExpandInputsToCleanCut会不断扩展已选文件的key范围,直到边界上不存在某个user key同时落在“已选文件”和“未选文件”中,保证compaction的输入边界是干净的。回到LevelCompactionBuilder::PickCompaction。注意,上面的output_level_inputs只是一个局部变量,换言之, $L_o$ 的sst挑选并不在此完成。
// db/compaction/compaction_picker_level.cc
Compaction* LevelCompactionBuilder::PickCompaction() {
SetupInitialFiles();
if (start_level_inputs_.empty()) {
return nullptr;
}
assert(start_level_ >= 0 && output_level_ >= 0);
// If it is a L0 -> base level compaction, we need to set up other L0
// files if needed.
if (!SetupOtherL0FilesIfNeeded()) {
return nullptr;
}
// ...
return c;
}
当前我们已确定了 $L_i$,并从 $L_i$ 中选好了sst,接下来该执行LevelCompactionBuilder::SetupOtherL0FilesIfNeeded了
LevelCompactionBuilder::SetupOtherL0FilesIfNeeded
// db/compaction/compaction_picker_level.cc
bool LevelCompactionBuilder::SetupOtherL0FilesIfNeeded() {
if (start_level_ == 0 && output_level_ != 0 && !is_l0_trivial_move_) {
return compaction_picker_->GetOverlappingL0Files(
vstorage_, &start_level_inputs_, output_level_, &parent_index_);
}
return true;
}
这个函数调用CompactionPicker::GetOverlappingL0Files对start_level_inputs_进行拓展,其间会检查拓展后的sst们有无正在被compact,如果有则返回false。
再回到LevelCompactionBuilder::PickCompaction,
Compaction* LevelCompactionBuilder::PickCompaction() {
// ...
// If it is a L0 -> base level compaction, we need to set up other L0
// files if needed.
if (!SetupOtherL0FilesIfNeeded()) {
return nullptr;
}
// Pick files in the output level and expand more files in the start level
// if needed.
if (!SetupOtherInputsIfNeeded()) {
return nullptr;
}
// ...
return c;
}
下一步是LevelCompactionBuilder::SetupOtherInputsIfNeeded
LevelCompactionBuilder::SetupOtherInputsIfNeeded
暂且不分析
至此,我们完成了 $L_i$ 和 $L_o$ 的确定:先把来自 $L_i$ 的输入start_level_inputs_放进compaction_inputs_,再放入来自 $L_o$ 的输入output_level_inputs_,现在所有参与本次compaction的sst都在compaction_inputs_里。最后,LevelCompactionBuilder::PickCompaction里再调用LevelCompactionBuilder::GetCompaction把这些资源转移到Compaction对象中,并返回它。
// db/compaction/compaction_picker_level.cc
Compaction* LevelCompactionPicker::PickCompaction(
const std::string& cf_name, const MutableCFOptions& mutable_cf_options,
const MutableDBOptions& mutable_db_options, VersionStorageInfo* vstorage,
LogBuffer* log_buffer) {
LevelCompactionBuilder builder(cf_name, vstorage, this, log_buffer,
mutable_cf_options, ioptions_,
mutable_db_options);
return builder.PickCompaction();
}
LevelCompactionPicker::PickCompaction返回这个compaction指针给调用方ColumnFamilyData::PickCompaction
// db/column_family.cc
Compaction* ColumnFamilyData::PickCompaction(
const MutableCFOptions& mutable_options,
const MutableDBOptions& mutable_db_options, LogBuffer* log_buffer) {
auto* result = compaction_picker_->PickCompaction(
GetName(), mutable_options, mutable_db_options, current_->storage_info(),
log_buffer);
if (result != nullptr) {
result->FinalizeInputInfo(current_);
}
return result;
}
最后返回给DBImpl::BackgroundCompaction
// db/db_impl/db_impl_compaction_flush.cc
Status DBImpl::BackgroundCompaction(bool* made_progress,
JobContext* job_context,
LogBuffer* log_buffer,
PrepickedCompaction* prepicked_compaction,
Env::Priority thread_pri) {
//...
auto cfd = PickCompactionFromQueue(&task_token, log_buffer);
// ...
if (!mutable_cf_options->disable_auto_compactions && !cfd->IsDropped()) {
// ...
c.reset(cfd->PickCompaction(*mutable_cf_options, mutable_db_options_,
log_buffer));
}
//...
}
小结
由于确定层和挑选sst需要持有锁,所以同一个sst不可能被两个compaction任务同时选取,sst视图的一致性得到了初步保证。对于层的确定,选择分数最高(且需大于等于1)的层;对于sst的选择,先选出优先级最高的一个sst,再通过clean cut扩展。回顾目前的函数流程
DBImpl::InstallSuperVersionAndScheduleWork
DBImpl::SchedulePendingCompaction #判断cfd是否需要compaction
ColumnFamilyData::queued_for_compaction
ColumnFamilyData::NeedsCompaction
CompactionPicker::NeedsCompaction ===> LevelCompactionPicker::NeedsCompaction
VersionStorageInfo::CompactionScore
DBImpl::MaybeScheduleFlushOrCompaction
DBImpl::BGWorkCompaction
DBImpl::BackgroundCallCompaction
DBImpl::BackgroundCompaction
ColumnFamilyData::PickCompaction ===> LevelCompactionPicker::PickCompaction
LevelCompactionBuilder::PickCompaction
LevelCompactionBuilder::SetupInitialFiles # 确定层和起始文件
LevelCompactionBuilder::SetupOtherL0FilesIfNeeded
LevelCompactionBuilder::SetupOtherInputsIfNeeded
LevelCompactionBuilder::GetCompaction # 打包选择的sst生成一个Compaction对象
接下来,我们回到DBImpl::BackgroundCompaction,当生成CompactionJob后,调用其Prepare方法。这与flush不一样,FlushJob中没有同名方法。因为rocksdb支持将一个compaction任务划分成多个子任务,使用多线程并发执行它们。划分的工作正是由CompactionJob::Prepare完成。
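在进入Prepare之前,先提一下子任务并发数的控制开关,下面的赋值仅为示意(取值为假设):
// 示意:子compaction的并发上限由DBOptions::max_subcompactions控制
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.max_subcompactions = 4;  // 允许最多4个subcompaction并发执行
  return 0;
}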
划分子compaction
CompactionJob::Prepare
未完待续
Prepare中完成了子任务的划分,我们再次回到DBImpl::BackgroundCompaction,该执行CompactionJob::Run了。
CompactionJob::Run
未完待续
总结
未完待续