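As described above, there are three kinds of compaction:
1) Compacting the Memtable
2) Trivial compaction, which moves a file directly to the next level
3) General compaction, implemented by DoCompactionWork()
Each is described in detail below.
1. Memtable compaction
Memtable compaction is performed by DBImpl::CompactMemTable():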
- void DBImpl::CompactMemTable() {
- mutex_.AssertHeld();
- assert(imm_ != NULL);
-
- VersionEdit edit;
- Version* base = versions_->current();
- base->Ref();
- Status s = WriteLevel0Table(imm_, &edit, base);
- base->Unref();
-
- if (s.ok()) {
- edit.SetPrevLogNumber(0);
- edit.SetLogNumber(logfile_number_);
- s = versions_->LogAndApply(&edit, &mutex_);
- }
-
- if (s.ok()) {
- imm_->Unref();
- imm_ = NULL;
- has_imm_.Release_Store(NULL);
- DeleteObsoleteFiles();
- } else {
- RecordBackgroundError(s);
- }
- }
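It mainly calls two functions: WriteLevel0Table() and versions_->LogAndApply().
1) First it calls WriteLevel0Table(), which does the following:
1. It calls BuildTable() to write all data in the Immutable Memtable into a single .sst file, and records that file's information (file number, key range, file size) in the variable meta. Because the Memtable is built on a Skiplist and is therefore sorted, the keys are also written to the .sst file in ascending order. Note that every record is written when the Memtable is converted to an SSTable, deletion markers included; the actual deletion happens later, during compaction into higher levels. Level 0 can therefore contain several records with the same key.
2. It then calls PickLevelForMemTableOutput() to choose a suitable level for the output file of the Memtable compaction, and calls edit->AddFile() to add the generated .sst file to that level.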
- Status DBImpl::WriteLevel0Table(MemTable* mem, VersionEdit* edit,
- Version* base) {
- mutex_.AssertHeld();
- FileMetaData meta;
- meta.number = versions_->NewFileNumber();
- pending_outputs_.insert(meta.number);
- Iterator* iter = mem->NewIterator();
-
- Status s;
- {
- mutex_.Unlock();
- s = BuildTable(dbname_, env_, options_, table_cache_, iter, &meta);
- mutex_.Lock();
- }
-
- delete iter;
- pending_outputs_.erase(meta.number);
-
- int level = 0;
- if (s.ok() && meta.file_size > 0) {
- const Slice min_user_key = meta.smallest.user_key();
- const Slice max_user_key = meta.largest.user_key();
- if (base != NULL) {
- level = base->PickLevelForMemTableOutput(min_user_key, max_user_key);
- }
- edit->AddFile(level, meta.number, meta.file_size,meta.smallest, meta.largest);
- }
- return s;
- }
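The flush behaviour described above can be illustrated with a minimal sketch. It is not LevelDB code: ToyMemTable, ToyTable and FlushToTable are hypothetical names, a std::map stands in for the Skiplist, and a std::vector stands in for the .sst file. It only shows that entries are emitted in sorted order, that deletion markers are copied as well, and that the key range is recorded along the way.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for LevelDB's internal key types.
enum class EntryType { kDeletion, kValue };

struct InternalKey {
  std::string user_key;
  uint64_t sequence;
  EntryType type;
};

// user_key ascending, then sequence descending (newest first).
struct InternalKeyLess {
  bool operator()(const InternalKey& a, const InternalKey& b) const {
    if (a.user_key != b.user_key) return a.user_key < b.user_key;
    return a.sequence > b.sequence;
  }
};

// A sorted map plays the role of the Skiplist-based Memtable.
using ToyMemTable = std::map<InternalKey, std::string, InternalKeyLess>;

// A vector plus the recorded key range plays the role of the .sst file and meta.
struct ToyTable {
  std::vector<std::pair<InternalKey, std::string>> entries;
  std::string smallest_user_key;
  std::string largest_user_key;
};

// Copies every entry in sorted order; deletion markers are NOT filtered out.
ToyTable FlushToTable(const ToyMemTable& mem) {
  ToyTable table;
  for (const auto& kv : mem) {
    if (table.entries.empty()) table.smallest_user_key = kv.first.user_key;
    table.largest_user_key = kv.first.user_key;
    table.entries.emplace_back(kv.first, kv.second);
  }
  return table;
}
```

With a Memtable holding a deletion marker for "a" at sequence 5, an older value for "a" at sequence 3, and a value for "b", the flushed table holds all three entries, with the marker first.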
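2) Then it calls versions_->LogAndApply() to derive a new version from the current version and the changes in edit.
2. Trivial compaction
As analyzed earlier, is_manual defaults to false, so PickCompaction() is called to choose the level to compact and the corresponding input files.
When c->IsTrivialMove() holds, the file is moved directly to the next level: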
- c = versions_->PickCompaction();
-
- Status status;
- if (c == NULL) {
-
- } else if (!is_manual && c->IsTrivialMove()) {
-
- assert(c->num_input_files(0) == 1);
- FileMetaData* f = c->input(0, 0);
- c->edit()->DeleteFile(c->level(), f->number);
- c->edit()->AddFile(c->level() + 1, f->number, f->file_size,
- f->smallest, f->largest);
- status = versions_->LogAndApply(c->edit(), &mutex_);
- }
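1) First, PickCompaction() is called to prepare the input data for the upcoming compaction.
From the earlier analysis of the Compaction data structures, a compaction can be triggered in two ways:
- a level holds too many files (size_compaction, compaction_score_ >= 1)
- a file has been seeked more times than allowed (seek_compaction, file_to_compact_ != NULL)
When choosing what to compact, the first case takes priority: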
- Compaction* VersionSet::PickCompaction() {
- Compaction* c;
- int level;
-
- const bool size_compaction = (current_->compaction_score_ >= 1);
- const bool seek_compaction = (current_->file_to_compact_ != NULL);
- if (size_compaction) {
- level = current_->compaction_level_;
- c = new Compaction(level);
-
- for (size_t i = 0; i < current_->files_[level].size(); i++) {
- FileMetaData* f = current_->files_[level][i];
- if (compact_pointer_[level].empty() ||
- icmp_.Compare(f->largest.Encode(), compact_pointer_[level]) > 0) {
- c->inputs_[0].push_back(f);
- break;
- }
- }
- if (c->inputs_[0].empty()) {
- c->inputs_[0].push_back(current_->files_[level][0]);
- }
- } else if (seek_compaction) {
- level = current_->file_to_compact_level_;
- c = new Compaction(level);
- c->inputs_[0].push_back(current_->file_to_compact_);
- } else {
- return NULL;
- }
-
- c->input_version_ = current_;
- c->input_version_->Ref();
-
-
- if (level == 0) {
- InternalKey smallest, largest;
- GetRange(c->inputs_[0], &smallest, &largest);
- current_->GetOverlappingInputs(0, &smallest, &largest, &c->inputs_[0]);
- assert(!c->inputs_[0].empty());
- }
- SetupOtherInputs(c);
- return c;
- }
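The file selection in the size_compaction branch above amounts to a round-robin scan over the level. The sketch below isolates that loop; ToyFile and PickFileForSizeCompaction are hypothetical names, and plain strings stand in for encoded internal keys. It assumes that compact_pointer holds the key where the previous compaction of this level stopped, which the excerpt does not show being set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified file metadata; strings stand in for internal keys.
struct ToyFile {
  int number;
  std::string smallest;
  std::string largest;
};

// Returns the index of the first file whose largest key lies beyond
// compact_pointer. If the pointer is empty, or no file lies beyond it,
// the scan wraps around to the first file. `files` must be non-empty.
std::size_t PickFileForSizeCompaction(const std::vector<ToyFile>& files,
                                      const std::string& compact_pointer) {
  for (std::size_t i = 0; i < files.size(); i++) {
    if (compact_pointer.empty() || files[i].largest > compact_pointer) {
      return i;
    }
  }
  return 0;  // wrap around to the beginning of the key space
}
```

Successive compactions of the same level therefore walk through its key space instead of repeatedly picking the same file.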
2) Next, check whether this is a trivial compaction:
- bool Compaction::IsTrivialMove() const {
- return (num_input_files(0) == 1 &&
- num_input_files(1) == 0 &&
- TotalFileSize(grandparents_) <= kMaxGrandParentOverlapBytes);
- }
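As the code shows, a compaction is trivial when there is exactly one input file at level, no overlapping input file at level + 1, and the total size of the overlapping grandparent files does not exceed kMaxGrandParentOverlapBytes. In that case the file at level simply needs to be moved to level + 1.
3) Then the compaction is completed: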
- c->edit()->DeleteFile(c->level(), f->number);
- c->edit()->AddFile(c->level() + 1, f->number, f->file_size,f->smallest, f->largest);
- status = versions_->LogAndApply(c->edit(), &mutex_);
The file is deleted from level and added to level + 1, and then LogAndApply() is called to obtain the new Version.
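The point of a trivial move is that no data is rewritten; only metadata changes. A minimal sketch of that idea follows. ToyVersionEdit, IsTrivialMove (as a free function) and MakeTrivialMoveEdit are hypothetical simplifications, not LevelDB's classes, and the size limit is passed in as a parameter rather than taken from a real constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, stripped-down edit: (level, file number) pairs only.
struct ToyVersionEdit {
  std::set<std::pair<int, uint64_t>> deleted_files;
  std::vector<std::pair<int, uint64_t>> new_files;
};

// Mirrors the three conditions of the excerpt: one input file at `level`,
// none at `level + 1`, and limited overlap with the grandparent files.
bool IsTrivialMove(std::size_t num_inputs_level,
                   std::size_t num_inputs_next_level,
                   uint64_t grandparent_overlap_bytes,
                   uint64_t max_grandparent_overlap_bytes) {
  return num_inputs_level == 1 && num_inputs_next_level == 0 &&
         grandparent_overlap_bytes <= max_grandparent_overlap_bytes;
}

// The move itself is just "delete from level, add to level + 1"
// with the same file number; the file contents are untouched.
ToyVersionEdit MakeTrivialMoveEdit(int level, uint64_t file_number) {
  ToyVersionEdit edit;
  edit.deleted_files.insert(std::make_pair(level, file_number));
  edit.new_files.push_back(std::make_pair(level + 1, file_number));
  return edit;
}
```

Because the file number is reused, the .sst file on disk stays where it is; only the level recorded for it changes.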
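3. General compaction
This is performed by DBImpl::DoCompactionWork(). The compaction it works on is obtained from VersionSet::PickCompaction(), the same as for a trivial compaction.
Different levels may hold records with the same key but different sequence numbers (seq). As analyzed earlier, the newest data is stored in the lower levels, so its seq is always larger than the seq of records with the same key in level + 1. When records with the same key appear, only the first (newest) one needs to be kept and the rest can be dropped, unless a snapshot still needs an older one (this is the last_sequence_for_key <= compact->smallest_snapshot check in the code).
Level 0 may also contain records with the same key, again with different seq values. The newer the data, the larger its seq. Records are ordered by ascending user_key and, for equal user_key, by descending seq, so records with the same user_key are adjacent and the newest comes first. When compacting into a higher level, only the first record for each user_key needs to be processed; the later records with the same user_key can be dropped.
Therefore, as long as no snapshot pins an older version, the merged files in level + 1 contain no two records with the same key.
Deletions are also carried out at this point. A deletion marker is dropped, rather than written to the higher-level file, when its seq is not larger than smallest_snapshot and IsBaseLevelForKey() returns true for its user_key; the older records it shadows are dropped by the rule above.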
- Status DBImpl::DoCompactionWork(CompactionState* compact) {
- if (snapshots_.empty()) {
- compact->smallest_snapshot = versions_->LastSequence();
- } else {
- compact->smallest_snapshot = snapshots_.oldest()->number_;
- }
- mutex_.Unlock();
-
- Iterator* input = versions_->MakeInputIterator(compact->compaction);
- input->SeekToFirst();
- Status status;
- ParsedInternalKey ikey;
- std::string current_user_key;
- bool has_current_user_key = false;
- SequenceNumber last_sequence_for_key = kMaxSequenceNumber;
- for (; input->Valid() && !shutting_down_.Acquire_Load(); ) {
- if (has_imm_.NoBarrier_Load() != NULL) {
- mutex_.Lock();
- if (imm_ != NULL) {
- CompactMemTable();
- bg_cv_.SignalAll();
- }
- mutex_.Unlock();
- }
-
- Slice key = input->key();
- if (compact->compaction->ShouldStopBefore(key) &&
- compact->builder != NULL) {
- status = FinishCompactionOutputFile(compact, input);
- }
-
- bool drop = false;
- if (!ParseInternalKey(key, &ikey)) {
- current_user_key.clear();
- has_current_user_key = false;
- last_sequence_for_key = kMaxSequenceNumber;
- } else {
- if (!has_current_user_key ||
- user_comparator()->Compare(ikey.user_key,
- Slice(current_user_key)) != 0) {
-
- current_user_key.assign(ikey.user_key.data(), ikey.user_key.size());
- has_current_user_key = true;
- last_sequence_for_key = kMaxSequenceNumber;
- }
-
- if (last_sequence_for_key <= compact->smallest_snapshot) {
- drop = true;
- } else if (ikey.type == kTypeDeletion &&
- ikey.sequence <= compact->smallest_snapshot &&
- compact->compaction->IsBaseLevelForKey(ikey.user_key)) {
- drop = true;
- }
- last_sequence_for_key = ikey.sequence;
- }
-
- if (!drop) {
- if (compact->builder == NULL) {
- status = OpenCompactionOutputFile(compact);
- }
- if (compact->builder->NumEntries() == 0) {
- compact->current_output()->smallest.DecodeFrom(key);
- }
- compact->current_output()->largest.DecodeFrom(key);
- compact->builder->Add(key, input->value());
-
- if (compact->builder->FileSize() >=
- compact->compaction->MaxOutputFileSize()) {
- status = FinishCompactionOutputFile(compact, input);
- }
- }
- input->Next();
- }
-
- if (status.ok() && compact->builder != NULL) {
- status = FinishCompactionOutputFile(compact, input);
- }
- if (status.ok()) {
- status = input->status();
- }
- delete input;
- input = NULL;
-
- mutex_.Lock();
- if (status.ok()) {
- status = InstallCompactionResults(compact);
- }
- return status;
- }
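The drop rules inside the loop above can be isolated into a small, self-contained sketch. Entry and CompactEntries are hypothetical names; the input vector stands in for the merged input iterator and is assumed to be sorted by user_key ascending and sequence descending, and a single is_base_level flag replaces the per-key IsBaseLevelForKey() call.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical simplified record; values are omitted.
struct Entry {
  std::string user_key;
  uint64_t sequence;
  bool is_deletion;
};

// Applies the two drop rules of the excerpt to an already sorted input.
std::vector<Entry> CompactEntries(const std::vector<Entry>& sorted_input,
                                  uint64_t smallest_snapshot,
                                  bool is_base_level) {
  const uint64_t kMaxSequence = std::numeric_limits<uint64_t>::max();
  std::vector<Entry> output;
  std::string current_user_key;
  bool has_current_user_key = false;
  uint64_t last_sequence_for_key = kMaxSequence;

  for (const Entry& e : sorted_input) {
    if (!has_current_user_key || e.user_key != current_user_key) {
      // First occurrence of this user key: reset the per-key state.
      current_user_key = e.user_key;
      has_current_user_key = true;
      last_sequence_for_key = kMaxSequence;
    }
    bool drop = false;
    if (last_sequence_for_key <= smallest_snapshot) {
      // Rule A: a newer entry for this key is already visible to every snapshot.
      drop = true;
    } else if (e.is_deletion && e.sequence <= smallest_snapshot &&
               is_base_level) {
      // Rule B: the marker is visible to every snapshot and hides nothing
      // in higher levels, so it is obsolete.
      drop = true;
    }
    last_sequence_for_key = e.sequence;
    if (!drop) output.push_back(e);
  }
  return output;
}
```

For the input a@9, a@5, a@2, b@7 (deletion), b@3 with smallest_snapshot = 6, only a@2 is dropped, because a snapshot at 6 still needs a@5 and the marker b@7 is newer than the snapshot. With smallest_snapshot = 10 and is_base_level = true, only a@9 survives.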
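The surviving records are first written to .sst files, and the related information is saved in the variable compact. InstallCompactionResults() is then called to add the resulting changes to the VersionEdit, and LogAndApply() is called to obtain the new version.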
- Status DBImpl::InstallCompactionResults(CompactionState* compact) {
- mutex_.AssertHeld();
-
- compact->compaction->AddInputDeletions(compact->compaction->edit());
- const int level = compact->compaction->level();
- for (size_t i = 0; i < compact->outputs.size(); i++) {
- const CompactionState::Output& out = compact->outputs[i];
- compact->compaction->edit()->AddFile(level + 1,
- out.number, out.file_size, out.smallest, out.largest);
- }
- return versions_->LogAndApply(compact->compaction->edit(), &mutex_);
- }
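4. LogAndApply()
In all three kinds of compaction above, once the VersionEdit describing the changes to the current version is complete, LogAndApply() is called to apply the changes and create the new version.
edit holds the files to be deleted from and added to level and level + 1.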
- Status VersionSet::LogAndApply(VersionEdit* edit, port::Mutex* mu) {
-
- Version* v = new Version(this);
- {
- Builder builder(this, current_);
- builder.Apply(edit);
- builder.SaveTo(v);
- }
- Finalize(v);
-
- std::string new_manifest_file;
- Status s;
- if (descriptor_log_ == NULL) {
- assert(descriptor_file_ == NULL);
- new_manifest_file = DescriptorFileName(dbname_, manifest_file_number_);
- edit->SetNextFile(next_file_number_);
- s = env_->NewWritableFile(new_manifest_file, &descriptor_file_);
- if (s.ok()) {
- descriptor_log_ = new log::Writer(descriptor_file_);
- s = WriteSnapshot(descriptor_log_);
- }
- }
- {
- mu->Unlock();
- if (s.ok()) {
- std::string record;
- edit->EncodeTo(&record);
- s = descriptor_log_->AddRecord(record);
- if (s.ok()) {
- s = descriptor_file_->Sync();
- }
- }
- if (s.ok() && !new_manifest_file.empty()) {
- s = SetCurrentFile(env_, dbname_, manifest_file_number_);
- }
- mu->Lock();
- }
-
- if (s.ok()) {
- AppendVersion(v);
- log_number_ = edit->log_number_;
- prev_log_number_ = edit->prev_log_number_;
- }
- return s;
- }
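To restore the database to its previous state after a restart, the history of changes to the database must be recorded; this information is stored in the Manifest file. To save space and time, leveldb writes one complete snapshot of the database state when the Manifest file is created (WriteSnapshot()) and afterwards records only the changes, that is, the contents of each VersionEdit (descriptor_log_->AddRecord()). On recovery, the previous state can be rebuilt step by step from the information in the Manifest alone.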
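The "one snapshot, then only deltas" scheme can be sketched as follows. This is a toy model and not the Manifest format: ToyRecord, MakeSnapshot, ApplyRecord and Replay are hypothetical names, and the database state is reduced to a set of (level, file number) pairs.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using FileId = std::pair<int, uint64_t>;  // (level, file number)
using LiveFiles = std::set<FileId>;

// Hypothetical log record: files removed and files added by one edit.
struct ToyRecord {
  std::vector<FileId> deleted;
  std::vector<FileId> added;
};

// The first record of a log: the full state expressed as "add everything".
ToyRecord MakeSnapshot(const LiveFiles& state) {
  ToyRecord record;
  record.added.assign(state.begin(), state.end());
  return record;
}

// Applies one record to the state: deletions first, then additions.
void ApplyRecord(const ToyRecord& record, LiveFiles* state) {
  for (const FileId& f : record.deleted) state->erase(f);
  for (const FileId& f : record.added) state->insert(f);
}

// Recovery: start from nothing and replay every record in order.
LiveFiles Replay(const std::vector<ToyRecord>& manifest) {
  LiveFiles state;
  for (const ToyRecord& record : manifest) ApplyRecord(record, &state);
  return state;
}
```

Replaying a snapshot of {(0,1), (1,2)}, followed by a record that moves file 1 from level 0 to level 1 and a record that adds file 3 at level 0, reproduces the final state without ever storing it in full a second time.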
1) First a new Version is created, and builder.Apply(edit) is called to record all the files that edit deletes or adds. It is implemented as follows:
-
- void Apply(VersionEdit* edit) {
-
- for (size_t i = 0; i < edit->compact_pointers_.size(); i++) {
- const int level = edit->compact_pointers_[i].first;
- vset_->compact_pointer_[level] =
- edit->compact_pointers_[i].second.Encode().ToString();
- }
-
- const VersionEdit::DeletedFileSet& del = edit->deleted_files_;
- for (VersionEdit::DeletedFileSet::const_iterator iter = del.begin();
- iter != del.end();++iter) {
- const int level = iter->first;
- const uint64_t number = iter->second;
- levels_[level].deleted_files.insert(number);
- }
-
- for (size_t i = 0; i < edit->new_files_.size(); i++) {
- const int level = edit->new_files_[i].first;
- FileMetaData* f = new FileMetaData(edit->new_files_[i].second);
- f->refs = 1;
- f->allowed_seeks = (f->file_size / 16384);
- if (f->allowed_seeks < 100) f->allowed_seeks = 100;
- levels_[level].deleted_files.erase(f->number);
- levels_[level].added_files->insert(f);
- }
- }
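The seek budget assigned to each new file in Apply() is a simple formula: one allowed seek per 16384 bytes of file size, with a floor of 100. A minimal sketch, with AllowedSeeks as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// One seek per 16 KiB of file size, but never fewer than 100 seeks.
int AllowedSeeks(uint64_t file_size) {
  int allowed = static_cast<int>(file_size / 16384);
  if (allowed < 100) allowed = 100;
  return allowed;
}
```

Under this formula a 2 MiB file gets 128 seeks, and any file smaller than about 1.6 MB gets the minimum of 100. Once the budget is used up, the file becomes a candidate for the seek-triggered compaction described earlier.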
2) Then builder.SaveTo(v) is called to save the changes into the new Version. It is implemented as follows:
- void SaveTo(Version* v) {
- BySmallestKey cmp;
- cmp.internal_comparator = &vset_->icmp_;
- for (int level = 0; level < config::kNumLevels; level++) {
- const std::vector<FileMetaData*>& base_files = base_->files_[level];
- std::vector<FileMetaData*>::const_iterator base_iter = base_files.begin();
- std::vector<FileMetaData*>::const_iterator base_end = base_files.end();
- const FileSet* added = levels_[level].added_files;
- v->files_[level].reserve(base_files.size() + added->size());
- for (FileSet::const_iterator added_iter = added->begin();
- added_iter != added->end();++added_iter) {
-
- for (std::vector<FileMetaData*>::const_iterator bpos
- = std::upper_bound(base_iter, base_end, *added_iter, cmp);
- base_iter != bpos;++base_iter) {
- MaybeAddFile(v, level, *base_iter);
- }
- MaybeAddFile(v, level, *added_iter);
- }
- for (; base_iter != base_end; ++base_iter) {
- MaybeAddFile(v, level, *base_iter);
- }
- }
- }
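bpos = std::upper_bound(base_iter,base_end,*added_iter,cmp); // returns the first iterator in [base_iter, base_end) whose file compares greater than *added_iter under cmp
Note that cmp is BySmallestKey, so files are ordered by their smallest key, not by file number. Suppose the existing files have smallest keys 1, 3, 4, 6, 8 and the added files have smallest keys 2, 5, 7. In the first iteration bpos is the iterator for 3, so base_iter advances over one element only, and the file with key 1 is added to the new Version.
In general, for each added file, the base files that sort before it are added first, then the added file itself, and then the loop moves on to the next added file. The final order is 1, 2, 3, 4, 5, 6, 7, 8; that is, the .sst files of each level end up sorted by smallest key.
This yields all the files of every level of the new Version.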
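The merge performed by SaveTo() for one level can be sketched in isolation. ToyFile, BySmallestKeyToy and MergeFiles are hypothetical names; strings stand in for internal keys, both inputs are assumed to be already sorted under the comparator, and the filtering of deleted files done by MaybeAddFile() is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplified file metadata.
struct ToyFile {
  std::string smallest;
  uint64_t number;
};

// Orders files by smallest key; the file number only breaks ties.
struct BySmallestKeyToy {
  bool operator()(const ToyFile& a, const ToyFile& b) const {
    if (a.smallest != b.smallest) return a.smallest < b.smallest;
    return a.number < b.number;
  }
};

// Merges two sorted file lists the same way the SaveTo() loop does.
std::vector<ToyFile> MergeFiles(const std::vector<ToyFile>& base,
                                const std::vector<ToyFile>& added) {
  BySmallestKeyToy cmp;
  std::vector<ToyFile> merged;
  merged.reserve(base.size() + added.size());
  auto base_iter = base.begin();
  for (const ToyFile& a : added) {
    // Emit every base file that sorts before (or equal to) the added file.
    auto bpos = std::upper_bound(base_iter, base.end(), a, cmp);
    for (; base_iter != bpos; ++base_iter) merged.push_back(*base_iter);
    merged.push_back(a);
  }
  // Emit the remaining base files.
  for (; base_iter != base.end(); ++base_iter) merged.push_back(*base_iter);
  return merged;
}
```

Merging base files with smallest keys a, c, d, f, h (numbers 11 to 15) and added files with smallest keys b, e, g (numbers 21 to 23) gives the key order a through h, while the file numbers come out as 11, 21, 12, 13, 22, 14, 23, 15. The result is sorted by key, not by number.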