[Translation] Understanding LSM Trees: the Mechanism Behind Write-Heavy Databases

Original article link

A log-structured merge-tree (LSM tree) is a data structure typically used when dealing with write-heavy workloads. The write path is optimized by only performing sequential writes. LSM trees are the core data structure behind many databases, including BigTable, Cassandra, Scylla, and RocksDB.


SSTables


LSM trees are persisted to disk using a Sorted Strings Table (SSTable) format. As indicated by the name, SSTables are a format for storing key-value pairs in which the keys are in sorted order. An SSTable will consist of multiple sorted files called segments. These segments are immutable once they are written to disk. A simple example could look like this:

[figure: example segments, each a list of key-value pairs sorted by key]
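As a rough sketch (with hypothetical keys and values, since the original figure is not reproduced here), a segment can be modeled as an immutable list of key-value pairs sorted by key:

```python
# Hypothetical sketch of two segments. Real SSTables are binary files on
# disk; this only models the property the format guarantees: within each
# immutable segment, entries appear in sorted key order.
segment_1 = [("dog", 52), ("elephant", 1934), ("turtle", 241)]
segment_2 = [("cat", 42), ("dog", 84), ("yak", 17)]

def is_sorted_segment(segment):
    """A well-formed segment has unique keys in ascending order."""
    keys = [key for key, _ in segment]
    return keys == sorted(keys) and len(keys) == len(set(keys))
```

Note that the same key may appear in more than one segment (as `dog` does here); how that is resolved is covered in the sections on reading and compaction below.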

You can see that the key-value pairs within each segment are sorted by the key. We’ll discuss what exactly a segment is and how it gets generated in the next section.


Writing Data


Recall that LSM trees only perform sequential writes. You may be wondering how we sequentially write our data in a sorted format when values may be written in any order. This is solved by using an in-memory tree structure. This is frequently referred to as a memtable, but the underlying data structure is generally some form of a sorted tree like a red-black tree. As writes come in, the data is added to this red-black tree.

[figure: incoming writes inserted into the in-memory red-black tree (memtable)]
Our writes get stored in this red-black tree until the tree reaches a predefined size. Once the red-black tree has enough entries, it is flushed to disk as a segment on disk in sorted order. This allows us to write the segment file as a single sequential write even though the inserts may occur in any order.

[figure: the full memtable flushed to disk as a sorted segment]
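The write path above can be sketched as follows. Python's standard library has no red-black tree, so this sketch buffers writes in a plain dict and sorts at flush time; the observable effect (each segment is written as one sorted, sequential write) is the same. The class name and threshold are illustrative, not from the original article.

```python
class Memtable:
    """Sketch of a memtable: buffers writes in memory and flushes them
    to a sorted segment once a size threshold is reached. A real engine
    keeps entries sorted as they arrive (e.g. in a red-black tree);
    sorting at flush time here is a simplification."""

    def __init__(self, max_entries=1024):
        self.entries = {}           # latest value wins for repeated keys
        self.max_entries = max_entries
        self.flushed_segments = []  # stands in for segment files on disk

    def put(self, key, value):
        self.entries[key] = value
        if len(self.entries) >= self.max_entries:
            self.flush()

    def flush(self):
        # One sequential write of one sorted segment.
        segment = sorted(self.entries.items())
        self.flushed_segments.append(segment)
        self.entries = {}
```

Even if `put` is called with keys in arbitrary order, every flushed segment comes out sorted.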

Reading Data


So how do we find a value in our SSTable? A naive approach would be to scan the segments for the desired key. We would start with the newest segment and work our way back to the oldest segment until we find the key that we’re looking for. This would mean that we are able to retrieve keys that were recently written more quickly. A simple optimization is to keep an in-memory sparse index.


We can use this index to quickly find the offsets for values that would come before and after the key we want. Now we only have to scan a small portion of each segment file based on those bounds. For example, let’s consider a scenario where we want to look up the key dollar in the segment above. We can perform a binary search on our sparse index to find that dollar comes between dog and downgrade. Now we only need to scan from offset 17208 to 19504 in order to find the value (or determine it is missing).
[figure: an in-memory sparse index mapping some of a segment's keys to their byte offsets]
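The dollar lookup can be sketched with a binary search over the sparse index. The `dog`/`downgrade` keys and the offsets 17208 and 19504 come from the article's example; the other index entries are made up for illustration.

```python
import bisect

# Sparse index: a subset of the segment's keys with their byte offsets.
# "dog" -> 17208 and "downgrade" -> 19504 follow the article's example;
# the remaining entries are hypothetical.
sparse_index = [("dance", 14203), ("dog", 17208),
                ("downgrade", 19504), ("draw", 21100)]

def scan_bounds(index, key):
    """Binary-search the sparse index and return the (start, end) byte
    offsets bounding where `key` could live in the segment file.
    `end` is None when the key could only be in the final stretch."""
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, key)             # first index key > `key`
    start = index[i - 1][1] if i > 0 else 0
    end = index[i][1] if i < len(index) else None
    return start, end
```

For the key `dollar`, the search lands between `dog` and `downgrade`, so only the bytes from offset 17208 to 19504 need to be scanned.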

This is a nice improvement, but what about looking up records that do not exist? We will still end up looping over all segment files and fail to find the key in each segment. This is something that a bloom filter can help us out with. A bloom filter is a space-efficient data structure that can tell us if a value is missing from our data. We can add entries to a bloom filter as they are written and check it at the beginning of reads in order to efficiently respond to requests for missing data.

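A minimal bloom filter along these lines can be sketched as below. The sizes and the use of SHA-256 to derive the hash positions are illustrative choices, not part of the original article; the essential property is that membership tests can report false positives but never false negatives, so a "not present" answer safely skips the segment scan.

```python
import hashlib

class BloomFilter:
    """Sketch of a bloom filter: k hash positions over an m-bit array.
    add() sets the key's bits; might_contain() checks them. A False
    answer is definitive; a True answer may be a false positive."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive k positions by salting the key; SHA-256 is an arbitrary
        # choice here -- real filters use cheaper hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))
```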

Compaction


Over time, this system will accumulate more segment files as it continues to run. These segment files need to be cleaned up and maintained in order to prevent the number of segment files from getting out of hand. This is the responsibility of a process called compaction. Compaction is a background process that is continuously combining old segments together into newer segments.

[figure: compaction merging older segments (1, 2, 3) into a new segment 4]

You can see in the example above that segments 1 and 2 both have a value for the key dog. Newer segments contain the latest values written, so the value from segment 2 is what gets carried forward into segment 4. Once the compaction process has written a new segment for the input segments, the old segment files are deleted.

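Compaction can be sketched as a merge that keeps only the newest value per key. Real engines stream-merge the sorted segment files without loading them whole; a dict is enough to show the resolution rule. The values below are hypothetical.

```python
def compact(segments):
    """Sketch of compaction: merge segments (ordered oldest to newest)
    into one new sorted segment. Because later segments overwrite
    earlier entries in the dict, the newest value per key wins."""
    merged = {}
    for segment in segments:
        for key, value in segment:
            merged[key] = value
    return sorted(merged.items())
```

With segment 1 holding `("dog", 52)` and the newer segment 2 holding `("dog", 84)`, the merged output carries forward `("dog", 84)`.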

Deleting Data


We’ve covered reading and writing data, but what about deleting data? How do you delete data from the SSTable when the segment files are considered immutable? Deletes actually follow the exact same path as writing data. Whenever a delete request is received, a unique marker called a tombstone is written for that key.

[figure: segment entries for the key dog over time, ending in a tombstone marker]
The example above shows that the key dog had the value 52 at some point in the past, but now it has a tombstone marker. This indicates that if we receive a request for the key dog then we should return a response indicating that the key does not exist. This means that delete requests actually take up disk space initially, which many developers may find surprising. Eventually, tombstones will get compacted away so that the value no longer exists on disk.

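The tombstone mechanism can be sketched like this. The sentinel object and function names are hypothetical; the point is that a delete writes a marker through the normal write path, and reads that hit the marker report the key as absent.

```python
TOMBSTONE = object()  # unique sentinel meaning "this key was deleted"

def delete(memtable_entries, key):
    """Deletes follow the write path: record a tombstone for the key."""
    memtable_entries[key] = TOMBSTONE

def lookup(segments, key):
    """Scan segments newest-first (segments are listed oldest to
    newest); the first hit wins, and a tombstone means 'not found'."""
    for segment in reversed(segments):
        for k, v in segment:
            if k == key:
                return None if v is TOMBSTONE else v
    return None
```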

Conclusion


We now understand how a basic LSM tree storage engine works:

  • Writes are stored in an in-memory tree (also known as a memtable). Any supporting data structures (bloom filters and sparse index) are also updated if necessary.
  • When this tree becomes too large it is flushed to disk with the keys in sorted order.
  • When a read comes in we check the bloom filter. If the bloom filter indicates that the value is not present then we tell the client that the key could not be found. If the bloom filter indicates that the value is present then we begin iterating over our segment files from newest to oldest.
  • For each segment file, we check a sparse index and scan the offsets where we expect the key to be found until we find the key. We’ll return the value as soon as we find it in a segment file.
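The read path summarized in the bullets above can be sketched end to end. The function signature is hypothetical; the bloom filter is passed in as any callable that answers "could this key exist?", and the per-segment scan stands in for the sparse-index-bounded scan described earlier.

```python
def read(key, memtable, segments, might_contain):
    """Sketch of the read path: bloom filter first, then the in-memory
    memtable, then segment files from newest to oldest (segments are
    listed oldest to newest)."""
    if not might_contain(key):
        return None                     # bloom filter: definitely absent
    if key in memtable:
        return memtable[key]            # most recent write, still in memory
    for segment in reversed(segments):
        for k, v in segment:
            if k == key:                # a sparse index would narrow this scan
                return v
    return None
```

Because newer segments are consulted first, a key rewritten since an older flush always resolves to its latest value.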
