Note: CHAPTER 3 Storage and Retrieval

Article link: https://blog.youkuaiyun.com/haiboself/article/details/100862668

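This article looks at storage engine optimizations for online transaction processing (OLTP) and online analytical processing (OLAP), focusing on hash indexes, SSTables and LSM-Trees, B-Trees, and column-oriented storage. For OLTP, LSM-Trees improve write performance through sequential writes and merging, at the possible cost of slower reads, while B-Trees offer efficient lookups and range queries and suit transaction processing. For OLAP, column storage and data cubes support efficient analysis of large volumes of data. The article also discusses the pros and cons of the various index structures and how to balance read and write performance.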

How can we store the data that we're given?
How can we find it again when we're asked for it?
How do storage engines optimized for transactional workloads (OLTP) differ from those optimized for analytics (OLAP)?
How do log-structured storage engines (such as LSM-Trees) compare with page-oriented storage engines (such as B-Trees)?
What types of indexes exist, and how do they work?

OLTP(online transaction processing) Overview

A transaction needn't necessarily have ACID (atomicity, consistency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes, as opposed to batch processing jobs, which only run periodically (for example, once per day).

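Storage: Sequential writes are simpler and more efficient than random writes, so many databases internally use a log: an append-only data file holding a sequence of records. Appending to a file is the simplest possible write operation.

Retrieval: OLTP applications are usually user-facing. They look up data by key, return small result sets, and are sensitive to response time, so the storage engine uses an index to speed up queries. Disk seek time is often the bottleneck here.

Index: An index is an additional structure, derived from the primary data, that speeds up queries. Maintaining an index has a cost, especially on writes, because the index must be updated every time data is written. Since an index speeds up reads but slows down writes, deciding which indexes to create is a trade-off left to the developer.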

Hash Indexes with an Append-Only Log

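Storage: Data is stored on disk as (k, v) pairs, and every write is appended to the end of the file (append-only).
Retrieval: A hash index is kept in memory, mapping each key to the byte offset of its latest record in the data file. The index is updated every time a record is appended.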
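
The storage and retrieval scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bitcask's actual implementation: the class name `HashIndexedLog`, the one-line-per-record `key,value` text format, and the restriction that keys contain no commas or newlines are all hypothetical choices made to keep the example short.

```python
import os
import tempfile


class HashIndexedLog:
    """Toy key-value store: an append-only data file plus an in-memory hash index."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        # Binary append mode: every write lands at the end of the file.
        self.file = open(path, "ab")

    def set(self, key, value):
        # Assumed record format: "key,value\n" (keys must not contain ',' or '\n').
        record = f"{key},{value}\n".encode("utf-8")
        offset = self.file.seek(0, os.SEEK_END)  # where this record will start
        self.file.write(record)
        self.file.flush()
        # Update the index at write time so reads never need to scan the log.
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        # One seek to the recorded offset, then read a single record.
        with open(self.path, "rb") as reader:
            reader.seek(offset)
            line = reader.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value


if __name__ == "__main__":
    db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "data.log"))
    db.set("user:1", "alice")
    db.set("user:1", "bob")  # the old record stays in the file; the index moves on
    print(db.get("user:1"))
```

Note that overwriting a key never touches the old record. It simply becomes unreachable through the index, which is why the file keeps growing until it is compacted.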

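Appending to a single file forever would eventually run out of disk space. To address this:

Split the log into segments of a fixed size. When a segment file reaches that size, close it and write subsequent records to a new segment file.
Periodically run compaction on segment files (throwing away duplicate keys in the log, and keeping only the most recent update for each key). Several segment files can also be merged into one during compaction.
A closed segment file is never modified, so compaction writes its output to a new file. While compaction runs, read and write requests continue to be served as normal using the old segment files. Once compaction finishes, reads switch to the new segment file and the old segment files are deleted.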
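
The segmentation and compaction steps above can be sketched as follows. To keep the example self-contained, segments are modeled as in-memory lists of `(key, value)` records rather than files, and the segment size limit is counted in records rather than bytes. The names `SegmentedLog`, `merge_segments`, and `max_records` are hypothetical.

```python
def merge_segments(segments):
    """Merge segments (ordered oldest to newest) into one compacted segment.

    Only the most recent value for each key survives.
    """
    latest = {}
    for segment in segments:
        for key, value in segment:  # records are in write order
            latest[key] = value     # later writes overwrite earlier ones
    return list(latest.items())


class SegmentedLog:
    def __init__(self, max_records):
        self.max_records = max_records
        self.segments = [[]]  # the last segment is the active (writable) one

    def append(self, key, value):
        # Roll over to a new segment once the active one is full.
        if len(self.segments[-1]) >= self.max_records:
            self.segments.append([])
        self.segments[-1].append((key, value))

    def get(self, key):
        # Search newest segment first, and newest record first within a segment.
        for segment in reversed(self.segments):
            for k, v in reversed(segment):
                if k == key:
                    return v
        return None

    def compact_frozen(self):
        # Build a NEW merged segment from the closed ones, then swap it in.
        # The old segments are left untouched until the swap.
        frozen, active = self.segments[:-1], self.segments[-1]
        if frozen:
            self.segments = [merge_segments(frozen), active]


if __name__ == "__main__":
    log = SegmentedLog(max_records=2)
    for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]:
        log.append(k, v)
    log.compact_frozen()
    print(log.segments)
```

In this sketch a lookup scans each segment linearly. A fuller version would keep one hash map per segment, as described earlier, and check the maps from newest to oldest.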

Some points to consider when implementing this scheme:

File format: use a binary format that first encodes the length of a string in bytes, followed by the raw string
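Deleting records: To delete a key and its value, append a special deletion record (sometimes called a tombstone) to the data file. When segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
Crash recovery: If the database is restarted, the in-memory hash maps are lost. They can be rebuilt by reading each segment file from beginning to end, but that is slow for large segments. Bitcask speeds up recovery by storing a snapshot of each segment's hash map on disk, which can be loaded into memory more quickly.
Partially written records: The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.
Concurrency control: Writes are appended in strictly sequential order, so a common choice is a single writer thread. Because each file is either append-only or immutable, it can be read concurrently by multiple threads.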
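
Three of the points above (a length-prefixed binary format, tombstones, and checksums for detecting partially written records) can be combined in one small sketch. The record layout used here is an assumption made for illustration and is not Bitcask's actual on-disk format: a CRC32 checksum, then the key length and value length as 4-byte big-endian integers, then the raw key and value bytes. The `TOMBSTONE` marker value is likewise hypothetical.

```python
import struct
import zlib

TOMBSTONE = 0xFFFFFFFF           # assumed marker: this value length means "deleted"
HEADER = struct.Struct(">III")   # crc32, key length, value length


def encode_record(key, value):
    """Encode one record as bytes; value=None writes a tombstone."""
    body = b"" if value is None else value
    value_len = TOMBSTONE if value is None else len(value)
    lengths = struct.pack(">II", len(key), value_len)
    # The checksum covers everything after the checksum field itself.
    crc = zlib.crc32(lengths + key + body)
    return struct.pack(">I", crc) + lengths + key + body


def replay_log(data):
    """Rebuild the key -> value map, stopping at the first torn or corrupt record."""
    result = {}
    pos = 0
    while pos + HEADER.size <= len(data):
        crc, key_len, value_len = HEADER.unpack_from(data, pos)
        body_len = 0 if value_len == TOMBSTONE else value_len
        key_start = pos + HEADER.size
        end = key_start + key_len + body_len
        if end > len(data):
            break  # record was only partially written before a crash
        if zlib.crc32(data[pos + 4:end]) != crc:
            break  # checksum mismatch: ignore the corrupted tail
        key = data[key_start:key_start + key_len]
        if value_len == TOMBSTONE:
            result.pop(key, None)  # tombstone discards earlier values
        else:
            result[key] = data[key_start + key_len:end]
        pos = end
    return result


if __name__ == "__main__":
    log = encode_record(b"a", b"1") + encode_record(b"b", b"2") + encode_record(b"a", None)
    torn = log + encode_record(b"c", b"3")[:-1]  # simulate a crash mid-append
    print(replay_log(torn))
```

Because the lengths are stored up front, keys and values can hold arbitrary bytes without any escaping, which is the main advantage of this kind of format over a delimited text format.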

Advantages:

Appending and segment merging are sequential write operations, which are generally much faster than random writes.
Concurrency and crash recovery are much simpler if segment files are append-only or immutable
Merging old segments avoids the problem of data files getting fragmented over time.

Limitations: