- How can we store the data that we’re given?
- How can we find it again when we’re asked for it?
- How does a storage engine optimized for transactional workloads (OLTP) differ from one optimized for analytics (OLAP)?
- How do log-structured storage engines (such as LSM-Trees) compare with page-oriented storage engines (such as B-Trees)?
- What types of index exist, and how do they work?
OLTP (online transaction processing) Overview
A transaction needn’t necessarily have ACID (atomicity, consistency, isolation, and durability) properties. Transaction processing just means allowing clients to make low-latency reads and writes, as opposed to batch processing jobs, which only run periodically (for example, once per day).
Storage:
Sequential writes are simpler and more efficient than random writes, so many databases internally use a log: an append-only sequence of records kept in a data file that is only ever appended to (the simplest possible write operation).
Retrieval:
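TP applications are usually user-facing: they look up data by key, the result sets are usually small, and response time matters, so the storage engine uses an index to speed up queries. Disk seek time is often the bottleneck here.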
Index:
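An index uses additional storage (derived from the primary data) to speed up queries. Maintaining an index has a cost, especially on writes, because the index must be updated every time data is written. Since an index speeds up queries but slows down writes, developers have to weigh this trade-off when deciding which indexes to create.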
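To make the motivation for an index concrete, here is a minimal sketch of a key-value store built on an append-only log with no index at all. The class name `LogDB`, the comma-separated text format, and the file name are assumptions made for illustration. Writes are a single append, but every read scans the whole file.

```python
import os
import tempfile


class LogDB:
    """Append-only key-value log with no index (illustrative only)."""

    def __init__(self, path):
        self.path = path
        # Create the file if it does not exist yet.
        open(self.path, "a", encoding="utf-8").close()

    def set(self, key, value):
        # A write is one sequential append to the end of the file.
        # Assumption: keys contain no commas and values contain no newlines.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{key},{value}\n")

    def get(self, key):
        # A read scans the entire log; the last match is the latest value.
        result = None
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
        return result


path = os.path.join(tempfile.mkdtemp(), "database.log")
db = LogDB(path)
db.set("42", "San Francisco")
db.set("123456", "London")
db.set("42", "San Francisco, CA")  # the old value stays in the log
latest = db.get("42")
```

The cost of `get` grows linearly with the size of the log, which is the problem an index is meant to solve.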
Hash Indexes with an Append-Only Log
Storage:
Data is stored on disk as (k, v) pairs, and new pairs are always appended to the end of the file (append-only).
Retrieval:
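A hash index is kept in memory, mapping each key to the byte offset of its latest value in the data file. The index is updated every time a record is appended.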
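The storage and retrieval scheme above can be sketched as follows. This is a simplified illustration, not Bitcask's actual implementation: the class name `HashIndexedLog` and the comma-separated record format are assumptions, and a plain Python `dict` stands in for the hash index.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only data file plus an in-memory hash index (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(self.path, "ab").close()

    def set(self, key, value):
        # Assumption: keys contain no commas and values contain no newlines.
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)   # make sure tell() reports the end of the file
            offset = f.tell()
            f.write(record)
        self.index[key] = offset     # update the index together with the append

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)           # one seek instead of a full scan
            line = f.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value


db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "data.log"))
db.set("a", "1")
db.set("b", "2")
db.set("a", "3")  # the index now points at the newer record for "a"
```

A lookup costs one hash-map access plus one seek, provided all keys fit in memory.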
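If we only ever append to a file, we eventually run out of disk space. To address this:
- Break the log into segments of a fixed size: when a segment file reaches that size, close it and write subsequent records to a new segment file.
- Periodically perform compaction on segment files (throwing away duplicate keys in the log, and keeping only the most recent update for each key). Several segment files can also be merged into one during compaction.
- Closed segment files are immutable, so compaction writes its output to a new file. While compaction is running, the old segment files continue to serve reads, and new writes continue to go to the current segment. Once compaction finishes, reads switch to the new merged segment file and the old segment files are deleted.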
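The segmentation and compaction steps above can be sketched in memory. As simplifying assumptions, each segment is modeled as a Python list of `(key, value)` tuples instead of a file, and the hypothetical `SEGMENT_SIZE` limit counts records where a real engine would count bytes.

```python
SEGMENT_SIZE = 3  # hypothetical limit, in records per segment


def append(segments, key, value):
    """Append to the newest segment, starting a new one when it is full."""
    if not segments or len(segments[-1]) >= SEGMENT_SIZE:
        segments.append([])
    segments[-1].append((key, value))


def compact_and_merge(old_segments):
    """Build one new segment that keeps only the latest value for each key."""
    latest = {}
    for segment in old_segments:       # oldest segment first
        for key, value in segment:     # oldest record first
            latest[key] = value        # later records overwrite earlier ones
    return list(latest.items())        # the input segments are left untouched


def lookup(segments, key):
    """Search from the newest record backwards."""
    for segment in reversed(segments):
        for k, v in reversed(segment):
            if k == key:
                return v
    return None


segments = []
writes = [("mew", 1078), ("purr", 2103), ("purr", 2104), ("mew", 1079),
          ("mew", 1080), ("yawn", 511), ("purr", 2105)]
for k, v in writes:
    append(segments, k, v)

closed = segments[:-1]                 # every segment except the active one
merged = compact_and_merge(closed)     # old segments can still be read meanwhile
segments = [merged] + segments[-1:]    # switch over, then drop the old segments
```

Because `compact_and_merge` never mutates its inputs, readers can keep using the old segments until the switch-over on the last line.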
Some points to watch out for in this scheme:
- File format: use a binary format that first encodes the length of a string in bytes, followed by the raw string
- Deleting records: to delete a key and its value, append a special deletion record (a tombstone) to the data file. When segments are merged, the tombstone tells the merge process to discard any previous values for the deleted key.
- Crash recovery: the in-memory hash maps are lost when the database restarts. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
- Partially written records: include checksums, allowing such corrupted parts of the log to be detected and ignored.
- Concurrency control: use a single writer thread. Because a file is either append-only or immutable, it can be read concurrently by multiple threads.
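Several of the points above (a length-prefixed binary format, tombstones, and checksums) can be combined in one record layout. The layout below is a hypothetical one chosen for illustration, not Bitcask's actual on-disk format. The function names `encode_record` and `decode_records` are likewise assumptions.

```python
import struct
import zlib

# Hypothetical record layout (big-endian, no padding):
#   crc32 (4 bytes) | tombstone flag (1) | key length (4) | value length (4) | key | value
HEADER = struct.Struct(">IBII")


def encode_record(key: bytes, value: bytes = b"", tombstone: bool = False) -> bytes:
    # The lengths come first, so keys and values need no escaping.
    body = struct.pack(">BII", int(tombstone), len(key), len(value)) + key + value
    # The checksum covers everything after the crc32 field itself.
    return struct.pack(">I", zlib.crc32(body)) + body


def decode_records(data: bytes):
    """Yield (key, value, tombstone) until the data ends or a record is damaged."""
    pos = 0
    while pos + HEADER.size <= len(data):
        crc, tombstone, klen, vlen = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + klen + vlen
        if end > len(data):
            break  # partially written record at the end of the log
        if zlib.crc32(data[pos + 4:end]) != crc:
            break  # corrupted record: stop and ignore the rest
        key_start = pos + HEADER.size
        key = data[key_start:key_start + klen]
        value = data[key_start + klen:end]
        yield key, value, bool(tombstone)
        pos = end


log = (encode_record(b"k1", b"v1")
       + encode_record(b"k1", tombstone=True)   # deletion marker for k1
       + encode_record(b"k2", b"v2")[:-1])      # simulate a crash mid-write
records = list(decode_records(log))
```

In this sketch the decoder stops at the first damaged record, which is a simple policy for a crash that truncates the tail of the log. Other policies, such as skipping ahead to the next valid record, are possible.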
Advantages:
- Appending and segment merging are sequential write operations, which are much more efficient than random writes.
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable
- Merging old segments avoids the problem of data files getting fragmented over time.
Limitations:
- The hash table must fit in memory. When the hash table grows too large and has to be kept on disk, lookups become slower, and hash collisions have to be handled as well.
- Range queries are not efficient.
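The range-query limitation can be seen in a small sketch. The index contents and the helper name `range_query` are made up for illustration. Because a hash map keeps no key ordering, finding every key in a range means checking each key in the index.

```python
# Hypothetical in-memory hash index: key -> byte offset in the data file.
index = {"user:0001": 0, "user:0002": 64, "user:1000": 128, "video:7": 192}


def range_query(index, low, high):
    # There is no way to jump to `low` and walk forward in key order,
    # so every key in the hash map has to be examined.
    return sorted(k for k in index if low <= k <= high)


matches = range_query(index, "user:0000", "user:0999")
```

The cost is proportional to the total number of keys, not to the number of keys in the requested range.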