The OS and the disk
The first thing to consider is what we can expect from a database in terms of durability. To do so, we can visualize what happens during a simple write operation:

- 1: The client sends a write command to the database (data is in the client's memory).
- 2: The database receives the write (data is in the server's memory).
- 3: The database calls the system call that writes the data to disk (data is in the kernel's buffer).
- 4: The operating system transfers the write buffer to the disk controller (data is in the disk cache).
- 5: The disk controller actually writes the data to the physical media (a magnetic disk, a NAND chip, ...).
When is our write safe along the line?
If we consider a failure that involves just the database software (the process gets killed by the system administrator or crashes) and does not touch the kernel, the write can be considered safe as soon as step 3 completes with success, that is, after the write(2) system call (or any other system call used to transfer data to the kernel) returns successfully. After this step, even if the database process crashes, the kernel will still take care of transferring the data to the disk controller. If we consider instead a more catastrophic event like a power outage, we are safe only at the completion of step 5, that is, when the data is actually transferred to the physical device memorizing it.
We can summarize that the important stages for data safety are steps 3, 4, and 5. That is:
- How often does the database software transfer its user-space buffers to the kernel buffers, using the write (or equivalent) system call?
- How often does the kernel flush its buffers to the disk controller?
- And how often does the disk controller write data to the physical media?
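The first two stages above correspond to two system calls. A minimal sketch of them in Python, under POSIX assumptions (illustrative code, not how any particular database does it):

```python
import os, tempfile

# Step 3: write(2) moves the data from user space into the kernel's buffers.
# From here on the data survives a crash of the writing process.
fd, path = tempfile.mkstemp()
os.write(fd, b"payload")

# Steps 4-5: fsync(2) forces the kernel buffers through the disk controller
# to the physical media. From here on the data should survive a power outage
# (assuming the disk cache honors the flush).
os.fsync(fd)
os.close(fd)
```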
Append only file
The Append Only File, usually called simply AOF, is the main Redis persistence option. The way it works is extremely simple: every time a write operation that modifies the dataset in memory is performed, the operation gets logged. The log is produced exactly in the same format used by clients to communicate with Redis, so the AOF can even be piped via netcat to another instance, or easily parsed if needed. At restart, Redis re-plays all the operations to reconstruct the dataset.

To show how the AOF works in practice, I'll run a simple experiment, setting up a new Redis 2.6 instance with the append only file enabled:
./redis-server --appendonly yes

Now it's time to send a few write commands to the instance:

redis 127.0.0.1:6379> set key1 Hello
OK
redis 127.0.0.1:6379> append key1 " World!"
(integer) 12
redis 127.0.0.1:6379> del key1
(integer) 1
redis 127.0.0.1:6379> del non_existing_key
(integer) 0

The first three operations actually modified the dataset, the fourth did not, because there was no key with the specified name. This is how our append only file looks:
$ cat appendonly.aof
*2
$6
SELECT
$1
0
*3
$3
set
$4
key1
$5
Hello
*3
$6
append
$4
key1
$7
 World!
*2
$3
del
$4
key1

As you can see the final DEL is missing, because it did not produce any modification to the dataset.
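The format above is easy to process mechanically. A minimal sketch of a parser for it in Python (assuming well-formed input with CRLF line endings; illustrative code, not Redis's):

```python
# Parse an AOF byte stream (the protocol shown above) into a list of
# commands, where each command is a list of its string arguments.
def parse_aof(data: bytes):
    commands = []
    lines = data.split(b"\r\n")
    i = 0
    while i < len(lines) and lines[i]:
        assert lines[i].startswith(b"*")      # argument count, e.g. *3
        argc = int(lines[i][1:])
        i += 1
        args = []
        for _ in range(argc):
            assert lines[i].startswith(b"$")  # argument byte length, e.g. $3
            i += 1                            # skip the length line
            args.append(lines[i].decode())
            i += 1
        commands.append(args)
    return commands

aof = b"*3\r\n$3\r\nset\r\n$4\r\nkey1\r\n$5\r\nHello\r\n*2\r\n$3\r\ndel\r\n$4\r\nkey1\r\n"
print(parse_aof(aof))  # [['set', 'key1', 'Hello'], ['del', 'key1']]
```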
It is as simple as that: new commands received will get logged into the AOF, but only if they have some effect on the actual data. However, not all commands are logged as they are received. For instance, blocking operations on lists are logged for their final effects, as normal non-blocking commands. Similarly, INCRBYFLOAT is logged as SET, using the final value after the increment as payload, so that differences in the way floating point numbers are handled by different architectures will not lead to different results after reloading an AOF file.
So far we know that the Redis AOF is an append-only business, so no corruption is possible. However, this desirable feature can also be a problem: in the above example, after the DEL operation our instance is completely empty, yet the AOF is still a few bytes worth of data. The AOF is an always-growing file, so how do we deal with it when it gets too big?
AOF rewrite
When an AOF is too big, Redis will simply rewrite it from scratch in a temporary file. The rewrite is NOT performed by reading the old one, but by directly accessing data in memory, so that Redis can create the shortest possible AOF, and will not require disk reads while writing the new one. Once the rewrite is finished, the temporary file is synched on disk with fsync and is used to overwrite the old AOF file.
You may wonder what happens to data that is written to the server while the rewrite is in progress. This new data is simply also written to the old (current) AOF file, and at the same time queued into an in-memory buffer, so that when the new AOF is ready we can write this missing part inside it, and finally replace the old AOF file with the new one.
As you can see, everything is still append only, and while the AOF is being rewritten we keep writing everything into the old AOF file, for all the time needed for the new one to be created. This means that for our analysis we can simply ignore the fact that the AOF in Redis gets rewritten at all. So the real questions are: how often do we write(2), and how often do we fsync(2)?
AOF rewrites are generated only using sequential I/O operations, so the whole dump process is efficient even with rotational disks (no random I/O is performed). This is also true for RDB snapshot generation. The complete lack of random I/O accesses is a rare feature among databases, and is possible mostly because Redis serves read operations from memory, so data on disk does not need to be organized for a random access pattern, but just for sequential loading on restart.
AOF durability
This whole article was written to reach this paragraph. I'm glad I'm here, and I'm even more glad you are still here with me.

The Redis AOF uses a user-space buffer that is populated with new data as new commands are executed. The buffer is usually flushed to disk every time we return back into the event loop, using a single write(2) call against the AOF file descriptor, but there are actually three different configurations that change the exact behavior of write(2) and, especially, of fsync(2) calls.
These three configurations are controlled by the appendfsync configuration directive, which can take three different values: no, everysec, always. This configuration can also be queried or modified at runtime using the CONFIG SET command, so you can alter it whenever you want without stopping the Redis instance.
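For example, the fsync policy can be inspected and changed at runtime from redis-cli (a sketch; the exact value returned depends on your configuration):

```
redis 127.0.0.1:6379> CONFIG GET appendfsync
1) "appendfsync"
2) "everysec"
redis 127.0.0.1:6379> CONFIG SET appendfsync always
OK
```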
appendfsync no
In this configuration Redis does not perform fsync(2) calls at all. However it will make sure that clients not using pipelining, that is, clients that wait to receive the reply of a command before sending the next one, will receive an acknowledgement that the command was executed correctly only after the change is transferred to the kernel, by writing the command to the AOF file descriptor using the write(2) system call.

Because in this configuration fsync(2) is not called at all, data will be committed to disk at the kernel's wish, that is, every 30 seconds in most Linux systems.
appendfsync everysec
In this configuration data will be both written to the file using write(2) and flushed from the kernel to the disk using fsync(2) once every second. Usually the write(2) call will actually be performed every time we return to the event loop, but this is not guaranteed.

However, if the disk can't cope with the write speed, and the background fsync(2) call is taking longer than 1 second, Redis may delay the write for up to an additional second (in order to avoid having the write block the main thread because of an fsync(2) running in the background thread against the same file descriptor). If a total of two seconds has elapsed without the fsync(2) being able to finish, Redis finally performs a (likely blocking) write(2) to transfer the data to the disk at any cost.
So in this mode Redis guarantees that, in the worst case, everything you write is committed to the operating system buffers and transferred to the disk within 2 seconds. In the average case data is committed every second.
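The everysec decision described above can be sketched as a small function (a simplification of Redis's internal flush logic; the names and return values here are illustrative, not Redis's):

```python
# Decide what to do with the AOF user-space buffer on one event-loop
# iteration, given how long the buffer has been postponed and whether a
# background fsync(2) is still running against the AOF file descriptor.
def flush_aof(buffer_age_s: float, fsync_in_progress: bool) -> str:
    if fsync_in_progress:
        if buffer_age_s < 2.0:
            return "postpone write"      # wait for the background fsync
        return "blocking write"          # two seconds elapsed: write anyway
    return "write and schedule fsync"    # normal case: write(2) now,
                                         # fsync(2) in a background thread

print(flush_aof(0.5, True))   # postpone write
print(flush_aof(2.5, True))   # blocking write
print(flush_aof(0.0, False))  # write and schedule fsync
```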
appendfsync always
In this mode, if the client does not use pipelining but waits for the replies before issuing new commands, data is both written to the file and synched on disk using fsync(2) before an acknowledgement is returned to the client. This is the highest level of durability that you can get, but it is slower than the other modes.
The default Redis configuration is appendfsync everysec, which provides a good balance between speed (it is almost as fast as appendfsync no) and durability.
What Redis implements when appendfsync is set to always is usually called group commit. This means that instead of using an fsync call for every write operation performed, Redis is able to group these commits into a single write+fsync operation, performed before sending the replies to the group of clients that issued a write operation during the latest event loop iteration.
In practical terms it means that you can have hundreds of clients performing write operations at the same time: the fsync operations will be factorized, so even in this mode Redis should be able to support a thousand concurrent transactions per second, while a rotational device can only sustain 100-200 write operations per second.

This feature is usually hard to implement in a traditional database, but Redis makes it remarkably simpler.
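The group commit idea can be sketched as follows (illustrative names and structure, not Redis's actual implementation): all write commands received during one event-loop iteration land in a single buffer, and one write(2) + fsync(2) pair covers all of them before any client gets its reply.

```python
import os, tempfile

class GroupCommitLog:
    def __init__(self, fd):
        self.fd = fd
        self.buffer = bytearray()
        self.pending_clients = []

    def log(self, client_id, command: bytes):
        # Called once per write command during the event-loop iteration.
        self.buffer += command
        self.pending_clients.append(client_id)

    def commit(self):
        # Called once at the end of the iteration: a single write + fsync
        # covers every command buffered above.
        os.write(self.fd, self.buffer)
        os.fsync(self.fd)
        acked = self.pending_clients
        self.buffer = bytearray()
        self.pending_clients = []
        return acked  # these clients can now receive their replies

fd, path = tempfile.mkstemp()
log = GroupCommitLog(fd)
log.log(1, b"set k1 v1\n")
log.log(2, b"set k2 v2\n")
acked = log.commit()
os.close(fd)
print(acked)  # [1, 2] -- two commands acknowledged after a single fsync
```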
The fsync function flushes all of a file's modified in-memory data to the storage device. Calling fsync also guarantees that the file's modification time is updated. The fsync system call lets you precisely force each write to be committed to disk. Alternatively, you can open a file for synchronous I/O, which causes all written data to be committed to disk immediately; synchronous I/O is enabled by passing the O_SYNC flag to open.
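The two approaches can be sketched in Python (POSIX assumed; os.O_SYNC is not available on Windows):

```python
import os, tempfile

# Approach 1: write(2) followed by an explicit fsync(2).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.fsync(fd)            # data is now committed to the storage device
os.close(fd)

# Approach 2: open with O_SYNC, making every write synchronous.
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
os.write(fd, b"world")  # returns only after the data reaches the device
os.close(fd)
```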
Because the AOF works by continually appending commands to the end of the file, the AOF file grows larger and larger as write commands keep arriving.

For example, if you call INCR 100 times on a counter, the AOF file needs 100 entries just to record the counter's current value.

In practice, however, a single SET command is enough to save the counter's current value; the other 99 entries are redundant.

To handle this situation, Redis supports an interesting feature: the AOF file can be rebuilt without interrupting service to clients.

Running the BGREWRITEAOF command makes Redis generate a new AOF file containing the minimal set of commands needed to rebuild the current dataset.
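A toy illustration of what the rewrite achieves (illustrative code, not Redis's implementation): replaying 100 INCR entries in memory and then emitting the minimal equivalent log, which is a single SET.

```python
# Replay a command log against an in-memory dataset, then emit the
# shortest log that reconstructs the same dataset: one SET per key.
def rewrite_aof(log):
    dataset = {}
    for cmd, key, *args in log:
        if cmd == "SET":
            dataset[key] = int(args[0])
        elif cmd == "INCR":
            dataset[key] = dataset.get(key, 0) + 1
    return [("SET", key, str(value)) for key, value in dataset.items()]

log = [("INCR", "counter")] * 100
print(rewrite_aof(log))  # [('SET', 'counter', '100')]
```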
References:
http://redis.readthedocs.org/en/latest/topic/persistence.html (Redis persistence documentation)
http://oldblog.antirez.com/post/redis-persistence-demystified.html (Redis persistence demystified)