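Preface
The repost I promised last Sunday got stood up yet again, so this Saturday's original post carries on with redis.
The three topics in this series (adlist, ziplist, and quicklist) all grow from the same root: the linear list. It is the most basic data structure there is; whatever the language and whatever the university, it is what you learn right after hello world. Redis puts a good deal of work into optimizing linear lists, and some of the ideas are worth understanding.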
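A quick review of linear lists
Linear lists come in two main kinds: sequential lists and linked lists. A sequential list, such as an array, is one large contiguous block of storage. Because elements sit at consecutive positions, addressing an element by index is O(1). Insertion and deletion, however, may require moving every element after the affected position, which is O(n) in the worst case.
In a linked list, each node records its successor. Variants record more as needed; a doubly linked list, for example, also records the predecessor. The price is extra storage in every node.
Given the node at the target position, insertion and deletion are O(1), but locating a single node is O(n). Range lookups can be served by a skip list, which a later post on data structures will cover.
Whatever the data structure, as long as storage is linear it is an extension of these two cases. Different scenarios simply call for a targeted choice and targeted optimization.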
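To make the trade-off concrete, here is a minimal sketch in C. The names `array_insert` and `list_insert_after` are made up for illustration and do not come from redis.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Insert `value` at index `pos` of an array currently holding `len` ints.
 * The caller guarantees there is room for one more element.
 * Everything from `pos` onward is shifted right: O(n) in the worst case. */
static void array_insert(int *arr, size_t len, size_t pos, int value) {
    assert(pos <= len);
    memmove(&arr[pos + 1], &arr[pos], (len - pos) * sizeof(arr[0]));
    arr[pos] = value;
}

typedef struct node {
    int value;
    struct node *next;
} node;

/* Link `fresh` right after `prev`: two pointer writes, O(1).
 * Finding `prev` in the first place is the O(n) part. */
static void list_insert_after(node *prev, node *fresh) {
    fresh->next = prev->next;
    prev->next = fresh;
}
```

The array version pays on every insert, the linked version pays on every lookup; that is the tension the rest of the post keeps coming back to.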
adlist: the doubly linked list
https://github.com/redis/redis/blob/6.0/src/adlist.h
typedef struct listNode {
    struct listNode *prev;
    struct listNode *next;
    void *value;
} listNode;

typedef struct listIter {
    listNode *next;
    int direction;
} listIter;

typedef struct list {
    listNode *head;
    listNode *tail;
    void *(*dup)(void *ptr);
    void (*free)(void *ptr);
    int (*match)(void *ptr, void *key);
    unsigned long len;
} list;
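This defines the node, the iterator, and the list structure.
The list structure holds the head and tail nodes, the dup (copy), free (release), and match (compare) callbacks, and the length.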
https://github.com/redis/redis/blob/6.0/src/adlist.c
Head insertion:
/* Add a new node to the list, to head, containing the specified 'value'
 * pointer as value.
 *
 * On error, NULL is returned and no operation is performed (i.e. the
 * list remains unaltered).
 * On success the 'list' pointer you pass to the function is returned. */
list *listAddNodeHead(list *list, void *value)
{
    listNode *node;

    if ((node = zmalloc(sizeof(*node))) == NULL)
        return NULL;
    node->value = value;
    if (list->len == 0) {
        list->head = list->tail = node;
        node->prev = node->next = NULL;
    } else {
        node->prev = NULL;
        node->next = list->head;
        list->head->prev = node;
        list->head = node;
    }
    list->len++;
    return list;
}
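The logic is simple, ordinary linked-list work: allocate a new node and store the value in it.
If it is the first node, its predecessor and successor are both NULL, and head and tail both point to it.
Otherwise its predecessor is NULL, its successor is the old head, the old head's predecessor becomes the new node, and the new node becomes the head of the list.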
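Tail insertion is the mirror image of the steps above. Here is a self-contained sketch of that mirror case, with `malloc` standing in for redis's `zmalloc` and the hypothetical names `dnode`, `dlist`, and `dlist_add_tail` standing in for the real adlist types; it is an illustration, not the code from adlist.c.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct dnode {
    struct dnode *prev;
    struct dnode *next;
    void *value;
} dnode;

typedef struct dlist {
    dnode *head;
    dnode *tail;
    unsigned long len;
} dlist;

/* Append `value` at the tail. Returns the list, or NULL if the allocation
 * fails, in which case the list is left untouched. */
static dlist *dlist_add_tail(dlist *l, void *value) {
    dnode *n = malloc(sizeof(*n));
    if (n == NULL) return NULL;
    n->value = value;
    n->next = NULL;          /* the new node is always the last one */
    n->prev = l->tail;       /* NULL when the list is empty */
    if (l->len == 0) {
        assert(l->head == NULL && l->tail == NULL);
        l->head = n;         /* first node: head and tail coincide */
    } else {
        l->tail->next = n;   /* old tail now points forward to the new node */
    }
    l->tail = n;
    l->len++;
    return l;
}
```

Swap `head` with `tail` and `prev` with `next` and you are back at head insertion, which is why a doubly linked list can grow at both ends in O(1).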
Search:
/* Search the list for a node matching a given key.
 * The match is performed using the 'match' method
 * set with listSetMatchMethod(). If no 'match' method
 * is set, the 'value' pointer of every node is directly
 * compared with the 'key' pointer.
 *
 * On success the first matching node pointer is returned
 * (search starts from head). If no matching node exists
 * NULL is returned. */
listNode *listSearchKey(list *list, void *key)
{
    listIter iter;
    listNode *node;

    listRewind(list, &iter);
    while((node = listNext(&iter)) != NULL) {
        if (list->match) {
            if (list->match(node->value, key)) {
                return node;
            }
        } else {
            if (key == node->value) {
                return node;
            }
        }
    }
    return NULL;
}
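This is a plain traversal that calls the configured match method on each node. If no match method was set, the node's value pointer is compared directly with the key pointer.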
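The difference between the two comparison modes matters in practice: without a match callback, two equal strings at different addresses do not match. A small sketch, using the hypothetical names `snode`, `search`, and `match_cstring` (none of them from adlist):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct snode {
    struct snode *next;
    void *value;
} snode;

typedef int (*match_fn)(void *value, void *key);

/* Hypothetical match callback: compare values as NUL-terminated strings. */
static int match_cstring(void *value, void *key) {
    return strcmp((const char *)value, (const char *)key) == 0;
}

/* Return the first node matching `key`, or NULL.
 * With match == NULL the comparison falls back to pointer identity. */
static snode *search(snode *head, void *key, match_fn match) {
    for (snode *n = head; n != NULL; n = n->next) {
        if (match != NULL ? match(n->value, key) : n->value == key)
            return n;
    }
    return NULL;
}
```

Searching for a freshly built copy of a stored string finds nothing with `match == NULL`, and finds the node once `match_cstring` is supplied.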
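Rewinding the iterator:
/* Create an iterator in the list private iterator structure */
void listRewind(list *list, listIter *li) {
    li->next = list->head;
    li->direction = AL_START_HEAD;
}

void listRewindTail(list *list, listIter *li) {
    li->next = list->tail;
    li->direction = AL_START_TAIL;
}
Nothing in the list itself gets reversed here. listRewind points the iterator at the head for a forward walk, and listRewindTail points it at the tail for a backward walk. With a doubly linked list, walking in reverse only takes flipping the iterator's direction.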
Getting the next node:
listNode *listNext(listIter *iter)
{
    listNode *current = iter->next;

    if (current != NULL) {
        if (iter->direction == AL_START_HEAD)
            iter->next = current->next;
        else
            iter->next = current->prev;
    }
    return current;
}
The direction stored in the iterator decides whether the next node is the successor or the predecessor.
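Here is a self-contained sketch of the same idea: one iterator type, two directions. The names `inode`, `iter`, `iter_start`, `iter_next`, and `collect` are hypothetical, chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

enum { FROM_HEAD = 0, FROM_TAIL = 1 };

typedef struct inode {
    struct inode *prev;
    struct inode *next;
    int value;
} inode;

typedef struct iter {
    inode *next;
    int direction;
} iter;

/* Point the iterator at `start` and remember which way to walk. */
static void iter_start(iter *it, inode *start, int direction) {
    assert(direction == FROM_HEAD || direction == FROM_TAIL);
    it->next = start;
    it->direction = direction;
}

/* Return the current node and advance. The follow-up node is saved before
 * returning, so the caller may unlink the returned node safely. */
static inode *iter_next(iter *it) {
    inode *current = it->next;
    if (current != NULL)
        it->next = (it->direction == FROM_HEAD) ? current->next : current->prev;
    return current;
}

/* Copy up to `cap` values into `out` in iteration order; returns the count. */
static size_t collect(inode *start, int direction, int *out, size_t cap) {
    iter it;
    inode *n;
    size_t count = 0;
    iter_start(&it, start, direction);
    while (count < cap && (n = iter_next(&it)) != NULL)
        out[count++] = n->value;
    return count;
}
```

Starting from the head with `FROM_HEAD` yields the values in order; starting from the tail with `FROM_TAIL` yields them reversed, with no change to the list.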
Copy:
/* Duplicate the whole list. On out of memory NULL is returned.
 * On success a copy of the original list is returned.
 *
 * The 'Dup' method set with listSetDupMethod() function is used
 * to copy the node value. Otherwise the same pointer value of
 * the original node is used as value of the copied node.
 *
 * The original list both on success or error is never modified. */
list *listDup(list *orig)
{
    list *copy;
    listIter iter;
    listNode *node;

    if ((copy = listCreate()) == NULL)
        return NULL;
    copy->dup = orig->dup;
    copy->free = orig->free;
    copy->match = orig->match;
    listRewind(orig, &iter);
    while((node = listNext(&iter)) != NULL) {
        void *value;

        if (copy->dup) {
            value = copy->dup(node->value);
            if (value == NULL) {
                listRelease(copy);
                return NULL;
            }
        } else
            value = node->value;
        if (listAddNodeTail(copy, value) == NULL) {
            listRelease(copy);
            return NULL;
        }
    }
    return copy;
}
Walk the original list and append a node holding the current value to the tail of the copy.
If a dup method is set, it is used to produce the copied value; otherwise the copy shares the original node's value pointer.
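The practical consequence is shallow versus deep copy. Below is a sketch on a simplified singly linked chain; `cnode`, `copy_nodes`, `free_nodes`, and `dup_cstring` are hypothetical names, and plain `malloc` replaces `zmalloc`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct cnode {
    struct cnode *next;
    void *value;
} cnode;

typedef void *(*dup_fn)(void *value);

/* Hypothetical deep-copy callback for NUL-terminated strings. */
static void *dup_cstring(void *value) {
    size_t n = strlen((const char *)value) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, value, n);
    return copy;
}

/* Free a chain; values are freed too only when the chain owns them. */
static void free_nodes(cnode *head, int free_values) {
    while (head != NULL) {
        cnode *next = head->next;
        if (free_values) free(head->value);
        free(head);
        head = next;
    }
}

/* Copy the chain starting at `orig` into *out. With dup == NULL the copy
 * shares the original value pointers; otherwise each value is duplicated.
 * Returns 0 on success, -1 on allocation failure (nothing is leaked). */
static int copy_nodes(const cnode *orig, dup_fn dup, cnode **out) {
    cnode *head = NULL, **link = &head;
    for (; orig != NULL; orig = orig->next) {
        void *value = orig->value;
        if (dup != NULL && (value = dup(orig->value)) == NULL) goto fail;
        cnode *n = malloc(sizeof(*n));
        if (n == NULL) {
            if (dup != NULL) free(value);
            goto fail;
        }
        n->value = value;
        n->next = NULL;
        *link = n;           /* append at the tail to keep the order */
        link = &n->next;
    }
    *out = head;
    return 0;
fail:
    free_nodes(head, dup != NULL);
    return -1;
}
```

With a shallow copy, changing a value through one list is visible through the other, so whoever frees the values must do it exactly once. A dup callback avoids that at the cost of extra allocations.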
Release:
/* Free the whole list.
 *
 * This function can't fail. */
void listRelease(list *list)
{
    listEmpty(list);
    zfree(list);
}

/* Remove all the elements from the list without destroying the list itself. */
void listEmpty(list *list)
{
    unsigned long len;
    listNode *current, *next;

    current = list->head;
    len = list->len;
    while(len--) {
        next = current->next;
        if (list->free) list->free(current->value);
        zfree(current);
        current = next;
    }
    list->head = list->tail = NULL;
    list->len = 0;
}
Release is simple: free every node first (calling the free method on its value if one is set), then free the list structure itself.
ziplist
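Overview
adlist is fairly simple; it is all basic linked-list work. The downsides of a linked list are just as obvious, though: nodes are allocated one by one, which fragments memory, and every node carries a lot of extra bookkeeping. So for small lists redis uses a different structure, the compressed list (ziplist).
Strictly speaking, it is not a linked list at all. It allocates one large block of memory and stores all of its content back to back inside that block, with no predecessor or successor pointers. Unlike an array, each element may occupy a different amount of memory. It also supports adding and removing at both ends in O(1), although, as the source comment notes, every operation reallocates the block, so the real cost depends on how much memory the ziplist uses.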
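Converting between ziplist and other data structures
In redis versions before 3.2, a newly created list starts out as a ziplist and is converted to the corresponding adlist automatically once it crosses a threshold. (From 3.2 onward, including the 6.0 source linked in this post, lists are stored as a quicklist, and the two list options below were replaced by list-max-ziplist-size.) These are the default thresholds, configurable in redis.conf:
list-max-ziplist-entries 512
list-max-ziplist-value 64
Similar settings exist for other types:
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
These settings mean that a ziplist is converted automatically to an adlist, dict, or similar structure as soon as either of the following holds:
1. The number of entries exceeds the entries limit (512 in the list and hash defaults above, 128 for sorted sets).
2. A single value is longer than the value limit (64 bytes).
The reason is that as a ziplist grows, the realloc triggered by each operation is more likely to cause a memory copy, and each copy moves a larger block of data, so performance drops.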
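The two-condition rule boils down to a single predicate. This is only a sketch of the rule as stated above; `ziplist_limits` and `needs_conversion` are hypothetical names, and the actual checks in redis live in the per-type code rather than in one function like this.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ziplist_limits {
    size_t max_entries;    /* e.g. the value of hash-max-ziplist-entries */
    size_t max_value_len;  /* e.g. the value of hash-max-ziplist-value */
} ziplist_limits;

/* Returns 1 when the compact encoding should be given up: either there are
 * too many entries, or some single value is too long. Sitting exactly on a
 * limit is still fine in this sketch; only exceeding it triggers conversion. */
static int needs_conversion(const ziplist_limits *lim,
                            size_t entries, size_t longest_value) {
    assert(lim != NULL);
    return entries > lim->max_entries || longest_value > lim->max_value_len;
}
```

With the hash defaults (512 and 64), 512 entries of at most 64 bytes stay compact, while a 513th entry or a single 65-byte value tips the structure over.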
How it works, and the key functions
https://github.com/redis/redis/blob/6.0/src/ziplist.h
https://github.com/redis/redis/blob/6.0/src/ziplist.c
The comment at the top of ziplist.c actually explains everything very clearly, and it runs to nearly 200 lines. What follows is mostly a walk through that comment, so I am pasting it here.
/* The ziplist is a specially encoded dually linked list that is designed
 * to be very memory efficient. It stores both strings and integer values,
 * where integers are encoded as actual integers instead of a series of
 * characters. It allows push and pop operations on either side of the list
 * in O(1) time. However, because every operation requires a reallocation of
 * the memory used by the ziplist, the actual complexity is related to the
 * amount of memory used by the ziplist.
 *
 * ----------------------------------------------------------------------------
 *
 * ZIPLIST OVERALL LAYOUT
 * ======================
 *
 * The general layout of the ziplist is as follows:
 *
 * <zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
 *
 * NOTE: all fields are stored in little endian, if not specified otherwise.
 *
 * <uint32_t zlbytes> is an unsigned integer to hold the number of bytes that
 * the ziplist occupies, including the four bytes of the zlbytes field itself.
 * This value needs to be stored to be able to resize the entire structure
 * without the need to traverse it first.
 *
 * <uint32_t zltail> is the offset to the last entry in the list. This allows
 * a pop operation on the far side of the list without the need for full
 * traversal.
 *
 * <uint16_t zllen> is the number of entries. When there are more than
 * 2^16-2 entries, this value is set to 2^16-1 and we need to traverse the
 * entire list to know how many items it holds.
 *
 * <uint8_t zlend> is a special entry representing the end of the ziplist.
 * Is encoded as a single byte equal to 255. No other normal entry starts
 * with a byte set to the value of 255.
 *
 * ZIPLIST ENTRIES
 * ===============
 *
 * Every entry in the ziplist is prefixed by metadata that contains two pieces
 * of information. First, the length of the previous entry is stored to be
 * able to traverse the list from back to front. Second, the entry encoding is
 * provided. It represents the entry type, integer or string, and in the case
 * of strings it also represents the length of the string payload.
 * So a complete entry is stored like this:
 *
 * <prevlen> <encoding> <entry-data>
 *
 * Sometimes the encoding represents the entry itself, like for small integers
 * as we'll see later. In such a case the <entry-data> part is missing, and we
 * could have just:
 *
 * <prevlen> <encoding>
 *
 * The length of the previous entry, <prevlen>, is encoded in the following way:
 * If this length is smaller than 254 bytes, it will only consume a single
 * byte representing the length as an unsinged 8 bit integer. When the length
 * is greater than or equal to 254, it will consume 5 bytes. The first byte is
 * set to 254 (FE) to indicate a larger value is following. The remaining 4
 * bytes take the length of the previous entry as value.
 *
 * So practically an entry is encoded in the following way:
 *
 * <prevlen from 0 to 253> <encoding> <entry>
 *
 * Or alternatively if the previous entry length is greater than 253 bytes
 * the following encoding is used:
 *
 * 0xFE <4 bytes unsigned little endian prevlen> <encoding> <entry>
 *
 * The encoding field of the entry depends on the content of the
 * entry. When the entry is a string, the first 2 bits of the encoding first
 * byte will hold the type of encoding used to store the length of the string,
 * followed by the actual length of the string. When the entry is an integer
 * the first 2 bits are both set to 1. The following 2 bits are used to specify
 * what kind of integer will be stored after this header. An overview of the
 * different types and encodings is as follows. The first byte is always enough
 * to determine the kind of entry.
 *
 * |00pppppp| - 1 byte
 *      String value with length less than or equal to 63 bytes (6 bits).
 *      "pppppp" represents the unsigned 6 bit length.
 * |01pppppp|qqqqqqqq| - 2 bytes
 *      String value with length less than or equal to 16383 bytes (14 bits).
 *      IMPORTANT: The 14 bit number is stored in big endian.
 * |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
 *      String value with length greater than or equal to 16384 bytes.
 *      Only the 4 bytes following the first byte represents the length
 *      up to 2^32-1. The 6 lower bits of the first byte are not used and
 *      are set to zero.
 *      IMPORTANT: The 32 bit number is stored in big endian.
 * |11000000| - 3 bytes
 *      Integer encoded as int16_t (2 bytes).
 * |11010000| - 5 bytes
 *      Integer encoded as int32_t (4 bytes).
 * |11100000| - 9 bytes
 *      Integer encoded as int64_t (8 bytes).
 * |11110000| - 4 bytes
 *      Integer encoded as 24 bit signed (3 bytes).
 * |11111110| - 2 bytes
 *      Integer encoded as 8 bit signed (1 byte).
 * |1111xxxx| - (with xxxx between 0001 and 1101) immediate 4 bit integer.
 *      Unsigned integer from 0 to 12. The encoded value is actually from
 *      1 to 13 because 0000 and 1111 can not be used, so 1 should be
 *      subtracted from the encoded 4 bit value to obtain the right value.
 * |11111111| - End of ziplist special entry.
 *
 * Like for the ziplist header, all the integers are represented in little
 * endian byte order, even when this code is compiled in big endian systems.
 *
 * EXAMPLES OF ACTUAL ZIPLISTS
 * ===========================
 *
 * The following is a ziplist containing the two elements representing
 * the strings "2" and "5". It is composed of 15 bytes, that we visually
 * split into sections:
 *
 *  [0f 00 00 00] [0c 00 00 00] [02 00] [00 f3] [02 f6] [ff]
 *        |             |          |       |       |     |
 *     zlbytes        zltail    entries   "2"     "5"   end
 *
 * The first 4 bytes represent the number 15, that is the number of bytes
 * the whole ziplist is composed of. The second 4 bytes are the offset
 * at which the last ziplist entry is found, that is 12, in fact the
 * last entry, that is "5", is at offset 12 inside the ziplist.
 * The next 16 bit integer represents the number of elements inside the
 * ziplist, its value is 2 since there are just two elements inside.
 * Finally "00 f3" is the first entry representing the number 2. It is
 * composed of the previous entry length, which is zero because this is
 * our first entry, and the byte F3 which corresponds to the encoding
 * |1111xxxx| with xxxx between 0001 and 1101. We need to remove the "F"
 * higher order bits 1111, and subtract 1 from the "3", so the entry value
 * is "2". The next entry has a prevlen of 02, since the first entry is
 * composed of exactly two bytes. The entry itself, F6, is encoded exactly
 * like the first entry, and 6-1 = 5, so the value of the entry is 5.
 * Finally the special entry FF signals the end of the ziplist.
 *
 * Adding another element to the above string with the value "Hello World"
 * allows us to show how the ziplist encodes small strings. We'll just show
 * the hex dump of the entry itself. Imagine the bytes as following the
 * entry that stores "5" in the ziplist above:
 *
 * [02] [0b] [48 65 6c 6c 6f 20 57 6f 72 6c 64]
 *
 * The first byte, 02, is the length of the previous entry. The next
 * byte represents the encoding in the pattern |00pppppp| that means
 * that the entry is a string of length <len>, so 0B means that
 * an 11 bytes string follows. From the third byte (48) to the last (64)
 * there are just the ASCII characters for "Hello World".
 *
 * ----------------------------------------------------------------------------
 *
 * Copyright (c) 2009-2012, Pieter Noordhuis
 * Copyright (c) 2009-2017, Salvatore Sanfilippo
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 *   * Redistributions of source code must retain the above copyright notice,
 *     this list of conditions and the following disclaimer.
 *   * Redistributions in binary form must reproduce the above copyright
 *     notice, this list of conditions and the following disclaimer in the
 *     documentation and/or other materials provided with the distribution.
 *   * Neither the name of Redis nor the names of its contributors may be used
 *     to endorse or promote products derived from this software without
 *     specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */
Line 16 of the comment:
* <zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
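describes how a ziplist is laid out at the top level: one contiguous region of memory.
zlbytes: the total number of bytes the ziplist occupies, including the zlbytes field itself; 32 bits.
zltail: the offset in bytes from the start of the ziplist to its last entry, so the tail can be reached without traversal; 32 bits.
zllen: the number of entries; 16 bits. With more than 2^16-2 entries it is pinned at 2^16-1, and the list has to be traversed to count them.
entry: each entry is one element.
zlend: the fixed end marker, equal to 255 (0xFF); 8 bits.
Unless stated otherwise, every field is stored in little endian.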
Line 47 of the comment gives the structure of an entry:
* <prevlen> <encoding> <entry-data>
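Every entry is made of these three parts:
entry-data: the stored data itself.
prevlen: the length of the previous entry. To reach the previous entry you simply step back by prevlen bytes. This field is variable-length.
encoding: how this entry's data is encoded, that is, its type (integer or string) and, for strings, the length of the payload.
The variable-length rule for prevlen is:
If the previous entry is shorter than 254 bytes, prevlen is a single byte holding the length as an unsigned 8-bit integer.
If the length is 254 or more, prevlen takes 5 bytes. The first byte is set to 254 (0xFE) to signal that a larger value follows, and the remaining 4 bytes hold the length. The overall layout is: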
0xFE <4 bytes unsigned little endian prevlen> <encoding> <entry>
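The prevlen rule fits in a few lines of code. Below is a sketch of an encoder and decoder that follow the description above; `BIG_PREVLEN`, `prevlen_encode`, and `prevlen_decode` are names invented for this example and are not the functions used in ziplist.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIG_PREVLEN 254  /* marker byte 0xFE: a 4-byte length follows */

/* Write `len` at `p` in prevlen format. Returns the bytes used: 1 or 5. */
static size_t prevlen_encode(unsigned char *p, uint32_t len) {
    assert(p != NULL);
    if (len < BIG_PREVLEN) {
        p[0] = (unsigned char)len;
        return 1;
    }
    p[0] = BIG_PREVLEN;
    for (int i = 0; i < 4; i++)          /* little endian: low byte first */
        p[1 + i] = (unsigned char)(len >> (8 * i));
    return 5;
}

/* Read a prevlen field at `p` into *len. Returns the bytes consumed. */
static size_t prevlen_decode(const unsigned char *p, uint32_t *len) {
    assert(p != NULL && len != NULL);
    if (p[0] < BIG_PREVLEN) {
        *len = p[0];
        return 1;
    }
    *len = (uint32_t)p[1] | ((uint32_t)p[2] << 8) |
           ((uint32_t)p[3] << 16) | ((uint32_t)p[4] << 24);
    return 5;
}
```

A length of 253 still fits in one byte, while 254 already needs the five-byte form `fe fe 00 00 00`. A side effect worth keeping in mind is that when an entry grows past 253 bytes, the prevlen field of the entry after it has to grow from 1 byte to 5.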
The encoding part:
Depending on the content of the entry there are quite a few cases (excerpted from lines 80-106 of the comment):
 * |00pppppp| - 1 byte
 *      String value with length less than or equal to 63 bytes (6 bits).
 *      "pppppp" represents the unsigned 6 bit length.
 * |01pppppp|qqqqqqqq| - 2 bytes
 *      String value with length less than or equal to 16383 bytes (14 bits).
 *      IMPORTANT: The 14 bit number is stored in big endian.
 * |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
 *      String value with length greater than or equal to 16384 bytes.
 *      Only the 4 bytes following the first byte represents the length
 *      up to 2^32-1. The 6 lower bits of the first byte are not used and
 *      are set to zero.
 *      IMPORTANT: The 32 bit number is stored in big endian.
 * |11000000| - 3 bytes
 *      Integer encoded as int16_t (2 bytes).
 * |11010000| - 5 bytes
 *      Integer encoded as int32_t (4 bytes).
 * |11100000| - 9 bytes
 *      Integer encoded as int64_t (8 bytes).
 * |11110000| - 4 bytes
 *      Integer encoded as 24 bit signed (3 bytes).
 * |11111110| - 2 bytes
 *      Integer encoded as 8 bit signed (1 byte).
 * |1111xxxx| - (with xxxx between 0001 and 1101) immediate 4 bit integer.
 *      Unsigned integer from 0 to 12. The encoded value is actually from
 *      1 to 13 because 0000 and 1111 can not be used, so 1 should be
 *      subtracted from the encoded 4 bit value to obtain the right value.
 * |11111111| - End of ziplist special entry.
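The table reads more easily once it is turned into a decoder. The sketch below classifies an encoding field from its first byte and reports how many bytes the encoding and the payload occupy. The types and names (`entry_kind`, `entry_info`, `classify`, `immediate_value`) are hypothetical, written only to mirror the table, and are not how ziplist.c structures its macros.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { KIND_STRING, KIND_INT, KIND_END, KIND_UNKNOWN } entry_kind;

typedef struct entry_info {
    entry_kind kind;
    uint32_t header_len;   /* bytes taken by the encoding field itself */
    uint32_t payload_len;  /* bytes of entry-data following the encoding */
} entry_info;

/* Classify the encoding field that starts at `p`. */
static entry_info classify(const unsigned char *p) {
    entry_info e = { KIND_UNKNOWN, 1, 0 };
    unsigned char b = p[0];
    switch (b >> 6) {
    case 0:  /* |00pppppp|: 6-bit string length */
        e.kind = KIND_STRING;
        e.payload_len = b & 0x3F;
        return e;
    case 1:  /* |01pppppp|qqqqqqqq|: 14-bit length, big endian */
        e.kind = KIND_STRING;
        e.header_len = 2;
        e.payload_len = ((uint32_t)(b & 0x3F) << 8) | p[1];
        return e;
    case 2:  /* |10000000| + 4 bytes: 32-bit length, big endian */
        e.kind = KIND_STRING;
        e.header_len = 5;
        e.payload_len = ((uint32_t)p[1] << 24) | ((uint32_t)p[2] << 16) |
                        ((uint32_t)p[3] << 8) | p[4];
        return e;
    }
    /* From here on the top two bits are 11. */
    if (b == 0xFF) { e.kind = KIND_END; return e; }
    e.kind = KIND_INT;
    switch (b) {
    case 0xC0: e.payload_len = 2; break;  /* int16_t */
    case 0xD0: e.payload_len = 4; break;  /* int32_t */
    case 0xE0: e.payload_len = 8; break;  /* int64_t */
    case 0xF0: e.payload_len = 3; break;  /* 24-bit signed */
    case 0xFE: e.payload_len = 1; break;  /* 8-bit signed */
    default:
        /* 0xF1..0xFD: the value lives in the encoding byte, no payload. */
        if (!(b >= 0xF1 && b <= 0xFD)) e.kind = KIND_UNKNOWN;
    }
    return e;
}

/* Value of an immediate 4-bit integer (encoding byte 0xF1..0xFD). */
static int immediate_value(unsigned char b) {
    assert(b >= 0xF1 && b <= 0xFD);
    return (b & 0x0F) - 1;
}
```

Run against the bytes from the source comment, `0b` is a string with an 11-byte payload ("Hello World"), and `f3` and `f6` are immediate integers with values 2 and 5.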
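Thoughts
That is the data structure definition in its complete form. Once you understand it, the insert, delete, update, and lookup operations that come later are very easy to follow.
Having gritted my teeth through the whole comment, I can only say the design is rather clever. At heart, a ziplist is a contiguous block of memory; because entries are variable-length, you cannot fetch a value directly by index the way you can with an array.
Redis's answer is the prevlen plus encoding scheme, which lets you move freely forward and backward through memory. That is exactly the same idea as predecessor and successor pointers! The only difference is that adlist stores pointers while ziplist stores entry lengths. Different roads, same destination.
This approach solves the linked list's problem of memory scattered all over the place, and it cuts some unnecessary overhead too. The biggest drawback of contiguous memory, of course, is that adding or removing content is O(n), because everything after the change has to be moved. Since ziplist is only used at small sizes, that is acceptable.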
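To see prevlen acting as a "previous pointer", here is a sketch that walks a ziplist from tail to head. It is deliberately limited: it only understands entries holding immediate 4-bit integers, such as the two entries in the example from the source comment. `HEADER_SIZE`, `le32`, and `walk_backwards` are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 10  /* zlbytes (4) + zltail (4) + zllen (2) */

static uint32_t le32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Visit entries from the last to the first, storing each value in `out`
 * (up to `cap` values). Returns the number of entries visited.
 * Sketch only: every entry must be an immediate 4-bit integer. */
static size_t walk_backwards(const unsigned char *zl, int *out, size_t cap) {
    size_t count = 0;
    const unsigned char *p = zl + le32(zl + 4);  /* zltail: jump to the tail */
    if (*p == 0xFF) return 0;                    /* empty list: tail is zlend */
    for (;;) {
        uint32_t prevlen;
        size_t prevlen_size;
        if (p[0] < 254) { prevlen = p[0]; prevlen_size = 1; }
        else { prevlen = le32(p + 1); prevlen_size = 5; }
        unsigned char enc = p[prevlen_size];
        assert(enc >= 0xF1 && enc <= 0xFD);      /* immediate ints only */
        if (count < cap) out[count] = (enc & 0x0F) - 1;
        count++;
        if (p == zl + HEADER_SIZE) break;        /* reached the first entry */
        p -= prevlen;                            /* the "prev pointer" step */
    }
    return count;
}
```

On the example bytes, the walk starts at offset 12, reads the value 5, steps back 2 bytes to offset 10, and reads the value 2. The statement `p -= prevlen` plays the role that `node = node->prev` plays in adlist.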
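Some important interfaces, and quicklist
To be continued in the next post.
Summary
The implementations of some important ziplist interfaces, along with quicklist, the product of combining adlist and ziplist, will be covered in the next post for reasons of length. (It's late, and the library is closing and chasing people out, haha.)
Many data structures in redis follow this pattern, not just linear lists: use one structure when the data is small and another when it is large, and package each as a separate storage tool. We can borrow the same idea in everyday development. Every structure has its strengths, weaknesses, and suitable scenarios; apply them flexibly according to what the project actually needs.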

WeChat ID | xiaoxingblog
Follow Xiao Xing's tech blog!