How Roaring Bitmaps Work

License: CC 4.0 BY-SA

This is an original article by the blogger and may not be reposted without the blogger's permission.

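Roaring bitmaps are a compressed bitmap technique for storing and manipulating sets of integers efficiently. They bucket 32-bit integers by their high 16 bits and store each bucket in a container whose type (such as ArrayContainer or BitmapContainer) depends on how many values it holds, which avoids the space waste of traditional bitmaps. Roaring bitmaps support fast intersection, union, and membership checks; the algorithm used depends on the container types involved, which keeps both space usage and computation time low.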

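1. What are bitmaps for?

A bitmap is an array of bits, stored as Array[Byte], that represents a set of integers, Set[Integer]. It records an integer n by setting bit n of the array to 1.
Because of this representation, a bitmap can compute the intersection or union of two integer sets very quickly using the CPU's bitwise AND and bitwise OR instructions. One 64-bit instruction handles 64 candidate integers at once, so combining two bitmaps of n bits takes about n/64 word operations (linear in the bitmap length, but with a very small constant).
Suppose there are 1 billion documents, numbered 1 to 1 billion. How do we find the documents that contain both the word carrier and the word pigeon?
Represent the IDs of documents containing carrier as a bitmap arr1: Array[Byte], and the IDs of documents containing pigeon as a bitmap arr2: Array[Byte]. The set of documents containing both words is the bitwise AND of these two bit arrays.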
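The carrier/pigeon example can be sketched as follows. This is a minimal illustration, assuming a toy universe of 16 document IDs instead of 1 billion; the helper names `make_bitmap`, `intersect`, and `members` and the document IDs are made up for the example.

```python
def make_bitmap(ids, universe_size):
    """Return a bytearray with bit n set for every n in ids."""
    bits = bytearray((universe_size + 7) // 8)
    for n in ids:
        bits[n >> 3] |= 1 << (n & 7)
    return bits


def intersect(a, b):
    """Bitwise AND of two equal-length bitmaps.

    Python works byte by byte here; a native implementation would
    AND whole 64-bit words at a time.
    """
    return bytearray(x & y for x, y in zip(a, b))


def members(bits):
    """List the integers whose bit is set, in ascending order."""
    return [i for i in range(len(bits) * 8) if bits[i >> 3] >> (i & 7) & 1]


# Hypothetical document IDs, chosen only for illustration.
carrier = make_bitmap([1, 5, 9, 12], 16)   # documents containing "carrier"
pigeon = make_bitmap([5, 7, 12, 15], 16)   # documents containing "pigeon"

both = members(intersect(carrier, pigeon))
print(both)  # [5, 12]
```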
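Plain bitmaps have one drawback: when the largest integer in the set is very large but the set contains only a few elements, they waste an enormous amount of space.
For example, the set [1, 1000000000] contains only 2 integers, yet representing it takes 1 billion bits.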
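To put a number on that waste, here is a small back-of-the-envelope calculation. It assumes the bitmap needs one bit for every value from 0 up to the maximum, and compares that with simply storing the two values as 32-bit integers; the variable names are illustrative.

```python
max_value = 1_000_000_000

# One bit per possible value 0..max_value, rounded up to whole bytes.
bitmap_bytes = (max_value + 1 + 7) // 8

# Storing the two integers directly as 32-bit values.
plain_bytes = 2 * 4

print(bitmap_bytes)                 # 125000001 bytes, roughly 125 MB
print(plain_bytes)                  # 8 bytes
print(bitmap_bytes // plain_bytes)  # 15625000 times larger
```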

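2. What are Roaring bitmaps for?

Roaring bitmaps build on traditional bitmaps and use compression to solve the sparse-set problem. Specifically, a Roaring bitmap partitions a set of 32-bit integers into buckets (containers) by their high 16 bits, giving at most $2^{16}=65536$ containers. To store an integer, look up the container for its high 16 bits (creating a new one if none exists), then put the integer's low 16 bits into that container. There are two common kinds of container: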
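The bucketing step can be sketched like this. It is only an illustration of the high/low split: a Python `set` stands in for a real container, and the function names `split` and `add` are assumptions, not part of any library.

```python
def split(n):
    """Split a 32-bit unsigned integer into (high 16 bits, low 16 bits)."""
    if not 0 <= n < 2**32:
        raise ValueError("expected a 32-bit unsigned integer")
    return n >> 16, n & 0xFFFF


def add(containers, n):
    """Store n: find (or create) the container for its high bits, keep the low bits."""
    high, low = split(n)
    bucket = containers.setdefault(high, set())  # stand-in for a real container
    bucket.add(low)


containers = {}
for n in (1, 65536, 65537, 1_000_000_000):
    add(containers, n)

print(containers)  # {0: {1}, 1: {0, 1}, 15258: {51712}}
```

With this layout the set [1, 1000000000] needs just two small containers instead of one bit per possible value.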

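ArrayContainer
Used when the container holds at most 4096 values. It is essentially a sorted array of unsigned short values (exactly 16 bits each): Array[Short]. The array starts with a length of 4 and grows automatically as values are added, up to a maximum length of 4096, so an ArrayContainer occupies between 4 * 2B = 8B initially and 4096 * 2B = 8KB at most. It also maintains a counter that tracks the cardinality in real time.
BitmapContainer
Used when the container holds more than 4096 values. It is essentially a traditional bitmap with a fixed length of $2^{16}$ bits (8KB), which can represent $2^{16}$ distinct integers. Physically it is an array of unsigned long values (64 bits, 8B each) with a fixed length of 1024: Array[Long] (size=1024), so these bitmaps always take 8KB. It also has a cardinality counter. The 4096 threshold is the break-even point: 4096 values of 2B each also take 8KB, so above that size the bitmap is never larger than the array would be.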
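The two container types and the switch between them can be sketched as follows. This is a simplified model, not the code of any real library: the class names mirror the ones in the text, but the module-level `add` function and the constants `ARRAY_MAX` and `WORDS` are assumptions made for the example, and a Python list replaces the growable 16-bit array.

```python
from bisect import bisect_left

ARRAY_MAX = 4096   # threshold taken from the text
WORDS = 1024       # 1024 words of 64 bits = 65536 bits = 8KB


class ArrayContainer:
    def __init__(self):
        self.values = []  # sorted 16-bit values; len() is the cardinality

    def add(self, low):
        i = bisect_left(self.values, low)
        if i == len(self.values) or self.values[i] != low:
            self.values.insert(i, low)  # keep the array sorted, skip duplicates


class BitmapContainer:
    def __init__(self):
        self.words = [0] * WORDS
        self.cardinality = 0  # counter maintained on every insert

    def add(self, low):
        mask = 1 << (low & 63)
        if not self.words[low >> 6] & mask:
            self.words[low >> 6] |= mask
            self.cardinality += 1


def add(container, low):
    """Add low and return the container, converted to a bitmap if it outgrew the array."""
    container.add(low)
    if isinstance(container, ArrayContainer) and len(container.values) > ARRAY_MAX:
        bitmap = BitmapContainer()
        for v in container.values:
            bitmap.add(v)
        return bitmap
    return container


c = ArrayContainer()
for v in range(4096):
    c = add(c, v)
kind_at_4096 = type(c).__name__   # still an ArrayContainer at exactly 4096 values
c = add(c, 4096)
kind_at_4097 = type(c).__name__   # converted once the 4097th value arrives
print(kind_at_4096, kind_at_4097)
```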

3. How do Roaring bitmaps compute exist, union, and intersect?

Checking whether an integer N is in the set
To check if an integer N exists, get N’s 16 most significant bits (N / 2^16) and use it to find N’s corresponding container in the Roaring bitmap.

If the container doesn’t exist, then N is not in the Roaring bitmap.

Checking for existence in array and bitmap containers works differently:

Bitmap: check if the bit at N % 2^16 is set.
Array: use binary search to find N % 2^16 in the sorted array.
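The existence check above can be sketched as follows. The representation is an assumption chosen for brevity: a Roaring bitmap is modeled as a dict from the high 16 bits to a `(kind, data)` tuple, where `data` is either a sorted list (array container) or a list of 1024 64-bit words (bitmap container). The function name `contains` is illustrative.

```python
from bisect import bisect_left


def contains(roaring, n):
    """Return True if the 32-bit integer n is in the Roaring bitmap."""
    high, low = n >> 16, n & 0xFFFF          # N / 2^16 and N % 2^16
    container = roaring.get(high)
    if container is None:
        return False                          # no container, so N is absent
    kind, data = container
    if kind == "bitmap":
        return bool(data[low >> 6] >> (low & 63) & 1)   # test one bit
    i = bisect_left(data, low)                # binary search in the sorted array
    return i < len(data) and data[i] == low


# Bitmap container holding every even low value (32768 entries).
words = [0] * 1024
for v in range(0, 65536, 2):
    words[v >> 6] |= 1 << (v & 63)

roaring = {0: ("array", [1, 5, 9]), 3: ("bitmap", words)}

checks = [
    contains(roaring, 5),               # True: found in the array container
    contains(roaring, 6),               # False: not in the array container
    contains(roaring, 3 * 65536 + 10),  # True: bit 10 is set in the bitmap
    contains(roaring, 3 * 65536 + 11),  # False: bit 11 is clear
    contains(roaring, 2 * 65536),       # False: no container for high bits 2
]
print(checks)
```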

Computing the intersection
To intersect Roaring bitmaps A and B, it is sufficient to intersect matching containers in A and B.

This is possible because of how integers are partitioned in Roaring bitmaps: matching containers in A and B store integers with the same 16 most significant bits (the same chunks).

Intersection algorithms vary by the types of the containers involved, as do the resulting container types:

Bitmap / Bitmap: Compute the bitwise AND of the two bitmaps. If the cardinality is <= 4,096, store the result in an array container, otherwise store it in a bitmap container.
Bitmap / Array: Iterate over the array, checking for the existence of each 16-bit integer in the bitmap. If the integer exists, add it to the resulting array container – note that intersections of bitmap and array container types will always create an array container.
Array / Array: Intersections of two array containers always create a new array container. The algorithm used to compute the intersection varies by a cardinality heuristic described at the bottom of page 5 here. It will either be a simple merge (as used in merge sort) or a galloping intersection, described in this paper.
If there is a container in either Roaring bitmap without a corresponding container in the other, it will not exist in the result: the intersection of an empty set and any set is an empty set.

Computing the union
To union Roaring bitmaps A and B, union all matching containers in A and B.

Union algorithms vary by the container types involved, as do the resulting container types:

Bitmap / Bitmap: Compute the bitwise OR of the two bitmaps. Unions of two bitmap containers will always create another bitmap container.
Bitmap / Array: Copy the bitmap and set corresponding bits for all the integers in the array container. Unions of a bitmap and array container will always create another bitmap container.
Array / Array: If the sum of the cardinalities of the two array containers is <= 4,096, the resulting container will be an array container. In this case, add all integers from both arrays to a new array container. Otherwise, optimistically assume the resulting container will be a bitmap: create a new bitmap container and set all corresponding bits for all integers in both arrays. If the cardinality of the resulting container is <= 4,096, convert the bitmap container back into an array container.
Finally, add all containers in A and B that do not have a matching container to the result. Remember: this is a union, so all integers in Roaring bitmaps A and B must be in the resulting set.
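The union rules above can be sketched in the same style. As before, the dict-of-tuples representation and the function names are assumptions for illustration. For small array/array unions the sketch uses Python sets and a sort as a shortcut where a real implementation would merge the two sorted arrays, and unmatched containers are reused without copying.

```python
ARRAY_MAX = 4096


def set_bits(words, values):
    for v in values:
        words[v >> 6] |= 1 << (v & 63)


def bitmap_values(words):
    """List the set bits of a bitmap container in ascending order."""
    return [w * 64 + b for w, word in enumerate(words)
            for b in range(64) if word >> b & 1]


def union_containers(a, b):
    (ka, da), (kb, db) = a, b
    if ka == "bitmap" and kb == "bitmap":
        return ("bitmap", [x | y for x, y in zip(da, db)])   # bitwise OR
    if ka == "bitmap" or kb == "bitmap":
        # Copy the bitmap, then set a bit for each value in the array.
        words, values = (list(da), db) if ka == "bitmap" else (list(db), da)
        set_bits(words, values)
        return ("bitmap", words)
    if len(da) + len(db) <= ARRAY_MAX:
        # Shortcut for the sketch; a real implementation merges sorted arrays.
        return ("array", sorted(set(da) | set(db)))
    # Optimistically build a bitmap, then convert back if it turned out small.
    words = [0] * 1024
    set_bits(words, da)
    set_bits(words, db)
    if sum(bin(w).count("1") for w in words) <= ARRAY_MAX:
        return ("array", bitmap_values(words))
    return ("bitmap", words)


def union(A, B):
    result = {}
    for high in A.keys() | B.keys():
        if high in A and high in B:
            result[high] = union_containers(A[high], B[high])
        else:
            # Unmatched container: carried over as-is (shared, not copied).
            result[high] = A[high] if high in A else B[high]
    return result


A = {0: ("array", [1, 5]),
     1: ("array", list(range(0, 3000))),
     2: ("array", list(range(0, 3000)))}
B = {0: ("array", [5, 9]),
     1: ("array", list(range(2000, 5000))),
     2: ("array", list(range(1000, 4000))),
     7: ("array", [42])}

u = union(A, B)
print(sorted(u), u[1][0], u[2][0])  # [0, 1, 2, 7] bitmap array
```

Keys 1 and 2 both take the optimistic bitmap path because their combined sizes exceed 4096, but only key 1 stays a bitmap (5000 distinct values); key 2 ends up with 4000 distinct values and is converted back to an array.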