Roaring Bitmap 原理及实践

最新推荐文章于 2025-06-04 19:28:24 发布

原创最新推荐文章于 2025-06-04 19:28:24 发布 · 置顶 · 1.9k 阅读

4 ·

CC 4.0 BY-SA版权

文章标签：

#java

大数据基础知识专栏收录该内容

13 篇文章

订阅专栏

本文介绍了RoaringBitmap库，一种针对整数集合的高效压缩数据结构，通过集合运算优化和空间管理减少存储需求，尤其适合大数据场景。示例展示了如何使用RoaringBitmap进行交并差操作，并讨论了与传统Set的对比及其在存储节省上的优势。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

Roaring Bitmap 介绍

存储整数集（类似于存储整数的 Set），并支持整数集的集合运算（交并差等）。之所以在一些场景下使用 Roaring Bitmap，而不是 Set，是因为其节省存储空间以及集合运算效率更优。

使用示例

<dependencies>
    <dependency>
        <groupId>org.roaringbitmap</groupId>
        <artifactId>RoaringBitmap</artifactId>
        <version>0.9.0</version>
    </dependency>
</dependencies>

package roaringbitmap;

import org.roaringbitmap.RoaringBitmap;
import org.roaringbitmap.longlong.Roaring64Bitmap;

import java.util.Iterator;

public class Test {
    public static void main(String[] args) {
        RoaringBitmap r1 = RoaringBitmap.bitmapOf(15, 0, Integer.MIN_VALUE, Integer.MAX_VALUE, -1, -8);
        print(r1, "r1");

        RoaringBitmap r2 = new RoaringBitmap();
        r2.add(-1);
        r2.add(0);
        r2.add(1);
        print(r2, "r2");

        RoaringBitmap intersect = RoaringBitmap.and(r1, r2);
        print(intersect, "r1-and-r2");

        System.out.println("=================");

        Roaring64Bitmap r64 = Roaring64Bitmap.bitmapOf(1, 2, Long.MAX_VALUE, Long.MIN_VALUE, -1, -3);
        System.out.println("RBM[r64] cardinality: " + r64.getLongCardinality());
        Iterator<Long> iterator = r64.iterator();
        while (iterator.hasNext()) {
            System.out.print(iterator.next() + "\t");
        }
        System.out.println();

    }

    public static void print(RoaringBitmap r, String name) {
        System.out.println("RBM[" + name + "] cardinality: " + r.getLongCardinality());
        for(int i: r) {
            System.out.print(i + "\t");
        }
        System.out.println();
    }


}
/*
RBM[r1] cardinality: 6 
0        15        2147483647        -2147483648        -8        -1         
RBM[r2] cardinality: 3 
0        1        -1         
RBM[r1-and-r2] cardinality: 2 
0        -1         
================= 
RBM[r64] cardinality: 6 
1        2        9223372036854775807        -9223372036854775808        -3        -1    
*/

Bitmap/Bitset

问：1 亿个设备 id（imeimd5, 长度 32 字符串, 512 bit）需要多少存储空间？
答：1 亿 * 64 Byte = 64 亿 Byte = 6.4 G

但是，如果一个设备 id 使用 1 bit 存储，需要存储空间是 6.4G / 512 = 12.5 M

set = {1, 2, 15}
在这里插入图片描述

package roaringbitmap;

public class Test {
    public static void main(String[] args) {
        byte[] bits = new byte[2];

        add(bits, 1);
        add(bits, 2);
        System.out.println(Integer.toBinaryString(bits[0]));    // 110
    }

    public static void add(byte[] bits, int num){
        byte pre = bits[num / 8];
        byte newVal = (byte) (pre | (1 << (num % 8)));
        bits[num / 8] = newVal;
    }
}

缺点：稀疏数据存储依旧浪费空间，举一个极端例子，用位图只存储数字 0 和 1599（8*200 - 1），至少需要开辟长度为 200 的 byte 数组，而普通 set 存储 0 和 1599 两个 Integer 只需要 8 字节！

Roaring Bitmap 理论

We partition the range of 32-bit indexes ( [0,n) ) into chunks of 2^16 integers sharing the same 16 most significant digits. We use specialized containers to store their 16 least significant bits.
When a chunk contains no more than 4096 integers, we use a sorted array of packed 16-bit integers. When there are more than 4096 integers, we use a 2^16 -bit bitmap. Thus, we have two types of containers: an array container for sparse chunks and a bitmap container for dense chunks.

在这里插入图片描述

根据实际数据状况，桶动态生成（而不是一开始就建立所有桶，避免不必要的空间浪费）
根据实际数据状况，若桶中数据少（稀疏数据），使用 array 存；若桶中数据多，使用 bitmap 存
为了设计及实现简单，bitmap 大小固定为 2^16 位（8KB）。4096 个 Integer（低 16 位）整好占 8KB 存储，为了节省空间，桶中数据少于 4096 个数时，使用 array，否则使用 bitmap。

Our new addition, the run container, is made of a packed array of pairs of 16-bit integers. The first value of each pair represents a starting value, whereas the second value is the length of a run. For example, we would store the values 11,12,13,14,15 as the pair 11,4 where 4 means that beyond 11 itself,there are 4 contiguous values that follow.

Roaring Bitmap 源码

/**
 * RoaringBitmap, a compressed alternative to the BitSet.
 *
 * <pre>
 * {@code
 *      import org.roaringbitmap.*;
 *
 *      //...
 *
 *      RoaringBitmap rr = RoaringBitmap.bitmapOf(1,2,3,1000);
 *      RoaringBitmap rr2 = new RoaringBitmap();
 *      for(int k = 4000; k<4255;++k) rr2.add(k);
 *      RoaringBitmap rror = RoaringBitmap.or(rr, rr2);
 *
 *      //...
 *      DataOutputStream wheretoserialize = ...
 *      rr.runOptimize(); // can help compression
 *      rr.serialize(wheretoserialize);
 * }
 * </pre>
 *
 * Integers are added in unsigned sorted order. That is, they are treated as unsigned integers (see
 * Java 8's Integer.toUnsignedLong function).
 * Up to 4294967296 integers
 * can be stored.
 *
 *
 *
 */
 
 
 /**
 * Roaring64Bitmap is a compressed 64 bit bitmap. It can contain all the numbers of long
 * rang[Long.MAX_VALUE, Long.MIN_VALUE]. Since java has no unsigned long,we treat the negative value
 * as a successor of the positive value. So the ascending ordering of all the long value is:
 * 0,1,2...Long.MAX_VALUE,Long.MIN_VALUE,Long.MIN_VALUE+1.......-1. See Long.toUnsignedString()
 */

关键类图

在这里插入图片描述

部分源码

/**
    RoaringBitmap 初始化
*/

RoaringBitmap r2 = new RoaringBitmap();

// step 1
public RoaringBitmap() {
  highLowContainer = new RoaringArray();
}
// step 2
protected RoaringArray() {
  this(INITIAL_CAPACITY);    // INITIAL_CAPACITY = 4
}
// step 3
RoaringArray(int initialCapacity) {
  this(new char[initialCapacity], new Container[initialCapacity], 0);
}
// step 4
RoaringArray(char[] keys, Container[] values, int size) {
  this.keys = keys;
  this.values = values;
  this.size = size;
}

/**
    org.roaringbitmap.RoaringBitmap#add(int)
*/

public void add(final int x) {
  final char hb = Util.highbits(x);
  final int i = highLowContainer.getIndex(hb);
  if (i >= 0) {
    highLowContainer.setContainerAtIndex(i,
        highLowContainer.getContainerAtIndex(i).add(Util.lowbits(x)));
  } else {
    final ArrayContainer newac = new ArrayContainer();
    highLowContainer.insertNewKeyValueAt(-i - 1, hb, newac.add(Util.lowbits(x)));
  }
}

// （tips）获取 Integer 高 16 位
protected static char highbits(int x) {
  return (char) (x >>> 16);
}
// （tips）获取 Integer 低 16 位
protected static char lowbits(int x) {
  return (char) x;
}

// （tips）查找等于 k 的位置，或第一个大于 k 的位置
// starts with binary search and finishes with a sequential search
protected static int hybridUnsignedBinarySearch(final char[] array, final int begin,
    final int end, final char k) {
  // next line accelerates the possibly common case where the value would
  // be inserted at the end
  if ((end > 0) && ((array[end - 1]) < (int) k)) {
    return -end - 1;
  }
  int low = begin;
  int high = end - 1;
  // 32 in the next line matches the size of a cache line
  while (low + 32 <= high) {
    final int middleIndex = (low + high) >>> 1;
    final int middleValue = (array[middleIndex]);

    if (middleValue < (int) k) {
      low = middleIndex + 1;
    } else if (middleValue > (int) k) {
      high = middleIndex - 1;
    } else {
      return middleIndex;
    }
  }
  // we finish the job with a sequential search
  int x = low;
  for (; x <= high; ++x) {
    final int val = (array[x]);
    if (val >= (int) k) {
      if (val == (int) k) {
        return x;
      }
      break;
    }
  }
  return -(x + 1);
}

// 查看缓存行大小
// cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size

/**
    org.roaringbitmap.BitmapContainer
*/

@Override
public Container add(final char i) {
  final long previous = bitmap[i >>> 6];
  long newval = previous | (1L << i); // 1 << (i % 64)
  bitmap[i >>> 6] = newval;
  if (USE_BRANCHLESS) {    // optimization flag: whether the cardinality of the bitmaps
                           // is maintained through branchless operations
    cardinality += (int)((previous ^ newval) >>> i);    // 无需判断 previous 是否等于 newval
                                                   // 避免 CPU 分支预测错误带来的性能开销
  } else if (previous != newval) {
    ++cardinality;
  }
  return this;
}


@Override
public boolean contains(final char i) {
  return (bitmap[i >>> 6] & (1L << i)) != 0;
}


@Override
public int getCardinality() {
  return cardinality;
}


@Override
public Container remove(final char i) {
  int index = i >>> 6;
  long bef = bitmap[index];
  long mask = 1L << i;
  if (cardinality == ArrayContainer.DEFAULT_MAX_SIZE + 1) {    // DEFAULT_MAX_SIZE = 4096
    // this is
    // the
    // uncommon
    // path
    if ((bef & mask) != 0) {
      --cardinality;
      bitmap[i >>> 6] = bef & ~mask;    // 存储数 i 对应位置成 0
      return this.toArrayContainer();
    }
  }
  long aft = bef & ~mask;    // 存储数 i 的位由 1 置成 0
  cardinality -= (aft - bef) >>> 63;    // 存储数 i 的位若由 1 置成 0，aft - bef < 0，其最高位必为 1
  bitmap[index] = aft;
  return this;
}


/**
 * Fill the array with set bits
 *
 * @param array container (should be sufficiently large)
 */
void fillArray(final char[] array) {
  int pos = 0;
  int base = 0;
  for (int k = 0; k < bitmap.length; ++k) {
    long bitset = bitmap[k];
    while (bitset != 0) {
      // Long.numberOfTrailingZeros
      // Returns the number of zero bits following the lowest-order ("rightmost") one-bit 
      // in the two's complement binary representation of the specified long value. 
      // Returns 64 if the specified value has no one-bits in its two's complement representation,
      // in other words if it is equal to zero.
      array[pos++] = (char) (base + numberOfTrailingZeros(bitset));
      bitset &= (bitset - 1);    // 把最低 1 位置成 0
    }
    base += 64;
  }
}


@Override
public Container and(final BitmapContainer value2) {
  int newCardinality = 0;
  for (int k = 0; k < this.bitmap.length; ++k) {
    // [2] CPU 实现了快速计算一个处理字中位 1 个数的指令，Long.bitCount 底层基于此指令实现
    newCardinality += Long.bitCount(this.bitmap[k] & value2.bitmap[k]);
  }
  if (newCardinality > ArrayContainer.DEFAULT_MAX_SIZE) {
    final BitmapContainer answer = new BitmapContainer();
    for (int k = 0; k < answer.bitmap.length; ++k) {
      answer.bitmap[k] = this.bitmap[k] & value2.bitmap[k];
    }
    answer.cardinality = newCardinality;
    return answer;
  }
  ArrayContainer ac = new ArrayContainer(newCardinality);
  Util.fillArrayAND(ac.content, this.bitmap, value2.bitmap);
  ac.cardinality = newCardinality;
  return ac;
}


@Override
public ArrayContainer and(final ArrayContainer value2) {
  final ArrayContainer answer = new ArrayContainer(value2.content.length);
  int c = value2.cardinality;
  for (int k = 0; k < c; ++k) {
    char v = value2.content[k];
    answer.content[answer.cardinality] = v;
    answer.cardinality += (int)this.bitValue(v);
  }
  return answer;
}
int bitValue(final char i) {
  return (int)(bitmap[i >>> 6] >>> i ) & 1;
}

/**
    空间压缩（优化），比较当前容器占据空间和使用 RunContainer 占据空间的大小关系
*/
@Override
public Container runOptimize() {
  // 计算一下，如果用 RunContainer 存储，需要多少轮（每轮需要 2 个 char，4 Bytes）
  int numRuns = numberOfRunsLowerBound(MAXRUNS); // decent choice

  int sizeAsRunContainerLowerBound = RunContainer.serializedSizeInBytes(numRuns);

  if (sizeAsRunContainerLowerBound >= getArraySizeInBytes()) {
    return this;
  }
  // else numRuns is a relatively tight bound that needs to be exact
  // in some cases (or if we need to make the runContainer the right
  // size)
  numRuns += numberOfRunsAdjustment();
  int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);

  if (getArraySizeInBytes() > sizeAsRunContainer) {
    return new RunContainer(this, numRuns);
  } else {
    return this;
  }
}

/**
    org.roaringbitmap.ArrayContainer
*/

// ArrayContainer 动态扩容设计，DEFAULT_INIT_SIZE = 4
private void increaseCapacity(boolean allowIllegalSize) {
  int newCapacity = (this.content.length == 0) ? DEFAULT_INIT_SIZE
      : this.content.length < 64 ? this.content.length * 2
          : this.content.length < 1067 ? this.content.length * 3 / 2
              : this.content.length * 5 / 4;
  // never allocate more than we will ever need
  if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE && !allowIllegalSize) {
    newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
  }
  // if we are within 1/16th of the max, go to max
  if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE - ArrayContainer.DEFAULT_MAX_SIZE / 16
      && !allowIllegalSize) {
    newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
  }
  this.content = Arrays.copyOf(this.content, newCapacity);
}