A Study of the BitSet Source Code

Java BitSet Explained

Latest recommended article published on 2021-08-18 19:45:15

License: CC 4.0 BY-SA

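I have been reading about Bloom filters these past few days. Java has no way to address individual bits as directly as C/C++ does, so a substitute is needed:

1) use an integer array;

2) use BitSet.

A BitSet is in effect a Vector of bits. Use it when you need to store a large amount of on/off information compactly. Its advantage is in size only: if access speed is what matters, it will be somewhat slower than an array of a primitive type.

The size of a BitSet is not necessarily the size you asked for. The value returned by size() is always a multiple of 64, which follows from how the storage is allocated. Suppose a BitSet is instantiated like this: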

BitSet set = new BitSet(129);

Let us see how the storage is actually allocated. The constructor source is:

/**
 * Creates a bit set whose initial size is large enough to explicitly
 * represent bits with indices in the range <code>0</code> through
 * <code>nbits-1</code>. All bits are initially <code>false</code>.
 *
 * @param     nbits   the initial size of the bit set.
 * @exception NegativeArraySizeException if the specified initial size
 *               is negative.
 */
public BitSet(int nbits) {
    // nbits can't be negative; size 0 is OK
    if (nbits < 0)
        throw new NegativeArraySizeException("nbits < 0: " + nbits);

    initWords(nbits);
    sizeIsSticky = true;
}

private void initWords(int nbits) {
    words = new long[wordIndex(nbits-1) + 1];
}

The actual storage is determined by initWords, which instantiates a long array. So what does wordIndex do? Its source is:

/**
 * Given a bit index, return word index containing it.
 */
private static int wordIndex(int bitIndex) {
    return bitIndex >> ADDRESS_BITS_PER_WORD;
}

This involves the constant ADDRESS_BITS_PER_WORD, which is defined in the source as:

private final static int ADDRESS_BITS_PER_WORD = 6;

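Since 2^6 = 64, passing 129 makes the constructor allocate long[((129-1)>>6)+1], that is, long[3]. (The inner parentheses matter when writing this out by hand: in Java, + binds tighter than >>.) This makes it clear that options 1) and 2) above are really the same idea: both use integers (4 or 8 bytes each) to hold a run of bits, and then manipulate individual bits by combining a word with a mask (often written in hexadecimal) using &, | and ~.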
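To make the similarity between the two options concrete, here is a minimal sketch of option 1) written by hand on top of a long array. TinyBitArray and wordsFor are hypothetical names made up for this illustration, not JDK classes, and the sketch does no bounds checking or growth.

```java
// Hypothetical illustration class; not part of the JDK.
class TinyBitArray {
    private static final int ADDRESS_BITS_PER_WORD = 6; // 2^6 = 64 bits per long
    private final long[] words;

    TinyBitArray(int nbits) {
        words = new long[wordsFor(nbits)];
    }

    // Number of 64-bit words needed for nbits bits.
    // The shift is parenthesised because '+' binds tighter than '>>'.
    static int wordsFor(int nbits) {
        return ((nbits - 1) >> ADDRESS_BITS_PER_WORD) + 1;
    }

    // i >> 6 selects the word; 1L << i builds the mask inside that word.
    void set(int i)    { words[i >> ADDRESS_BITS_PER_WORD] |= (1L << i); }
    void clear(int i)  { words[i >> ADDRESS_BITS_PER_WORD] &= ~(1L << i); }
    boolean get(int i) { return (words[i >> ADDRESS_BITS_PER_WORD] & (1L << i)) != 0; }

    public static void main(String[] args) {
        TinyBitArray bits = new TinyBitArray(129);
        bits.set(128);
        System.out.println(wordsFor(129));       // 3
        System.out.println(bits.get(128));       // true
        System.out.println((129 - 1) >> 6 + 1);  // 1: without parentheses this is 128 >> 7
    }
}
```

The last line shows why the word count has to be written with the extra parentheses: the unparenthesised expression evaluates to 1, not 3.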

Next, some of the other important methods.

1) The set method. Its source:

/**
 * Sets the bit at the specified index to <code>true</code>.
 *
 * @param     bitIndex   a bit index.
 * @exception IndexOutOfBoundsException if the specified index is negative.
 * @since     JDK1.0
 */
public void set(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    int wordIndex = wordIndex(bitIndex);
    expandTo(wordIndex);

    words[wordIndex] |= (1L << bitIndex); // Restores invariants

    checkInvariants();
}

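This method sets the bit at bitIndex to true. The explanation is as follows.

Setting a bit means changing one element of the long array. First we must decide which element to change, then update it with a bitwise OR. In the code above, bitIndex >> 6 picks the element. The expandTo method comes next; skip it for now and look directly at this line: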

words[wordIndex] |= (1L << bitIndex); // Restores invariants

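This is not obvious at first. The point is that Java reduces the shift distance modulo the width of the type: for a long, only the low 6 bits of the distance are used, so it is taken modulo 64. Shifting a long left by 65, for example, actually shifts it by 65 % 64 = 1. For a non-negative bitIndex, the line is therefore equivalent to:

int transferBits = bitIndex % 64;
words[wordIndex] |= (1L << transferBits);

Written this way it is much clearer.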
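The shift behaviour can be checked directly. The sketch below compares the implicit form used in the JDK source with the explicit remainder form; ShiftMaskDemo, maskImplicit and maskExplicit are hypothetical names for this illustration, and the comparison assumes a non-negative bitIndex.

```java
// Hypothetical illustration class; not part of the JDK.
class ShiftMaskDemo {
    // Mask as written in BitSet: relies on Java using only the low 6 bits
    // of the shift distance when the left operand is a long.
    static long maskImplicit(int bitIndex) { return 1L << bitIndex; }

    // The same mask with the remainder spelled out (bitIndex must be >= 0).
    static long maskExplicit(int bitIndex) { return 1L << (bitIndex % 64); }

    public static void main(String[] args) {
        System.out.println((1L << 65) == (1L << 1)); // true: 65 % 64 = 1
        System.out.println((1 << 33) == (1 << 1));   // true: int shifts use the low 5 bits
        int[] samples = {0, 63, 64, 65, 129, 1000};
        for (int bitIndex : samples) {
            System.out.println(bitIndex + " -> "
                    + (maskImplicit(bitIndex) == maskExplicit(bitIndex)));
        }
    }
}
```

Note that the left operand must be the long literal 1L; with a plain int 1 the distance would be reduced modulo 32 instead.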

The counterpart of set is clear:

/**
 * Sets the bit specified by the index to <code>false</code>.
 *
 * @param     bitIndex   the index of the bit to be cleared.
 * @exception IndexOutOfBoundsException if the specified index is negative.
 * @since     JDK1.0
 */
public void clear(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    int wordIndex = wordIndex(bitIndex);
    if (wordIndex >= wordsInUse)
        return;

    words[wordIndex] &= ~(1L << bitIndex);

    recalculateWordsInUse();
    checkInvariants();
}

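This code reads much like set; it sets the bit at the given index to false. If the index lies beyond the words in use, the bit is already false, so the method returns without doing anything.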
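A small usage sketch of set and clear through the public API. ClearDemo and sample are hypothetical names; the size printed at the end assumes the OpenJDK implementation quoted in this article, since the Javadoc leaves size() implementation-dependent.

```java
import java.util.BitSet;

// Hypothetical illustration class; not part of the JDK.
class ClearDemo {
    static BitSet sample() {
        BitSet bits = new BitSet(64);
        bits.set(5);
        bits.set(9);
        bits.clear(5);    // bit 5 goes back to false
        bits.clear(1000); // beyond the words in use: early return, nothing allocated
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = sample();
        System.out.println(bits);        // {9}
        System.out.println(bits.get(5)); // false
        System.out.println(bits.size()); // 64 with the OpenJDK source shown here
    }
}
```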

One method used above deserves an explanation along the way.

The expandTo method:

/**
 * Ensures that the BitSet can accommodate a given wordIndex,
 * temporarily violating the invariants.  The caller must
 * restore the invariants before returning to the user,
 * possibly using recalculateWordsInUse().
 * @param   wordIndex the index to be accommodated.
 */
private void expandTo(int wordIndex) {
    int wordsRequired = wordIndex+1;
    if (wordsInUse < wordsRequired) {
        ensureCapacity(wordsRequired);
        wordsInUse = wordsRequired;
    }
}

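This introduces another field, wordsInUse, defined as follows:

/**
 * The number of words in the logical size of this BitSet.
 */
private transient int wordsInUse = 0;

As its comment says, this field is the logical size of the BitSet measured in words. Given a wordIndex, expandTo compares the number of words required (wordIndex + 1) with this logical size; if the logical size is smaller, it calls ensureCapacity: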

private void ensureCapacity(int wordsRequired) {
    if (words.length < wordsRequired) {
        // Allocate larger of doubled size or required size
        int request = Math.max(2 * words.length, wordsRequired);
        words = Arrays.copyOf(words, request);
        sizeIsSticky = false;
    }
}

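In other words, if the array is too short, words grows to the larger of twice its current length and the required length, the contents are copied over, and sizeIsSticky is set to false. That field is defined as follows:

/**
 * Whether the size of "words" is user-specified.  If so, we assume
 * the user knows what he's doing and try harder to preserve it.
 */
private transient boolean sizeIsSticky = false;

Once ensureCapacity returns, expandTo sets wordsInUse to wordsRequired. (In other words, a BitSet grows automatically.)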
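The growth rule can be observed from outside through size(). This is only a sketch: GrowthDemo and sizeAfterSet are hypothetical names, and the exact numbers assume the OpenJDK ensureCapacity quoted above, because the Javadoc does not promise a particular growth policy.

```java
import java.util.BitSet;

// Hypothetical illustration class; not part of the JDK.
class GrowthDemo {
    // Creates a BitSet with the given initial size, sets one bit,
    // and reports how many bits of storage it ends up with.
    static int sizeAfterSet(int initialBits, int bitIndex) {
        BitSet bits = new BitSet(initialBits);
        bits.set(bitIndex);
        return bits.size();
    }

    public static void main(String[] args) {
        // 4 words, 5 required: max(2 * 4, 5) = 8 words, so doubling wins.
        System.out.println(sizeAfterSet(256, 256)); // 512 on OpenJDK
        // 1 word, 16 required: max(2 * 1, 16) = 16 words, so the requirement wins.
        System.out.println(sizeAfterSet(64, 1000)); // 1024 on OpenJDK
        // Index already fits: no reallocation at all.
        System.out.println(sizeAfterSet(256, 10));  // 256 on OpenJDK
    }
}
```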

2) The get method:

/**
 * Returns the value of the bit with the specified index. The value
 * is <code>true</code> if the bit with the index <code>bitIndex</code>
 * is currently set in this <code>BitSet</code>; otherwise, the result
 * is <code>false</code>.
 *
 * @param     bitIndex   the bit index.
 * @return    the value of the bit with the specified index.
 * @exception IndexOutOfBoundsException if the specified index is negative.
 */
public boolean get(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    checkInvariants();

    int wordIndex = wordIndex(bitIndex);
    return (wordIndex < wordsInUse)
        && ((words[wordIndex] & (1L << bitIndex)) != 0);
}

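The key part is the final return statement:

return (wordIndex < wordsInUse) && ((words[wordIndex] & (1L << bitIndex)) != 0);

The bit is reported as true only when wordIndex is within range (wordIndex < wordsInUse) and the corresponding bit of words[wordIndex] is non-zero.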
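A short sketch of how get behaves at the edges: an index far beyond the allocated storage simply yields false, while a negative index throws. GetDemo and throwsOnNegative are hypothetical names used only for this illustration.

```java
import java.util.BitSet;

// Hypothetical illustration class; not part of the JDK.
class GetDemo {
    // Returns true if get(-1) throws IndexOutOfBoundsException, as documented.
    static boolean throwsOnNegative(BitSet bits) {
        try {
            bits.get(-1);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(64);
        bits.set(3);
        System.out.println(bits.get(3));            // true
        System.out.println(bits.get(4));            // false: in range but not set
        System.out.println(bits.get(100000));       // false: wordIndex >= wordsInUse
        System.out.println(throwsOnNegative(bits)); // true
    }
}
```

Reading past the end never allocates anything, which is why get can be used freely as a membership test.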

3) The size() method:

/**
 * Returns the number of bits of space actually in use by this
 * <code>BitSet</code> to represent bit values.
 * The maximum element in the set is the size - 1st element.
 *
 * @return  the number of bits currently in this bit set.
 */
public int size() {
    return words.length * BITS_PER_WORD;
}

There is another constant here, defined as follows:

private final static int ADDRESS_BITS_PER_WORD = 6;
private final static int BITS_PER_WORD = 1 << ADDRESS_BITS_PER_WORD;

Clearly BITS_PER_WORD = 64. The important consequence is that size() always returns a multiple of 64, and this is the reason.
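The rounding to whole words is easy to observe. In this sketch SizeDemo and allocatedBits are hypothetical names, and the printed values assume the OpenJDK constructor quoted earlier in this article.

```java
import java.util.BitSet;

// Hypothetical illustration class; not part of the JDK.
class SizeDemo {
    // Bits of storage a freshly constructed BitSet(nbits) reports.
    static int allocatedBits(int nbits) {
        return new BitSet(nbits).size();
    }

    public static void main(String[] args) {
        int[] requests = {0, 1, 64, 65, 129};
        for (int nbits : requests) {
            // On OpenJDK: 0 -> 0, 1 -> 64, 64 -> 64, 65 -> 128, 129 -> 192
            System.out.println(nbits + " -> " + allocatedBits(nbits));
        }
    }
}
```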

4) A method similar to size: length(). Its source:

/**
 * Returns the "logical size" of this <code>BitSet</code>: the index of
 * the highest set bit in the <code>BitSet</code> plus one. Returns zero
 * if the <code>BitSet</code> contains no set bits.
 *
 * @return  the logical size of this <code>BitSet</code>.
 * @since   1.2
 */
public int length() {
    if (wordsInUse == 0)
        return 0;

    return BITS_PER_WORD * (wordsInUse - 1) +
        (BITS_PER_WORD - Long.numberOfLeadingZeros(words[wordsInUse - 1]));
}

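The method is short but harder to understand, so let us go through it. According to the comment, it returns the logical size of the BitSet. Suppose you declare a 129-bit BitSet and set bits 23, 45 and 67: the logical size is then 68. In other words, the logical size is the index of the highest set bit plus one.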
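The difference between length(), size() and cardinality() for that example can be checked with a few lines. LengthDemo and sample are hypothetical names; the size value assumes the OpenJDK allocation discussed in this article.

```java
import java.util.BitSet;

// Hypothetical illustration class; not part of the JDK.
class LengthDemo {
    static BitSet sample() {
        BitSet bits = new BitSet(129);
        bits.set(23);
        bits.set(45);
        bits.set(67);
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = sample();
        System.out.println(bits.length());         // 68: highest set index (67) plus one
        System.out.println(bits.cardinality());    // 3: number of bits set to true
        System.out.println(bits.size());           // 192 on OpenJDK: three 64-bit words
        System.out.println(new BitSet().length()); // 0: no bits set
    }
}
```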

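It uses Long.numberOfLeadingZeros, for which I could not find a good explanation online. The method returns the number of zero bits in front of the highest one bit in the 64-bit representation of its argument. An experiment:

long test = 1;
System.out.println(Long.numberOfLeadingZeros(test << 3));               // 60
System.out.println(Long.numberOfLeadingZeros(test << 40));              // 23
System.out.println(Long.numberOfLeadingZeros(test << 40 | test << 4));  // 23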
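Since 64 minus the number of leading zeros is the index of the highest one bit plus one, the length() formula can be rebuilt by hand. The sketch below does this on a plain long array and compares the result with BitSet.length(); LogicalLengthDemo and logicalLength are hypothetical names, and the loop that skips trailing zero words stands in for the wordsInUse bookkeeping that BitSet does internally.

```java
import java.util.BitSet;

// Hypothetical illustration class; not part of the JDK.
class LogicalLengthDemo {
    static int logicalLength(long[] words) {
        // Find the last non-zero word; BitSet tracks this as wordsInUse.
        int wordsInUse = words.length;
        while (wordsInUse > 0 && words[wordsInUse - 1] == 0) {
            wordsInUse--;
        }
        if (wordsInUse == 0) {
            return 0;
        }
        // Full words below the last one, plus the used bits of the last word.
        return 64 * (wordsInUse - 1)
                + (64 - Long.numberOfLeadingZeros(words[wordsInUse - 1]));
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(129);
        bits.set(23);
        bits.set(45);
        bits.set(67);
        // toLongArray() exposes the words in use as a copy.
        System.out.println(logicalLength(bits.toLongArray())); // 68
        System.out.println(bits.length());                     // 68
    }
}
```

Bit 67 lives in the second word at position 3, so that word has 60 leading zeros and the result is 64 * 1 + (64 - 60) = 68.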