How HashMap Works: resize() in Detail

When a HashMap holds many entries, lookups can degrade toward O(n) as the bucket lists grow. resize() counters this by enlarging the table so that keys' bucketIndex values spread more evenly, preserving O(1) lookups. In Java 8, a bin whose list reaches length 8 is additionally converted to a red-black tree, improving lookups in that bin to O(log n). This article walks through the implementation of resize() and how to get better performance out of HashMap.


Why does resize() exist?

Before diving into resize(), consider why Java has it at all and what it is for. We take for granted that HashMap.get runs in O(1). But suppose the table starts at 16 slots with a load factor of 0.75, and far more entries are inserted (say 256) without the table ever growing: with separate chaining, the list at every bucketIndex would become long (eventually triggering the red-black tree conversion shown in the code from the previous section), and searching a linked list is O(n), nowhere near O(1). In HashMap's setting, the way to tame that O(n) is to keep n small, i.e. to keep the chance of bucketIndex collisions low. That means enlarging the table so that the bucket indexes computed from the keys' hash codes spread out evenly, which is exactly what resize() does.
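To make this concrete, here is a small sketch (the `BucketIndexDemo` class and its hash values are made up for illustration, not HashMap's actual source) of how the bucket index is derived from a hash code, and how doubling the table separates two previously colliding keys:

```java
// Illustration (not JDK code): how the bucket index is computed, and why
// doubling the table spreads colliding keys apart.
public class BucketIndexDemo {

    // HashMap derives the bucket index as hash & (capacity - 1); this is
    // equivalent to hash % capacity because capacity is a power of two.
    static int bucketIndex(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        // Two hypothetical hash values that collide in a 16-slot table:
        // 5 & 15 == 5 and 21 & 15 == 5, so both land in bucket 5.
        System.out.println(bucketIndex(5, 16));  // 5
        System.out.println(bucketIndex(21, 16)); // 5
        // After resizing to 32 slots, the extra mask bit separates them.
        System.out.println(bucketIndex(5, 32));  // 5
        System.out.println(bucketIndex(21, 32)); // 21
    }
}
```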

*When hash collisions pile up, Java 8 replaces the bin's linked list with a red-black tree, improving lookups in that bin to O(log n). The threshold is 8: once a list reaches 8 nodes, it is converted to a tree.

Let's first look at HashMap's member fields, ignoring the tree-related parts for now.

     /**
     * The default initial capacity - MUST be a power of two.
	 * The number of bins should be neither too small nor too large: too few makes resizing trigger easily; too many makes traversing the table slow.
     */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /**
     * The maximum capacity, used if a higher value is implicitly specified
     * by either of the constructors with arguments.
     * MUST be a power of two <= 1<<30.
	 * The maximum capacity is 2^30.
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /**
     * The load factor used when none specified in constructor.
	 * The default load factor is 0.75.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /**
     * The bin count threshold for using a tree rather than list for a
     * bin.  Bins are converted to trees when adding an element to a
     * bin with at least this many nodes. The value must be greater
     * than 2 and should be at least 8 to mesh with assumptions in
     * tree removal about conversion back to plain bins upon
     * shrinkage.
	 * If the hash function is poor, even resizing cannot shorten the lists in the bins, so Java's remedy is to convert an overly long list into a red-black tree. This value means that when a bin's list grows past 8 nodes, it may be converted to a tree.
     */
    static final int TREEIFY_THRESHOLD = 8;

    /**
     * The bin count threshold for untreeifying a (split) bin during a
     * resize operation. Should be less than TREEIFY_THRESHOLD, and at
     * most 6 to mesh with shrinkage detection under removal.
	 * During a resize, if a split tree bin ends up with 6 or fewer nodes, it degenerates back into a linked list.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

    /**
     * The smallest table capacity for which bins may be treeified.
     * (Otherwise the table is resized if too many nodes in a bin.)
     * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
     * between resizing and treeification thresholds.
	 * Before converting to a tree there is one more check: the conversion only happens when the table capacity is at least 64 (below that, a crowded bin triggers a resize instead). This avoids needless conversions early in the table's life, when several keys may happen to land in the same bin.
     */
    static final int MIN_TREEIFY_CAPACITY = 64;

    /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
	 * The hash table itself (the bucket array).
     */
    transient Node<K,V>[] table;


    /**
     * Holds cached entrySet(). Note that AbstractMap fields are used
     * for keySet() and values().
     */
    transient Set<Map.Entry<K,V>> entrySet;


    /**
     * The number of key-value mappings contained in this map.
     */
    transient int size;


    /**
     * The number of times this HashMap has been structurally modified
     * Structural modifications are those that change the number of mappings in
     * the HashMap or otherwise modify its internal structure (e.g.,
     * rehash).  This field is used to make iterators on Collection-views of
     * the HashMap fail-fast.  (See ConcurrentModificationException).
     */
    transient int modCount;
    /**
     * The next size value at which to resize (capacity * load factor).
     *
     * @serial
     */
    // (The javadoc description is true upon serialization.
    // Additionally, if the table array has not been allocated, this
    // field holds the initial array capacity, or zero signifying
    // DEFAULT_INITIAL_CAPACITY.
    int threshold; // the table's current resize threshold
    /**
     * The load factor for the hash table.
     *
     * @serial
     */
    final float loadFactor; // the load factor in use
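A quick sketch of the threshold arithmetic implied by these fields (the `ThresholdDemo` class is illustrative, not part of the JDK):

```java
// Illustration (not JDK code) of how threshold relates to capacity and
// loadFactor: threshold = capacity * loadFactor, and once size exceeds
// the threshold, resize() runs.
public class ThresholdDemo {

    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        // With the defaults (16, 0.75f) the first resize happens on the
        // 13th put, since 16 * 0.75 = 12.
        System.out.println(threshold(16, 0.75f)); // 12
        // Each doubling of the capacity also doubles the threshold.
        System.out.println(threshold(32, 0.75f)); // 24
    }
}
```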

The code of the resize() method is as follows:

/**
     * Initializes or doubles table size.  If null, allocates in
     * accord with initial capacity target held in field threshold.
     * Otherwise, because we are using power-of-two expansion, the
     * elements from each bin must either stay at same index, or move
     * with a power of two offset in the new table.
     *
     * @return the table
     */
    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;  // the old table; null on the first put after construction
        int oldCap = (oldTab == null) ? 0 : oldTab.length; // old table length
        int oldThr = threshold;  // old resize threshold
        int newCap, newThr = 0; // new values to be computed
        if (oldCap > 0) { // the table already exists
            if (oldCap >= MAXIMUM_CAPACITY) { // already at the capacity ceiling: raise the threshold to Integer.MAX_VALUE and return the old table without resizing
                threshold = Integer.MAX_VALUE; 
                return oldTab;
            }
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)   // if the doubled capacity is below the maximum and the old capacity is at least the default
                newThr = oldThr << 1; // double threshold
        }
        else if (oldThr > 0) // initial capacity was placed in threshold  
            newCap = oldThr;      
        else {               // zero initial threshold signifies using defaults: with the no-arg constructor the first put lands here, giving capacity 16 and threshold 12
            newCap = DEFAULT_INITIAL_CAPACITY;   
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        threshold = newThr; // store the newly computed threshold
        @SuppressWarnings({"rawtypes","unchecked"})
            Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap]; // allocate a new table at the computed capacity
        table = newTab; // publish the new table
        if (oldTab != null) { // if the old table holds data, recompute each node's bucketIndex for the new capacity and move it into the new table
            for (int j = 0; j < oldCap; ++j) { 
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    oldTab[j] = null;
                    if (e.next == null)
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }
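The lo/hi split in the loop above hinges on a single bit: since newCap == oldCap * 2, the new mask newCap - 1 exposes exactly one more hash bit than the old one, and (e.hash & oldCap) tests that bit. Here is a small sketch of the resulting index rule (the `SplitDemo` class and its hash values are illustrative, not JDK code):

```java
// Illustration (not JDK code): during resize, a node in bucket j of the
// old table moves either to bucket j (the "lo" list) or to bucket
// j + oldCap (the "hi" list), depending on the one hash bit that the
// doubled mask newly exposes.
public class SplitDemo {

    static int newIndex(int hash, int oldCap) {
        return (hash & oldCap) == 0
                ? hash & (oldCap - 1)             // bit is 0: same index
                : (hash & (oldCap - 1)) + oldCap; // bit is 1: shifted by oldCap
    }

    public static void main(String[] args) {
        int oldCap = 16;
        // Hashes 3 and 19 share old bucket 3 (3 & 15 == 19 & 15 == 3).
        System.out.println(newIndex(3, oldCap));  // 3  (bit 16 is 0 -> lo list)
        System.out.println(newIndex(19, oldCap)); // 19 (bit 16 is 1 -> hi list: 3 + 16)
    }
}
```

This is also why the loop can move whole sublists at once instead of recomputing a full index per node: within one old bucket, the new index has only two possible values.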

Summary:

1. On resize, HashMap replaces the old array with a new one, recomputes every element's index from its hash, and reinserts all entries; resize is an expensive operation.
2. Each resize doubles the length of the table.
3. Once the table length reaches 2^30 (MAXIMUM_CAPACITY), the array is never grown again; the threshold is simply set to Integer.MAX_VALUE.
4. When collisions pile up and a bin's list reaches length 8, the list is converted to a red-black tree (a Java 8 optimization).

Optimizing HashMap

       The analysis above shows that a Map implementation resizes itself so it can handle an arbitrary number of entries efficiently, but resizing is expensive: every element must be reinserted into the new array, because a different array size maps objects to different indexes. Keys that collided before may no longer collide, and keys that didn't collide may now collide. It follows that if the Map is made large enough up front, resizing can be reduced or avoided entirely, which can be a significant speedup.
How to improve performance?
1. When creating a large HashMap, take advantage of the other constructor:
/**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     * @throws IllegalArgumentException if the initial capacity is negative
     *         or the load factor is nonpositive
     */
    public HashMap(int initialCapacity, float loadFactor) 
initialCapacity is the initial capacity and loadFactor the load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at creation time. The load factor measures how full the table may get before it grows automatically: when the number of entries exceeds loadFactor * capacity, the capacity is doubled and the entries rehashed.
Avoid putting the HashMap through repeated rehashes; resizing is expensive. The defaults are an initialCapacity of 16 and a loadFactor of 0.75, so if you can accurately estimate the capacity you need, pass the best size up front. The same reasoning applies to Hashtable and Vector.
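For example, a pre-sizing sketch (the `capacityFor` helper is a hypothetical name, not a JDK API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration: pre-size the table so the expected entry count fits under
// the threshold (capacity * loadFactor), avoiding every intermediate
// resize from 16 up to the final capacity.
public class PresizeDemo {

    static int capacityFor(int expected, float loadFactor) {
        return (int) (expected / loadFactor) + 1;
    }

    public static void main(String[] args) {
        int expected = 10_000;
        Map<String, Integer> map =
                new HashMap<>(capacityFor(expected, 0.75f), 0.75f);
        for (int i = 0; i < expected; i++) {
            map.put("key" + i, i);
        }
        System.out.println(map.size()); // 10000
    }
}
```

(The constructor rounds the requested capacity up to the next power of two internally, so passing a slightly generous estimate is harmless.)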
