How HashSet Keeps Its Elements Unique

Original sources: https://neverknowstomorrow.github.io/2019/04/15/HashSet/
https://juejin.cn/post/6844904106855759879

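  1. How HashSet works
    HashSet is backed by a HashMap. Calling add() on a HashSet delegates to the HashMap's put() method: the argument to add() becomes the key, and a shared placeholder Object (the static constant PRESENT) is stored as the value.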
    /**
     * Constructs a new, empty set; the backing {@code HashMap} instance has
     * default initial capacity (16) and load factor (0.75).
     */
    public HashSet() {
        map = new HashMap<>();
    }
    ......
    
    /**
     * Adds the specified element to this set if it is not already present.
     * More formally, adds the specified element {@code e} to this set if
     * this set contains no element {@code e2} such that
     * {@code Objects.equals(e, e2)}.
     * If this set already contains the element, the call leaves the set
     * unchanged and returns {@code false}.
     *
     * @param e element to be added to this set
     * @return {@code true} if this set did not already contain the specified
     * element
     */
    public boolean add(E e) {
        return map.put(e, PRESENT)==null;
    }
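
The return value of add() follows directly from put(): a first insertion returns null from put(), so add() returns true; a repeated insertion returns the previous value (PRESENT), so add() returns false. A minimal sketch of that behaviour, where the class name HashSetAddDemo and the helper addTwice are made up for illustration:

```java
import java.util.HashSet;
import java.util.Set;

class HashSetAddDemo {
    // Adds the same value twice and reports what each add() call returned,
    // plus whether the set ended up with exactly one element.
    static boolean[] addTwice(String value) {
        Set<String> set = new HashSet<>();
        boolean first = set.add(value);   // no previous mapping: put() returned null
        boolean second = set.add(value);  // key already present: put() returned the old value
        return new boolean[] { first, second, set.size() == 1 };
    }

    public static void main(String[] args) {
        boolean[] r = addTwice("java");
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints "true false true"
    }
}
```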

So how does HashMap keep its keys unique?


    /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with {@code key}, or
     *         {@code null} if there was no mapping for {@code key}.
     *         (A {@code null} return can also indicate that the map
     *         previously associated {@code null} with {@code key}.)
     */
    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }
    
    /**
     * Implements Map.put and related methods.
     *
     * @param hash hash for key
     * @param key the key
     * @param value the value to put
     * @param onlyIfAbsent if true, don't change existing value
     * @param evict if false, the table is in creation mode.
     * @return previous value, or null if none
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }
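
The listing calls hash(key) and then indexes the table with (n - 1) & hash, but hash() itself is not shown. In JDK 8 and later it XORs the high 16 bits of hashCode() into the low 16 bits. The sketch below reproduces those two steps outside HashMap; the class BucketIndexDemo and the method names spread and indexFor are illustrative, not JDK API.

```java
class BucketIndexDemo {
    // Same spreading step as HashMap.hash() in JDK 8+:
    // fold the high 16 bits into the low 16 so small tables still see them.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Bucket index for a table whose length n is a power of two,
    // mirroring the expression (n - 1) & hash in putVal().
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16; // default initial capacity
        for (String key : new String[] { "Aa", "BB", "hello" }) {
            System.out.println(key + " -> bucket " + indexFor(spread(key), n));
        }
    }
}
```

Because n is always a power of two, the mask (n - 1) keeps only the low bits of the hash, which is why the high bits are mixed in first.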

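In short, the code first derives a hash from the key's hashCode() (via hash(key)) and uses it to locate a bucket in the table, then checks that bucket for an existing entry with the same hash.

If the bucket is empty, the new entry is stored directly.
If the bucket already holds entries, each entry with the same hash is compared with the key, first by reference (==) and then with equals(). If no entry matches, the new entry is added; if one matches, the key is not stored again and only the value is updated.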
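
To see both branches in action, the strings "Aa" and "BB" are a convenient pair: they have the same hashCode() but are not equal. A small sketch (the class CollisionDemo and the helper sizeAfterAdding are illustrative names):

```java
import java.util.HashSet;
import java.util.Set;

class CollisionDemo {
    // Adds all values to a fresh HashSet and returns the resulting size.
    static int sizeAfterAdding(String... values) {
        Set<String> set = new HashSet<>();
        for (String v : values) {
            set.add(v);
        }
        return set.size();
    }

    public static void main(String[] args) {
        // Same hash, different content: equals() is false, so both are stored.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // prints "true"
        System.out.println(sizeAfterAdding("Aa", "BB"));        // prints "2"

        // Same hash, equal content (different object): equals() is true, so only one is kept.
        System.out.println(sizeAfterAdding("Aa", new String("Aa"))); // prints "1"
    }
}
```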

  2. Storing objects of a custom class in a HashSet without duplicates

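The class must override both hashCode() and equals().
hashCode(): objects with identical fields must return the same value; objects with different fields should return different values where possible (this improves efficiency).
equals(): returns true when the fields are identical and false otherwise; the element is stored only when equals() returns false for every existing element with the same hash. Both methods must be overridden together, because the defaults inherited from Object are identity-based: equals() compares references, and hashCode() typically differs between two distinct instances.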
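
As an illustration of these rules, the sketch below compares a class that keeps the inherited identity-based methods with one that overrides both. RawUser, User, and the two helper methods are hypothetical names chosen for this example.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class CustomKeyDemo {
    // Hypothetical class that keeps Object's identity-based equals()/hashCode().
    static class RawUser {
        final String name;
        final int age;
        RawUser(String name, int age) { this.name = name; this.age = age; }
    }

    // Hypothetical class that overrides both methods based on its fields.
    static class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof User)) return false;
            User other = (User) o;
            return age == other.age && Objects.equals(name, other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, age); // equal fields always give an equal hash
        }
    }

    static int rawSize() {
        Set<RawUser> set = new HashSet<>();
        set.add(new RawUser("Tom", 20));
        set.add(new RawUser("Tom", 20)); // distinct instance, never equal to the first
        return set.size();
    }

    static int overriddenSize() {
        Set<User> set = new HashSet<>();
        set.add(new User("Tom", 20));
        set.add(new User("Tom", 20)); // same hash and equals() is true: rejected
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(rawSize());        // prints "2"
        System.out.println(overriddenSize()); // prints "1"
    }
}
```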

  3. The 6 and 8 in HashMap

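Internally, HashMap is implemented as an array plus linked lists and red-black trees. When a collision chain grows beyond 8 nodes (TREEIFY_THRESHOLD), the list is converted to a red-black tree to reduce lookup cost. When removals or resizing shrink the bin enough (to 6 nodes or fewer, UNTREEIFY_THRESHOLD, when a bin is split during a resize), the tree is converted back to a list.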

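Note: before converting a list to a red-black tree, HashMap checks the table size. If the chain length exceeds 8 but the array length is less than 64 (MIN_TREEIFY_CAPACITY), the list is not converted; the array is resized instead.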

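When the array is small, the red-black tree structure is best avoided. A red-black tree is a self-balancing binary search tree that needs left rotations, right rotations, and recoloring to stay balanced, so it is considerably more complex than a list. With a small array, resizing the array is cheaper than maintaining a tree. In summary: to improve performance and reduce search time, a list is converted to a red-black tree only when the chain length exceeds 8 and the array length is at least 64.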
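
The conditions above can be condensed into a small decision sketch. This is not HashMap's code: the class TreeifyDecisionDemo and the helpers onAppend and onSplit are hypothetical, and only the three constant values are taken from java.util.HashMap (JDK 8+). It also ignores the separate, shape-based check HashMap uses when nodes are removed from a tree bin.

```java
class TreeifyDecisionDemo {
    // Values of the corresponding constants in java.util.HashMap (JDK 8+).
    static final int TREEIFY_THRESHOLD = 8;
    static final int UNTREEIFY_THRESHOLD = 6;
    static final int MIN_TREEIFY_CAPACITY = 64;

    // Hypothetical helper: what happens when a node is appended to a list bin
    // that already holds nodesBefore nodes, in a table of the given length.
    static String onAppend(int nodesBefore, int tableLength) {
        if (nodesBefore < TREEIFY_THRESHOLD) {
            return "stay list";
        }
        // Chain is long enough, but a small table is resized instead of treeified.
        return tableLength < MIN_TREEIFY_CAPACITY ? "resize" : "treeify";
    }

    // Hypothetical helper: what happens to one half of a tree bin after it is
    // split during a resize and ends up with nodesInHalf nodes.
    static String onSplit(int nodesInHalf) {
        return nodesInHalf <= UNTREEIFY_THRESHOLD ? "untreeify" : "keep tree";
    }

    public static void main(String[] args) {
        System.out.println(onAppend(7, 64)); // prints "stay list"
        System.out.println(onAppend(8, 16)); // prints "resize"
        System.out.println(onAppend(8, 64)); // prints "treeify"
        System.out.println(onSplit(6));      // prints "untreeify"
        System.out.println(onSplit(7));      // prints "keep tree"
    }
}
```

The gap between 6 and 8 avoids flip-flopping: a bin hovering around a single threshold would otherwise be converted back and forth on every insert and removal.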

Why 8?

    /**
     * Because TreeNodes are about twice the size of regular nodes, we
     * use them only when bins contain enough nodes to warrant use
     * (see TREEIFY_THRESHOLD). And when they become too small (due to
     * removal or resizing) they are converted back to plain bins.  In
     * usages with well-distributed user hashCodes, tree bins are
     * rarely used.  Ideally, under random hashCodes, the frequency of
     * nodes in bins follows a Poisson distribution
     * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
     * parameter of about 0.5 on average for the default resizing
     * threshold of 0.75, although with a large variance because of
     * resizing granularity. Ignoring variance, the expected
     * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
     * factorial(k)). The first values are:
     *
     * 0:    0.60653066
     * 1:    0.30326533
     * 2:    0.07581633
     * 3:    0.01263606
     * 4:    0.00157952
     * 5:    0.00015795
     * 6:    0.00001316
     * 7:    0.00000094
     * 8:    0.00000006
     * more: less than 1 in ten million
     */

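In other words:

A red-black tree node takes about twice the space of an ordinary list node. Ideally, the number of nodes per bin follows a Poisson distribution, and the probability of a bin holding 8 nodes is only 0.00000006. Longer chains are rarer still, so raising the threshold further would bring little benefit.
From this analysis:

If the threshold were lower than 8, bins would turn into red-black trees too readily, making the structure complex from the start;
if the threshold were higher than 8, the conversion would almost never be triggered, and the lookup-time benefit would be lost.
So 8 is a reasonable threshold that balances space against time.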
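
The probabilities quoted from the HashMap source can be reproduced from the formula in that comment, exp(-0.5) * pow(0.5, k) / factorial(k). A minimal sketch, with PoissonBinDemo and poisson as illustrative names:

```java
class PoissonBinDemo {
    // Poisson probability mass: P(k) = exp(-lambda) * lambda^k / k!
    static double poisson(double lambda, int k) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        // lambda = 0.5 is the average quoted in the HashMap comment
        // for the default load factor of 0.75.
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, poisson(0.5, k));
        }
    }
}
```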
