Original sources: https://neverknowstomorrow.github.io/2019/04/15/HashSet/
https://juejin.cn/post/6844904106855759879
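- How HashSet works
HashSet is backed by a HashMap. Calling add() on a HashSet delegates to the put() method of the backing HashMap: the argument of add() is stored as the key, and a shared placeholder Object (the static constant PRESENT) is stored as the value.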
/**
 * Constructs a new, empty set; the backing {@code HashMap} instance has
 * default initial capacity (16) and load factor (0.75).
 */
public HashSet() {
    map = new HashMap<>();
}
......
/**
 * Adds the specified element to this set if it is not already present.
 * More formally, adds the specified element {@code e} to this set if
 * this set contains no element {@code e2} such that
 * {@code Objects.equals(e, e2)}.
 * If this set already contains the element, the call leaves the set
 * unchanged and returns {@code false}.
 *
 * @param e element to be added to this set
 * @return {@code true} if this set did not already contain the specified
 * element
 */
public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}
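The return value of add() follows directly from the return value of put(). Below is a minimal sketch of that behavior; the class name AddReturnDemo and the helper addTwice are made up for illustration and are not part of the JDK.

```java
import java.util.HashSet;
import java.util.Set;

class AddReturnDemo {
    // Adds the same element twice and reports what add() returned each time.
    static boolean[] addTwice(String element) {
        Set<String> set = new HashSet<>();
        // First call: the backing map has no mapping yet, put() returns null, add() returns true.
        boolean first = set.add(element);
        // Second call: the key already exists, put() returns the old value (PRESENT), add() returns false.
        boolean second = set.add(element);
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        boolean[] results = addTwice("a");
        System.out.println(results[0] + " " + results[1]); // prints "true false"
    }
}
```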
So how does HashMap keep its keys unique?
/**
 * Associates the specified value with the specified key in this map.
 * If the map previously contained a mapping for the key, the old
 * value is replaced.
 *
 * @param key key with which the specified value is to be associated
 * @param value value to be associated with the specified key
 * @return the previous value associated with {@code key}, or
 * {@code null} if there was no mapping for {@code key}.
 * (A {@code null} return can also indicate that the map
 * previously associated {@code null} with {@code key}.)
 */
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}
/**
 * Implements Map.put and related methods.
 *
 * @param hash hash for key
 * @param key the key
 * @param value the value to put
 * @param onlyIfAbsent if true, don't change existing value
 * @param evict if false, the table is in creation mode.
 * @return previous value, or null if none
 */
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}
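The main steps in this code: a hash value is first derived from the key's hashCode() and used to select a bucket (tab[(n - 1) & hash]); that bucket is then searched for an entry with the same hash.
If no entry has the same hash, the new entry is stored directly.
If some entries do have the same hash, the key is compared with each of them in turn (by reference first, then with equals()). If every comparison is false, the new entry is stored; if one is true, the key is not stored again and only the value is updated.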
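The steps above can be sketched with two strings that are known to collide: "Aa" and "BB" both have hashCode 2112. The class BucketIndexDemo and its methods are hypothetical names for this illustration. The spread() method mirrors the bit-mixing that HashMap.hash() performs in JDK 8 and later as far as I know, and is reproduced here only as an assumption, since the original text does not show that method.

```java
import java.util.HashSet;
import java.util.Set;

class BucketIndexDemo {
    // Assumed to match HashMap.hash(): XOR the high 16 bits of hashCode() into the low 16 bits.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Bucket index for a table whose length is a power of two, as in tab[(n - 1) & hash].
    static int indexFor(Object key, int tableLength) {
        return (tableLength - 1) & spread(key);
    }

    // Stores two keys with equal hashes but unequal contents and returns the set size.
    static int sizeAfterAddingCollidingKeys() {
        Set<String> set = new HashSet<>();
        set.add("Aa");
        set.add("BB"); // same hash, but equals() is false, so it is stored as a second entry
        set.add("Aa"); // same hash and equals() is true, so nothing new is stored
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode());      // prints "true"
        System.out.println(indexFor("Aa", 16) == indexFor("BB", 16)); // prints "true"
        System.out.println(sizeAfterAddingCollidingKeys());           // prints "2"
    }
}
```

Equal hashes only place the keys in the same bucket; whether a key counts as a duplicate is decided by the reference and equals() comparison.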
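- Storing objects of a custom class in a HashSet without duplicates
The class must override both hashCode() and equals().
hashCode(): objects with equal fields must return the same value; objects with different fields should return different values as far as possible (fewer collisions means better performance).
equals(): returns true when the fields are equal and false otherwise; the object is stored only when every comparison returns false. (Both methods must be overridden to deduplicate custom objects: by default equals() compares object identity, and the default hashCode() is not derived from the fields either, so two distinct objects with equal fields would be treated as different.)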
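As a minimal sketch of the rule above, the example below compares a class that overrides both methods with one that overrides neither. Person, RawPerson, and the surrounding PersonDedupDemo class are hypothetical names chosen for this illustration.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class PersonDedupDemo {
    // Hypothetical value class that overrides both equals() and hashCode().
    static final class Person {
        private final String name;
        private final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Person)) return false;
            Person other = (Person) o;
            return age == other.age && Objects.equals(name, other.name);
        }

        @Override
        public int hashCode() {
            // Equal fields always produce the same hash value.
            return Objects.hash(name, age);
        }
    }

    // Same fields, but equals() and hashCode() are inherited from Object.
    static final class RawPerson {
        final String name;
        final int age;

        RawPerson(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    static int sizeWithOverrides() {
        Set<Person> set = new HashSet<>();
        set.add(new Person("Tom", 20));
        set.add(new Person("Tom", 20)); // equal fields: treated as a duplicate
        return set.size();
    }

    static int sizeWithoutOverrides() {
        Set<RawPerson> set = new HashSet<>();
        set.add(new RawPerson("Tom", 20));
        set.add(new RawPerson("Tom", 20)); // distinct objects: identity equals() is false
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeWithOverrides());    // prints "1"
        System.out.println(sizeWithoutOverrides()); // prints "2"
    }
}
```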
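- The 6 and 8 in HashMap
HashMap is implemented as an array of buckets, each holding a linked list or a red-black tree. When a collision list grows beyond 8 nodes (TREEIFY_THRESHOLD), it is converted to a red-black tree to reduce lookup cost. When a tree bin later becomes small enough again, it is converted back to a linked list: during a resize, a split bin with 6 or fewer nodes (UNTREEIFY_THRESHOLD) is untreeified, and removals can also trigger the conversion once the tree has become very small.
Note: before converting a linked list to a red-black tree, HashMap checks the array length. If the list length exceeds 8 but the array length is less than 64 (MIN_TREEIFY_CAPACITY), the list is not converted to a red-black tree; the array is resized instead.
When the array is small, red-black trees are best avoided. A red-black tree is a relatively complex structure: it is a self-balancing binary search tree (approximately balanced, not strictly balanced like an AVL tree) that needs left rotations, right rotations, and recoloring to stay balanced. With a small array, growing the array, which spreads the colliding entries over more buckets, is cheaper than maintaining a tree. In summary: to improve performance and reduce search time, a linked list is converted to a red-black tree only when its length exceeds 8 and the array length is at least 64.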
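The decision described above can be modeled in a few lines. This is a simplified sketch, not the JDK's code: the class TreeifyDecision, the enum Action, and the method afterAppend are hypothetical, and only the two constant values (8 and 64) are taken from HashMap.

```java
class TreeifyDecision {
    static final int TREEIFY_THRESHOLD = 8;      // same value as in HashMap
    static final int MIN_TREEIFY_CAPACITY = 64;  // same value as in HashMap

    enum Action { KEEP_LIST, RESIZE_TABLE, TREEIFY }

    // Decides what happens to a bin after a node has been appended to its linked list.
    // binSizeAfterAppend: number of nodes in the list, including the new one.
    // tableLength: current length of the bucket array.
    static Action afterAppend(int binSizeAfterAppend, int tableLength) {
        if (binSizeAfterAppend <= TREEIFY_THRESHOLD) {
            return Action.KEEP_LIST;      // list is still short enough
        }
        if (tableLength < MIN_TREEIFY_CAPACITY) {
            return Action.RESIZE_TABLE;   // small table: grow it instead of building a tree
        }
        return Action.TREEIFY;            // long list in a large enough table
    }

    public static void main(String[] args) {
        System.out.println(afterAppend(8, 16));  // prints "KEEP_LIST"
        System.out.println(afterAppend(9, 16));  // prints "RESIZE_TABLE"
        System.out.println(afterAppend(9, 64));  // prints "TREEIFY"
    }
}
```

In putVal(), the check binCount >= TREEIFY_THRESHOLD - 1 fires when the list already held 8 nodes before the append, which is why this sketch compares the size after the append against 8 with a strict "greater than".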
Why 8?
/**
 * Because TreeNodes are about twice the size of regular nodes, we use them only when
 * bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD). And when they
 * become too small (due to removal or resizing) they are converted back to plain bins.
 * In usages with well-distributed user hashCodes, tree bins are rarely used. Ideally,
 * under random hashCodes, the frequency of nodes in bins follows a Poisson distribution
 * (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on
 * average for the default resizing threshold of 0.75, although with a large variance
 * because of resizing granularity. Ignoring variance, the expected occurrences of list
 * size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:
 *
 * 0: 0.60653066
 * 1: 0.30326533
 * 2: 0.07581633
 * 3: 0.01263606
 * 4: 0.00157952
 * 5: 0.00015795
 * 6: 0.00001316
 * 7: 0.00000094
 * 8: 0.00000006
 * more: less than 1 in ten million
 */
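In other words:
A red-black tree node takes about twice the space of an ordinary linked-list node. Ideally, with random hash codes, the number of nodes per bin follows a Poisson distribution. As the table shows, the probability that a bin holds 8 nodes is 0.00000006, and longer lists are rarer still, so raising the threshold further would bring little benefit.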
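The table in the comment can be reproduced from the formula it quotes, exp(-0.5) * pow(0.5, k) / factorial(k). A small sketch follows; the class PoissonBinSizes and the method probability are names invented for this illustration.

```java
class PoissonBinSizes {
    // Poisson probability P(k) = exp(-lambda) * lambda^k / k!
    static double probability(int k, double lambda) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        // lambda = 0.5 is the parameter given in the HashMap comment for load factor 0.75.
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, probability(k, 0.5));
        }
    }
}
```

Rounded to eight decimal places, the values for k = 0 through 8 agree with the list in the comment.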
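From the analysis above:
If the threshold were lower than 8, red-black trees would be used too early, making the structure complex (and about twice as large per node) from the start;
If the threshold were higher than 8, the conversion would almost never trigger, and the long lists that do occur would not get the faster lookups;
So 8 is a reasonable threshold that balances space against time.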