Golang Series: Hash Tables

Published 2024-11-16 09:59:18

License: CC 4.0 BY-SA

Tags: #golang #data-structures #hash-table

Contents
- Source Code Analysis
  - Ordinary hash table
  - sync.Map
- References

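The hash table is one of the most commonly used data structures in any language. Many languages resolve hash collisions with separate chaining and keep a power-of-two number of buckets: a key is passed through the hash function, and the result modulo the bucket count selects a bucket. If that bucket already holds elements, a new node is chained onto it to store the data.

Go differs from Java in the shape of the chain. In Go, each bucket has eight slots, and a bucket is linked to its overflow buckets through an overflow pointer. For a 64-bit hash, the low B bits select the bucket, and the high 8 bits are stored as the slot's tophash, which a lookup compares first to skip non-matching slots quickly. This is somewhat like the quicklist in Redis, where every linked-list node is itself an array, combining the strengths of arrays and linked lists to speed up lookups. Java instead mitigates long chains by converting them into red-black trees.

Growth is likewise driven by the load factor, which is the number of elements actually stored divided by the number of buckets. The threshold is 0.75 in Java and 6.5 in Go, and Go has further conditions for deciding whether to grow. Growth triggered by the load factor doubles the number of buckets. Go, however, grows incrementally, much like Redis: the migration work is spread over subsequent insert and delete operations, each of which may migrate the data of a bucket or two.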
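
The split of a hash into a bucket index and a tophash can be sketched in a few lines. This is only an illustration under assumptions: the helper names bucketIndex and topHash and the sample hash value are made up here, a 64-bit hash is assumed, and minTopHash is copied from the runtime constant of the same name.

```go
package main

import "fmt"

// minTopHash mirrors the runtime constant: tophash values below it are
// reserved for slot and evacuation states.
const minTopHash = 5

// bucketIndex keeps the low B bits of the hash, which equals hash % 2^B.
// (Hypothetical helper, not a runtime function.)
func bucketIndex(hash uint64, B uint8) uint64 {
	return hash & (uint64(1)<<B - 1)
}

// topHash keeps the top 8 bits of a 64-bit hash and moves small values
// out of the reserved range.
func topHash(hash uint64) uint8 {
	top := uint8(hash >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

func main() {
	hash := uint64(0xA1B2C3D4E5F60718) // made-up hash value
	fmt.Println(bucketIndex(hash, 3), topHash(hash))
}
```

Because the bucket count is a power of two, the modulo reduces to a bit mask, which is why the bucket count is stored as its logarithm B.
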
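As in most languages, the ordinary map is not safe for concurrent use. If the runtime detects a concurrent read and write, for example one goroutine iterating while another deletes, it aborts the program with an unrecoverable fatal error. (Deleting entries while iterating within a single goroutine is allowed.) For a concurrency-safe map, Go provides sync.Map, which suits read-heavy workloads. Internally it uses two maps: a read map that serves reads without locking, and a dirty map that is written under a lock. When lookups miss the read map and fall through to the dirty map often enough (the miss count reaches the size of the dirty map), the dirty map is promoted to become the read map. When a new key is stored after that, a fresh dirty map is initialized and the entries of the read map are copied into it.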
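
As a minimal usage sketch of the points above: the helpers fillConcurrently and count are invented for this example, while Store, Load, and Range are the documented sync.Map methods. The second half shows that deleting during a single-goroutine range over a built-in map is fine.

```go
package main

import (
	"fmt"
	"sync"
)

// fillConcurrently stores keys 0..n-1 from n goroutines without extra locking.
func fillConcurrently(n int) *sync.Map {
	m := new(sync.Map)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(k int) {
			defer wg.Done()
			m.Store(k, k*k)
		}(i)
	}
	wg.Wait()
	return m
}

// count walks the map with Range, since sync.Map has no length method.
func count(m *sync.Map) int {
	n := 0
	m.Range(func(_, _ any) bool {
		n++
		return true
	})
	return n
}

func main() {
	m := fillConcurrently(100)
	v, ok := m.Load(7)
	fmt.Println(v, ok, count(m))

	// Single goroutine: deleting while ranging over a built-in map is allowed.
	plain := map[int]int{1: 1, 2: 2, 3: 3}
	for k := range plain {
		delete(plain, k)
	}
	fmt.Println(len(plain))
}
```

Doing the same concurrent writes on a built-in map would need a sync.Mutex or sync.RWMutex around every access.
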
This article describes both kinds of hash table in detail, with source code analysis and experiments.

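Source Code Analysis

The analysis is based on the Go source code at version go1.21.

Ordinary hash table

The source is in map.go in the runtime package.

makemap

First, look at the function the runtime calls when a map is created with make. Note that when a capacity is given, as in make(map[int]int, 8), the runtime computes a bucket count such that inserting 8 elements does not trigger growth. The code is as follows.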

func makemap(t *maptype, hint int, h *hmap) *hmap {
	mem, overflow := math.MulUintptr(uintptr(hint), t.Bucket.Size_)
	if overflow || mem > maxAlloc {
		hint = 0
	}

	// initialize Hmap
	if h == nil {
		h = new(hmap) // allocated with the built-in new
	}
	h.hash0 = fastrand()

	// Find the size parameter B which will hold the requested # of elements.
	// For hint < 0 overLoadFactor returns false since hint < bucketCnt.
	B := uint8(0)
	for overLoadFactor(hint, B) {
		// find the smallest B for which hint elements stay within the load factor,
		// so that inserting them does not trigger growth
		B++
	}
	h.B = B

	// allocate initial hash table
	// if B == 0, the buckets field is allocated lazily later (in mapassign)
	// If hint is large zeroing this memory could take a while.
	if h.B != 0 {
		var nextOverflow *bmap
		h.buckets, nextOverflow = makeBucketArray(t, h.B, nil)
		if nextOverflow != nil {
			h.extra = new(mapextra)
			h.extra.nextOverflow = nextOverflow
		}
	}

	return h
}
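
The loop that picks B can be replayed outside the runtime. The sketch below copies the shape of the runtime's overLoadFactor check with its constants (8 slots per bucket, load factor 13/2 = 6.5); the helper initialB is a name invented here to wrap the loop from makemap.

```go
package main

import "fmt"

const (
	bucketCnt     = 8  // slots per bucket
	loadFactorNum = 13 // loadFactorNum / loadFactorDen = 6.5
	loadFactorDen = 2
)

// overLoadFactor reports whether count elements in 2^B buckets exceed the
// load factor. It follows the runtime helper of the same name.
func overLoadFactor(count int, B uint8) bool {
	return count > bucketCnt && uintptr(count) > loadFactorNum*((uintptr(1)<<B)/loadFactorDen)
}

// initialB repeats the loop in makemap: grow B until hint fits.
// (Hypothetical wrapper, not a runtime function.)
func initialB(hint int) uint8 {
	B := uint8(0)
	for overLoadFactor(hint, B) {
		B++
	}
	return B
}

func main() {
	for _, hint := range []int{0, 8, 9, 14, 100, 1000} {
		B := initialB(hint)
		fmt.Printf("hint=%d B=%d buckets=%d\n", hint, B, uint64(1)<<B)
	}
}
```

With a hint of 8 the loop never runs, so B stays 0 and the single bucket's eight slots hold all eight elements.
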

Next, the overall structure of the map:

// A header for a Go map.
type hmap struct {
	// Note: the format of the hmap is also encoded in cmd/compile/internal/reflectdata/reflect.go.
	// Make sure this stays in sync with the compiler's definition.
	count     int    // # live cells == size of map.  Must be first (used by len() builtin)
	flags     uint8  // status flags
	B         uint8  // log_2 of # of buckets (can hold up to loadFactor * 2^B items); the map has 2^B buckets
	noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for details
	hash0     uint32 // hash seed, randomized per map

	buckets    unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.
	oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing; needed for incremental growth
	nevacuate  uintptr        // progress counter for evacuation (buckets less than this have been evacuated)

	extra *mapextra // optional fields
}

Next, the definition of the bucket struct:

// A bucket for a Go map.
type bmap struct {
	// tophash generally contains the top byte of the hash value
	// for each key in this bucket. If tophash[0] < minTopHash,
	// tophash[0] is a bucket evacuation state instead.
	// A bucket has eight slots. When the first slot's value is below minTopHash,
	// it records the evacuation state of this bucket rather than a hash byte.
	tophash [bucketCnt]uint8
	// Followed by bucketCnt keys and then bucketCnt elems.
	// NOTE: packing all the keys together and then all the elems together makes the
	// code a bit more complicated than alternating key/elem/key/elem/... but it allows
	// us to eliminate padding which would be needed for, e.g., map[int64]int8.
	// Followed by an overflow pointer.
}

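In fact, the compiler generates the concrete bucket type at compile time, based on the key and value types. The generated struct looks roughly like this: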

type bmap struct {
	topbits  [8]uint8
	keys     [8]keytype
	values   [8]valuetype
	pad      uintptr
	overflow uintptr
}

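The tophash values, the keys, and the values are each stored in their own contiguous array. As the source comment explains, for a type such as map[int64]int8 this avoids the space that alignment padding would otherwise waste.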
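
The padding argument can be checked with unsafe.Sizeof. The two struct types and the sizes helper below are made up for this comparison and leave out the tophash array and the overflow pointer; the exact numbers depend on the platform's alignment rules.

```go
package main

import (
	"fmt"
	"unsafe"
)

// interleaved stores key/elem pairs side by side. Each pair is padded so
// that the next int64 key stays aligned.
type interleaved struct {
	pairs [8]struct {
		key  int64
		elem int8
	}
}

// grouped stores all keys first and then all elems, as a map bucket does.
type grouped struct {
	keys  [8]int64
	elems [8]int8
}

// sizes returns the size in bytes of the interleaved and grouped layouts.
func sizes() (uintptr, uintptr) {
	return unsafe.Sizeof(interleaved{}), unsafe.Sizeof(grouped{})
}

func main() {
	a, b := sizes()
	fmt.Printf("interleaved=%d bytes, grouped=%d bytes\n", a, b)
}
```

On a typical 64-bit platform every interleaved pair occupies 16 bytes, 7 of which are padding, while the grouped layout needs no padding between its two arrays.
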

mapaccess

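For the get operation, the source defines two functions, with the signatures func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer and func mapaccess2(t *maptype, h *hmap, key unsafe.Pointer) (unsafe.Pointer, bool). They implement the two read forms, without and with the comma-ok result (v := m[k] and v, ok := m[k]); the compiler chooses between them from the syntax of the expression. The code is as follows: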

// The value is returned as a pointer. Do not hold on to it for long,
// because it can keep the whole map alive during garbage collection.
func mapaccess2(t *maptype, h *hmap, key unsafe.Pointer) (unsafe.Pointer, bool) {
	if raceenabled && h != nil {
		callerpc := getcallerpc()
		pc := abi.FuncPCABIInternal(mapaccess2)
		racereadpc(unsafe.Pointer(h), callerpc, pc)
		raceReadObjectPC(t.Key, key, callerpc, pc)
	}
	if msanenabled && h != nil {
		msanread(key, t.Key.Size_)
	}
	if asanenabled && h != nil {
		asanread(key, t.Key.Size_)
	}
	if h == nil || h.count == 0 {
		if t.HashMightPanic() {
			t.Hasher(key, 0) // see issue 23734
		}
		// return the zero value
		return unsafe.Pointer(&zeroVal[0]), false
	}
	// another goroutine is writing to this map; the ordinary map is not safe for concurrent use
	if h.flags&hashWriting != 0 {
		fatal("concurrent map read and map write")
	}
	hash := t.Hasher(key, uintptr(h.hash0))
	m := bucketMask(h.B) // (1<<B) - 1; hash&m is equivalent to hash % (m + 1)
	// pointer conversion and arithmetic give the start address of the key's bucket
	b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.BucketSize)))
	if c := h.oldbuckets; c != nil { // incremental growth is in progress
		// Growth has two triggers and two matching modes:
		// 1. too many elements, so the load factor is exceeded: the bucket count doubles;
		// 2. too many overflow buckets with sparsely filled slots: same-size growth, which compacts the map.

		// When doubling, the old array has half as many buckets as the new one,
		// so shift the mask right by one bit to locate the old bucket that holds the key.
		if !h.sameSizeGrow() {
			// There used to be half as many buckets; mask down one more power of two.
			m >>= 1
		}
		oldb := (*bmap)(add(c, (hash&m)*uintptr(t.BucketSize)))
		if !evacuated(oldb) { // the old bucket has not been evacuated yet
			b = oldb
		}
	}
	// Compute the tophash from the high 8 bits. A value below minTopHash gets minTopHash added,
	// because the smaller values are reserved for slot and evacuation states.
	top := tophash(hash)
	// walk the bucket and any overflow buckets chained to it
bucketloop:
	for ; b != nil; b = b.overflow(t) {
		// scan every slot, comparing tophash values first
		for i := uintptr(0); i < bucketCnt; i++ {
			if b.tophash[i] != top {
				if b.tophash[i] == emptyRest {
					break bucketloop
				}
				continue
			}
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.KeySize)) // address of the key whose tophash matched
			if t.IndirectKey() {
				k = *((*unsafe.Pointer)(k))
			}
			if t.Key.Equal(key, k) { // the keys themselves must also be equal
				// address of the corresponding value
				e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.KeySize)+i*uintptr(t.ValueSize))
				if t.IndirectElem() {
					e = *((*unsafe.Pointer)(e))
				}
				return e, true
			}
		}
	}
	return unsafe.Pointer(&zeroVal[0]), false
}

// returns both key and elem. Used by map iterator.
func mapaccessK(t *maptype, h *hmap, key unsafe.Pointer) (unsafe.Pointer, unsafe.Pointer) {
	if h == nil || h.count == 0 {
		return nil, nil
	}
	hash := t.Hasher(key, uintptr(h.hash0))
	m := bucketMask(h.B)
	b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.BucketSize)))
	if c := h.oldbuckets; c != nil {
		if !h.sameSizeGrow() {
			// There used to be half as many buckets; mask down one more power of two.
			m >>= 1
		}
		oldb := (*bmap)(add(c, (hash&m)*uintptr(t.BucketSize)))
		if !evacuated(oldb) {
			b = oldb
		}
	}
	top := tophash(hash)
bucketloop: // label that lets break leave the outer loop
	for ; b != nil; b = b.overflow(t) {
		for i := uintptr(0); i < bucketCnt; i++ {
			if b.tophash[i] != top {
				if b.tophash[i] == emptyRest {
					break bucketloop
				}
				continue
			}
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.KeySize))
			if t.IndirectKey() {
				k = *((*unsafe.Pointer)(k))
			}
			if t.Key.Equal(key, k) {
				e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.KeySize)+i*uintptr(t.ValueSize))
				if t.IndirectElem() {
					e = *((*unsafe.Pointer)(e))
				}
				return k, e
			}
		}
	}
	return nil, nil
}
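
At the language level, the two read forms handled by mapaccess1 and mapaccess2 look as follows. The helper lookup is a name made up for this sketch; it also reads from a nil map, which corresponds to the h == nil branch that returns the zero value.

```go
package main

import "fmt"

// lookup performs both read forms on the same key.
func lookup(m map[string]int, key string) (int, int, bool) {
	v1 := m[key]     // single-value form: a missing key yields the zero value
	v2, ok := m[key] // comma-ok form: ok reports whether the key is present
	return v1, v2, ok
}

func main() {
	m := map[string]int{"a": 1, "zero": 0}
	fmt.Println(lookup(m, "a"))
	fmt.Println(lookup(m, "zero"))
	fmt.Println(lookup(m, "missing"))

	var nilMap map[string]int // reading a nil map is allowed
	fmt.Println(lookup(nilMap, "a"))
}
```

Only the comma-ok form can tell a stored zero value ("zero") apart from an absent key ("missing").
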

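To summarize the lookup: pointer arithmetic locates the start of the key's bucket in the current array and, if the map is growing, of the corresponding old bucket. If the old bucket has not been evacuated yet, the lookup reads the old bucket; otherwise it reads the new one. Within the chosen bucket and its overflow chain, the lookup visits each slot and compares the slot's tophash with the key's tophash. On a match it compares the keys themselves, and if they are equal it returns the corresponding value.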
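
The lookup steps summarized above can be modeled with a toy map. This is a heavily simplified sketch, not the runtime code: it is fixed to string keys and int values, never grows or deletes, and uses FNV-1a from hash/fnv instead of the runtime's seeded hasher. The type and function names are invented for the example; the constants copy the runtime values.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	bucketCnt  = 8 // slots per bucket
	emptyRest  = 0 // this slot and every later slot in the chain are empty
	minTopHash = 5 // smaller tophash values are reserved for states
)

// bucket is a simplified, non-generic stand-in for the runtime's bmap.
type bucket struct {
	tophash  [bucketCnt]uint8
	keys     [bucketCnt]string
	elems    [bucketCnt]int
	overflow *bucket
}

// toyMap has 2^B buckets and never grows; extra keys go to overflow buckets.
type toyMap struct {
	B       uint8
	buckets []bucket
}

func newToyMap(B uint8) *toyMap {
	return &toyMap{B: B, buckets: make([]bucket, 1<<B)}
}

// hashKey returns the FNV-1a hash of key and its tophash (the top 8 bits,
// moved out of the reserved range).
func hashKey(key string) (hash uint64, top uint8) {
	h := fnv.New64a()
	h.Write([]byte(key))
	hash = h.Sum64()
	if top = uint8(hash >> 56); top < minTopHash {
		top += minTopHash
	}
	return hash, top
}

// Get walks the bucket chain, comparing tophash first and the key second.
func (m *toyMap) Get(key string) (int, bool) {
	hash, top := hashKey(key)
	for b := &m.buckets[hash&(uint64(1)<<m.B-1)]; b != nil; b = b.overflow {
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] != top {
				if b.tophash[i] == emptyRest {
					return 0, false // nothing is stored past this slot
				}
				continue
			}
			if b.keys[i] == key {
				return b.elems[i], true
			}
		}
	}
	return 0, false
}

// Put updates an existing key or fills the first empty slot, chaining a new
// overflow bucket when every slot in the chain is taken.
func (m *toyMap) Put(key string, elem int) {
	hash, top := hashKey(key)
	b := &m.buckets[hash&(uint64(1)<<m.B-1)]
	for {
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] == emptyRest {
				b.tophash[i], b.keys[i], b.elems[i] = top, key, elem
				return
			}
			if b.tophash[i] == top && b.keys[i] == key {
				b.elems[i] = elem
				return
			}
		}
		if b.overflow == nil {
			b.overflow = &bucket{}
		}
		b = b.overflow
	}
}

func main() {
	m := newToyMap(2) // 4 buckets, so 100 keys force overflow buckets
	for i := 0; i < 100; i++ {
		m.Put(fmt.Sprintf("key-%d", i), i)
	}
	fmt.Println(m.Get("key-42"))
	fmt.Println(m.Get("absent"))
}
```

Comparing the one-byte tophash before the full key keeps most slot checks cheap, and the emptyRest marker lets a miss stop early instead of scanning the whole chain.
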

mapassign

Next, the set operation. Its logic lives in the function mapassign:

// Like mapaccess, but allocates a slot for the key if it is not present in the map.
func mapassign(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
