Reply, Data Structures: Hash Buckets and Simulating the unordered Containers

Article link: https://blog.youkuaiyun.com/Allen9012/article/details/125535260


1. Implementing closed hashing

1.0 Basic structure

template <class K,class V>
struct HashData
{
    pair<K, V> _kv;
};

template <class K, class V>
class HashTable
{
public:
private:
    vector<HashData<K, V>> _table;
    size_t _n=0;  // number of valid elements stored
};

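When closed hashing resolves collisions, existing elements cannot simply be physically removed from the table: removing one directly would break the search for other elements. Linear probing therefore deletes an element by marking it, a pseudo-deletion.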
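To make the reason concrete, here is a minimal sketch of a probe chain being cut. The names Slot, SlotState, FindSlot and StillFinds11After are made up for this illustration and are not part of the article's HashTable; the table size of 10 and the keys 1 and 11 are also assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum SlotState { SLOT_EMPTY, SLOT_EXIST, SLOT_DELETE };

struct Slot {
    int key = 0;
    SlotState state = SLOT_EMPTY;
};

// Linear-probing lookup: an EMPTY slot marks the end of the probe chain.
// Returns the slot index, or -1 when the key is not found.
int FindSlot(const std::vector<Slot>& table, int key) {
    std::size_t start = static_cast<std::size_t>(key) % table.size();
    for (std::size_t i = 0; i < table.size(); ++i) {
        std::size_t index = (start + i) % table.size();
        if (table[index].state == SLOT_EMPTY) return -1;
        if (table[index].state == SLOT_EXIST && table[index].key == key)
            return static_cast<int>(index);
    }
    return -1;
}

// Keys 1 and 11 both hash to slot 1, so 11 was probed into slot 2.
// "Erase" key 1 by writing `marker` into its slot, then look for 11.
bool StillFinds11After(SlotState marker) {
    std::vector<Slot> table(10);
    table[1].key = 1;
    table[1].state = SLOT_EXIST;
    table[2].key = 11;
    table[2].state = SLOT_EXIST;
    table[1].state = marker;
    return FindSlot(table, 11) == 2;
}

void DemoTombstone() {
    assert(StillFinds11After(SLOT_DELETE));   // tombstone keeps the chain intact
    assert(!StillFinds11After(SLOT_EMPTY));   // physical removal hides key 11
}
```

Resetting the slot to EMPTY makes the lookup stop one step too early, which is exactly the failure the marker avoids.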

enum State
{
    EMPTY,
    EXIST,
    DELETE,
};

template <class K,class V>
struct HashData
{
    pair<K, V> _kv;
    State _state = EMPTY;  // empty by default
};

template <class K, class V>
class HashTable
{
public:
private:
    vector<HashData<K, V>> _table;
    size_t _n=0;  // number of valid elements stored
};

1.1 Insert

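Use the hash function to obtain the position of the element to insert in the hash table.

First, a question: which of the following should index be? (The underlying container is a vector.)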

size_t index = kv.first % _table.size();
size_t index = kv.first % _table.capacity();

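A vector only lets you access elements up to size(), not the whole capacity(), so taking the modulus by capacity() can produce an index beyond size(), which cannot be used: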

_table[index]= ...

So we use size().
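The difference is easy to check with std::vector itself. This is a small sketch; the two helper function names are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// reserve() only allocates storage: capacity() grows, size() stays 0,
// so no index is valid for operator[] yet.
std::size_t UsableSlotsAfterReserve() {
    std::vector<int> v;
    v.reserve(10);
    return v.size();
}

// resize() actually creates 10 value-initialized elements,
// so indices 0..9 are valid for operator[].
std::size_t UsableSlotsAfterResize() {
    std::vector<int> v;
    v.resize(10);
    return v.size();
}

void DemoSizeVsCapacity() {
    assert(UsableSlotsAfterReserve() == 0);
    assert(UsableSlotsAfterResize() == 10);
}
```

This is also why the table is grown with resize rather than reserve: every slot has to exist as an element before it can be indexed.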

bool Insert(const pair<K, V>& kv)
{	
    size_t start = kv.first % _table.size();
    size_t index = start;

    // probe the following slots: linear probing or quadratic probing
    size_t i = 1;
    while (_table[index]._state == EXIST)
    {
        index = start + i;
        index %= _table.size();
        ++i;
    }

    _table[index]._kv = kv;
    _table[index]._state = EXIST;
    ++_n;

    return true;
}

We still need to handle growth and duplicate keys.

// reject duplicates
HashData<K, V>* ret = Find(kv.first);
if (ret)
{
    return false;
}
// empty table
if (_table.size() == 0)
{
    _table.resize(10);
}
// load factor reaches 0.7
else if ((double)_n / (double)_table.size() >= 0.7)
{
    // grow
    vector<HashData<K, V>> newtable;
    newtable.resize(_table.size() * 2);
    for (auto& e : _table)
    {
        if (e._state == EXIST)
        {
            // recompute the position and place the element in newtable
            // the logic is similar to insertion
        }
    }
    _table.swap(newtable);
}

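When growing, the data has to be re-placed into newtable, and that logic looks just like the insertion logic, so the code would be duplicated. A better approach is to construct a new HashTable and reuse Insert. With a bare vector we could not call Insert, but with a table object we can. This is the modern way of writing the growth step.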

// empty table
if (_table.size()==0)
{
    _table.resize(10);
}
// load factor reaches 0.7
else if ((double)_n / (double)_table.size() >=0.7)
{
    // grow
    HashTable<K, V> newHT;
    newHT._table.resize(_table.size() * 2);
    for (auto& e:  _table)
    {
        if (e._state == EXIST)
        {
            newHT.Insert(e._kv);
        }
    }
    _table.swap(newHT._table);
}

1.1.1 The complete Insert

bool Insert(const pair<K, V>& kv)
{
    // reject duplicates
    auto ret = Find(kv.first);
    if (ret)
    {
        return false;
    }

    // empty table
    if (_table.size() == 0)
    {
        _table.resize(10);
    }
    // load factor reaches 0.7
    else if ((double)_n / (double)_table.size() >= 0.7)
    {
        // grow
        HashTable<K, V,HashFunc> newHT;
        newHT._table.resize(_table.size() * 2);
        for (auto& e : _table)
        {
            if (e._state == EXIST)
            {
                newHT.Insert(e._kv);
            }
        }
        _table.swap(newHT._table);
    }

    HashFunc hf;
    size_t start = hf(kv.first) % _table.size();
    size_t index = start;

    // probe the following slots: linear probing or quadratic probing
    size_t i = 1;
    while (_table[index]._state == EXIST)
    {
        index = start + i;
        index %= _table.size();
        ++i;
    }

    _table[index]._kv = kv;
    _table[index]._state = EXIST;
    ++_n;

    return true;
}
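The pieces so far can be assembled into something that compiles on its own. The following is a minimal sketch, not the article's final class: it lives in an invented namespace closed_hash_sketch, omits Erase, bounds the probe loop by the table size, and checks the 0.7 load factor with the integer comparison _n * 10 >= size * 7. Size(), SlotCount() and Example() are helpers added only for the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

namespace closed_hash_sketch {

enum State { EMPTY, EXIST, DELETE };

template <class K, class V>
struct HashData {
    std::pair<K, V> _kv;
    State _state = EMPTY;
};

template <class K>
struct Hash {
    std::size_t operator()(const K& key) const { return static_cast<std::size_t>(key); }
};

template <class K, class V, class HashFunc = Hash<K>>
class HashTable {
public:
    HashData<K, V>* Find(const K& key) {
        if (_table.empty()) return nullptr;
        HashFunc hf;
        std::size_t start = hf(key) % _table.size();
        for (std::size_t i = 0; i < _table.size(); ++i) {
            HashData<K, V>& slot = _table[(start + i) % _table.size()];
            if (slot._state == EMPTY) return nullptr;   // end of the probe chain
            if (slot._state == EXIST && slot._kv.first == key) return &slot;
        }
        return nullptr;
    }

    bool Insert(const std::pair<K, V>& kv) {
        if (Find(kv.first)) return false;               // reject duplicates
        if (_table.empty()) {
            _table.resize(10);
        } else if (_n * 10 >= _table.size() * 7) {      // load factor >= 0.7
            HashTable<K, V, HashFunc> bigger;           // reuse Insert to rehash
            bigger._table.resize(_table.size() * 2);
            for (const auto& e : _table)
                if (e._state == EXIST) bigger.Insert(e._kv);
            _table.swap(bigger._table);
        }
        HashFunc hf;
        std::size_t index = hf(kv.first) % _table.size();
        while (_table[index]._state == EXIST)           // linear probing
            index = (index + 1) % _table.size();
        _table[index]._kv = kv;
        _table[index]._state = EXIST;
        ++_n;
        return true;
    }

    std::size_t Size() const { return _n; }
    std::size_t SlotCount() const { return _table.size(); }

private:
    std::vector<HashData<K, V>> _table;
    std::size_t _n = 0;
};

inline void Example() {
    HashTable<int, int> ht;
    for (int i = 0; i < 20; ++i) {
        bool inserted = ht.Insert(std::make_pair(i * 10 + 1, i));
        assert(inserted);
    }
    assert(!ht.Insert(std::make_pair(11, 0)));   // duplicate key is rejected
    assert(ht.Size() == 20);
    assert(ht.SlotCount() == 40);                // grew 10 -> 20 -> 40
    assert(ht.Find(191) != nullptr);
    assert(ht.Find(2) == nullptr);
}

}  // namespace closed_hash_sketch
```

All twenty keys are congruent to 1 modulo 10, so the example also exercises long probe chains before the first growth.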

1.2 Find

For the size == 0 case, we can either add a check here or give the table some initial size in the constructor.

HashData<K,V>* Find(const K& key)
{
    if (_table.size() == 0)
    {
        return nullptr;
    }

    HashFunc hf;
    size_t start = hf(key) % _table.size();
    size_t index = start;
    size_t i = 1;
    while (_table[index]._state != EMPTY)
    {
        if ( _table[index]._kv.first == key)
        {
            return &_table[index];
        }

        index = start + i;
        index %= _table.size();
        ++i;
    }
    return nullptr;  // reached an EMPTY slot: the key is not in the table
}

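At this point there is still a problem. Suppose I erase 100 and then search for it again. Erase only changes the state marker, so the slot is marked DELETE, yet the lookup still finds it. We therefore need an extra condition in the comparison: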

if (_table[index]._state == EXIST
    && _table[index]._kv.first == key)
{
    return &_table[index];
}

1.3 Erase

bool Erase(const K& key)
{
    HashData<K, V>* ret = Find(key);
    if (ret == nullptr)
    {
        return false;
    }
    else
    {
        ret->_state = DELETE;
        return true;
    }
}

1.4 Handling string keys

The modulo operation used so far breaks when the key template argument is string, because there is no modulo for strings. A functor solves this:

template<class K>
struct int_HashFunc
{
    int operator()(int i)
    {
        return i;
    }
};

template<class K>
struct string_HashFunc
{
    size_t operator()(const string& s)
    {
        return s[0];
    }
};

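This is still not good: many strings share the same first character, so a table of strings would collide heavily. A better mapping converts the whole string into an integer, for example by summing the ASCII codes of its characters. (This is not a perfect answer either: strings can be arbitrarily long while integers have a limited range, so the value may overflow.)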

template<class K>
struct string_HashFunc
{
    size_t operator()(const string& s)
    {
        size_t value = 0;
        for (auto ch : s)
        {
            value += ch;
        }
        return value;
    }
};

1.4.1 BKDR

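That is still not good enough: "abcd", "cdba", and "adad" all have the same ASCII sum, so they land in the same slot. People therefore devised dedicated string hash algorithms. One of the best known is BKDR hash, which multiplies the running value by a constant before adding each character.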
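The collision can be checked directly. In this sketch SumHash and BKDRHashString are names chosen for the illustration; the second one uses the textbook form hash = hash * 131 + ch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Additive hash: the order of characters does not matter,
// so any two strings with the same character sum collide.
std::size_t SumHash(const std::string& s) {
    std::size_t value = 0;
    for (unsigned char ch : s) value += ch;
    return value;
}

// BKDR-style hash: multiplying by 131 before each addition
// makes the result depend on character positions.
std::size_t BKDRHashString(const std::string& s) {
    std::size_t value = 0;
    for (unsigned char ch : s) value = value * 131 + ch;
    return value;
}

void CompareHashes() {
    assert(SumHash("abcd") == SumHash("cdba"));
    assert(SumHash("abcd") == SumHash("adad"));
    assert(BKDRHashString("abcd") != BKDRHashString("cdba"));
    assert(BKDRHashString("abcd") != BKDRHashString("adad"));
}
```

For these short strings the BKDR value is just the string read as a base-131 number, which is why the three inputs cannot coincide.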

template<class T>  
size_t BKDRHash(const T *str)  
{  
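    size_t hash = 0;  // `register` dropped: the keyword was removed in C++17
    while (size_t ch = (size_t)*str++)  
    {         
        hash = hash * 131 + ch;   // 31, 131, 1313, 13131, 131313, ... also work as multipliers
        // Some suggest decomposing the multiplication into shifts and adds for speed,
        // e.g. hash = (hash << 7) + (hash << 1) + hash + ch;
        // On Intel platforms, however, the CPU handles both forms about equally well.
        // I ran each form 10 billion times and the time difference was essentially zero
        // (in a Debug build the shift version was even about 1/3 slower).
        // I have not tested RISC systems such as ARM. ARM uses Booth's Algorithm to
        // emulate 32-bit integer multiplication, so its cost depends on the multiplier:
        // if bits 8-31 of the multiplier are all 1 or all 0, it takes 1 clock cycle
        // if bits 16-31 of the multiplier are all 1 or all 0, it takes 2 clock cycles
        // if bits 24-31 of the multiplier are all 1 or all 0, it takes 3 clock cycles
        // otherwise it takes 4 clock cycles
        // So although I have not measured it, I still believe the two differ little in efficiency
    }  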
    }  
    return hash;  
}

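Here are some tests comparing different hashes: https://blog.youkuaiyun.com/icefireelf/article/details/5796529. In those tests BKDR comes out best, so we adopt BKDR hash. The functor then only needs one primary template, with specializations for the other key types.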

1.4.2 Implementing the Hash functor

template<class K>
struct Hash
{
    size_t operator()(const K& key)
    {
        return key;
    }
};

// specialization
template<>
struct Hash<string>
{
    size_t operator()(const string& s)
    {
        // BKDR Hash
        size_t value = 0;
        for (auto ch : s)
        {
            value += ch;
            value *= 131;
        }

        return value;
    }
};

template <class K, class V,class HashFunc=Hash<K>>
class HashTable
{
	...
}

void TestHashTable()
{
    string a[] = { "Pikachu", "Charizard", "Pikachu", "Charizard", "Pikachu", "Lucario", "Pikachu" };
    HashTable<string, int,Hash<string>> ht;
    for (auto str : a)
    {
        auto ret = ht.Find(str);
        if (ret)
        {
            ret->_kv.second++;
        }
        else
        {
            ht.Insert(make_pair(str, 1));
        }
    }
}

This part is much like overriding hashCode in Java: equality can depend on many conditions, so hash whatever the comparison needs.

struct pokemon
{
    // ...
};

struct PokemonHashFunc
{
    size_t operator()(const pokemon& kv)
    {
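        // For a struct key:
        // 1. if it has an integer field that is essentially unique, use it - the pokemon number
        // 2. if it has a string field that is essentially unique, use it - the pokemon name
        // 3. if no single field is unique, consider combining several fields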
        size_t value = 0;
        // ...
        return value;
    }
};

The unordered containers likewise accept a Hash functor as a template argument.
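With the standard library this looks roughly as follows. The Pokemon fields, PokemonHash and PokemonEqual are hypothetical names for this sketch; combining the two field hashes with a multiply-by-131 step is just one possible choice, borrowed from the BKDR idea above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Pokemon {          // hypothetical fields for the illustration
    int id;
    std::string name;
};

// Hash functor: combines the hashes of both fields.
struct PokemonHash {
    std::size_t operator()(const Pokemon& p) const {
        std::size_t value = std::hash<int>()(p.id);
        value = value * 131 + std::hash<std::string>()(p.name);
        return value;
    }
};

// Equality functor: must agree with the hash (equal objects hash equally).
struct PokemonEqual {
    bool operator()(const Pokemon& a, const Pokemon& b) const {
        return a.id == b.id && a.name == b.name;
    }
};

void DemoPokemonSet() {
    std::unordered_set<Pokemon, PokemonHash, PokemonEqual> seen;
    seen.insert(Pokemon{25, "Pikachu"});
    seen.insert(Pokemon{6, "Charizard"});
    seen.insert(Pokemon{25, "Pikachu"});   // duplicate, ignored
    assert(seen.size() == 2);
    assert(seen.count(Pokemon{6, "Charizard"}) == 1);
}
```

std::unordered_set takes the hash functor as its second template argument and the equality functor as its third, which is the same slot the HashFunc parameter occupies in the table built here.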

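2. Implementing open hashing

Open hashing is essentially an array of pointers combined with linked lists. That raises a question: when simulating it, can we use the library list, or should we implement the linked list ourselves? It is better to write our own, because the list iterator adds complications.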

2.0 HashNode

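Because this is an array of pointers, HashTable's private member would have to be written as a pointer to pointer, which looks cumbersome. Putting the pointers in a vector is a little nicer.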

template<class K,class V>
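struct HashNode
{
    HashNode<K, V>* _next;
    pair<K, V> _kv;

    // constructor needed by the `new Node(kv)` calls in Insert
    HashNode(const pair<K, V>& kv)
        : _next(nullptr)
        , _kv(kv)
    {}
};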

template<class K, class V>
class HashTable 
{
    typedef HashNode<K, V> Node;
public:
private:
    vector<Node*> _table;  // stores pointers
    size_t _n = 0;  // number of valid elements
};

2.1 Insert

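How do we implement insertion? It is actually simpler than with closed hashing.

In a hash table of size 7, the keys 42 and 38 get hash indices 0 and 3 respectively.

If we insert a new element 52, it also goes to the fourth slot, index 3, because 52 % 7 is 3.

In terms of efficiency, head insertion is the better choice, because tail insertion first has to traverse the chain to find the tail.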
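The 42, 38, 52 example can be replayed in a few lines. ChainNode, PushFront and FreeAll are names invented for this sketch; it uses plain int keys and skips the duplicate check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ChainNode {
    int key;
    ChainNode* next;
};

// O(1) head insertion into the bucket selected by key % bucket count.
void PushFront(std::vector<ChainNode*>& buckets, int key) {
    std::size_t index = static_cast<std::size_t>(key) % buckets.size();
    buckets[index] = new ChainNode{key, buckets[index]};
}

// Release every node so the demo does not leak.
void FreeAll(std::vector<ChainNode*>& buckets) {
    for (std::size_t i = 0; i < buckets.size(); ++i) {
        while (buckets[i]) {
            ChainNode* next = buckets[i]->next;
            delete buckets[i];
            buckets[i] = next;
        }
    }
}

void DemoHeadInsert() {
    std::vector<ChainNode*> buckets(7, nullptr);
    PushFront(buckets, 42);   // 42 % 7 == 0
    PushFront(buckets, 38);   // 38 % 7 == 3
    PushFront(buckets, 52);   // 52 % 7 == 3, lands in front of 38
    assert(buckets[0]->key == 42);
    assert(buckets[3]->key == 52);
    assert(buckets[3]->next->key == 38);
    assert(buckets[3]->next->next == nullptr);
    FreeAll(buckets);
}
```

The newest element ends up at the front of its chain, and no traversal was needed to place it.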

bool Insert(const pair<K, V>& kv)
{
    if (Find(kv.first))
    {
        return false;
    }
    size_t index = kv.first % _table.size();
    Node* newnode = new Node(kv);
    // head insertion; an empty bucket needs no special handling
    newnode->_next = _table[index];
    _table[index] = newnode;
    ++_n;
    return true;
}

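Next comes growth. When the load factor reaches 1, the table must grow to obtain more slots. We cannot simply hang each old chain at the same slot in the new table; every node has to be re-modded, as in insertion. Should we follow the closed hashing approach again and reuse Insert? That reuse has a drawback: Insert allocates new nodes and the old nodes then have to be deleted, which is wasteful. Instead we relink the existing nodes into the new table.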

bool Insert(const pair<K, V>& kv)
{
    // duplicate key: return false
    if (Find(kv.first))
        return false;

    // grow when the load factor reaches 1
    if (_n == _table.size())
    {
        vector<Node*> newtable;
        size_t new_size = _table.size() == 0 ? 10 : _table.size() * 2;
        newtable.resize(new_size);
        // recompute each old node's position and move it to the new table
        for (size_t i=0;i<_table.size();++i)
        {
            if (_table[i])
            {
                Node* cur = _table[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    size_t index = cur->_kv.first % newtable.size();
                    // head insertion
                    cur->_next = newtable[index];
                    newtable[index] = cur;
                    // advance in the old chain
                    cur = next;
                }
                _table[i] = nullptr;
            }
        }
        _table.swap(newtable);
    }

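    // load factor is below 1 at this point: link the node directly
    size_t index = kv.first % _table.size();
    Node* newnode = new Node(kv);
    // head insertion; an empty bucket needs no special handling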
    newnode->_next = _table[index];
    _table[index] = newnode;
    ++_n;
    return true;
}

A prime table could also be added for the bucket counts; it is not written out here.
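For reference, one way such a helper might look is sketched below. The article later calls GetNextPrime(_table.size()) without showing it, so the signature and behaviour here are assumptions: it returns the first listed prime strictly greater than the current bucket count. The values are the opening entries of the roughly doubling prime sequence familiar from SGI STL style hash tables, and the list is deliberately cut short.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: next bucket count, taken from a fixed list of primes
// where each entry is roughly twice the previous one.
std::size_t GetNextPrime(std::size_t current) {
    static const std::size_t primes[] = {
        53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157, 98317
    };
    const std::size_t count = sizeof(primes) / sizeof(primes[0]);
    for (std::size_t i = 0; i < count; ++i) {
        if (primes[i] > current) return primes[i];
    }
    return primes[count - 1];   // list exhausted: this sketch stops growing
}

void DemoNextPrime() {
    assert(GetNextPrime(0) == 53);     // empty table gets its first size
    assert(GetNextPrime(53) == 97);    // a full table moves to the next prime
    assert(GetNextPrime(100) == 193);
}
```

A prime bucket count is commonly chosen because it spreads keys that share a common factor more evenly than a power of two or a multiple of ten would.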

2.2 Find

Lookup is straightforward.

Node* Find(const K& key)
{
	if (_table.size() == 0)
	{
		return nullptr;
	}
    size_t index = key % _table.size();
    Node* cur = _table[index];
    while (cur)
    {
        if (cur->_kv.first == key)
        {
            return cur;
        }
        else
        {
            cur = cur->_next;
        }
    }
    return nullptr;
}

2.3 Erase

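In separate chaining, deleting by state marker is not the best choice (though it is possible). The ways to delete a node are:

The usual way keeps a prev pointer to the preceding node; this is the classic linked list deletion.

Some people instead suggest deletion by replacement. That method cannot directly delete the tail node, although the case can be worked around.

The classic method is used here.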
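For comparison, the replacement method can be sketched on a bare singly linked list. This is my reading of the technique (copy the successor's value into the node, then unlink and free the successor); ListNode and EraseByReplacement are invented names, and the tail case is simply reported as a failure.

```cpp
#include <cassert>

struct ListNode {
    int value;
    ListNode* next;
};

// Replacement-style deletion: overwrite `node` with its successor,
// then free the successor. No prev pointer is needed, but the tail
// node has no successor to copy from, so it cannot be removed this way.
bool EraseByReplacement(ListNode* node) {
    if (node == nullptr || node->next == nullptr) return false;
    ListNode* victim = node->next;
    node->value = victim->value;
    node->next = victim->next;
    delete victim;
    return true;
}

void DemoEraseByReplacement() {
    ListNode* c = new ListNode{3, nullptr};
    ListNode* b = new ListNode{2, c};
    ListNode* a = new ListNode{1, b};
    assert(EraseByReplacement(a));     // logical list is now 2 -> 3
    assert(a->value == 2);
    assert(a->next == c);
    assert(!EraseByReplacement(c));    // tail node: not supported
    delete c;
    delete a;
}
```

Note that the node object that survives is not the one that originally held the value, so any outside pointer or iterator to the successor is left dangling. That is one more reason to prefer the prev pointer version inside a hash table.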

bool Erase(const K& key)
{
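    // an empty table has no buckets; avoid taking the modulus by zero
    if (_table.size() == 0)
    {
        return false;
    }
    size_t index = key % _table.size();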
    Node* cur = _table[index];
    Node* prev=nullptr;
    while (cur)
    {
        if (cur->_kv.first==key)
        {
            if (_table[index]==cur)
            {
                _table[index] = cur->_next;
            }
            else
            {
                prev->_next = cur->_next;
            }
            delete cur;
            cur = nullptr;
            --_n;  // keep the element count in sync
            return true;
        }
        prev = cur;
        cur = cur->_next;
    }
    return false;
}

2.4 Hash functor

As before, a functor is needed here as well.

template<class K>
struct Hash
{
    size_t operator()(const K& key)
    {
        return key;
    }
};
// specialization
template<>
struct Hash<string>
{
    size_t operator()(const string& s)
    {
        // BKDR Hash
        size_t value = 0;
        for (auto ch : s)
        {
            value += ch;
            value *= 131;
        }

        return value;
    }
};

template<class K, class V,class HashFunc=Hash<K>>
class HashTable 
{
	...
}

2.5 iterator

The real difficulty in implementing unordered_map lies in the iterator, and that iterator is simply the HashTable iterator, so we implement it here.

2.5.1 Basic structure

template<class K, class T, class Key_Of_T, class HashFunc = Hash<K>>
struct __HTIterator
{
    typedef HashNode<T> Node;
    typedef __HTIterator<K, T, Key_Of_T, HashFunc> Self;
    typedef HashTable<K, T, Key_Of_T, HashFunc> HT;
    Node* _node;
    HT* _pht;
    __HTIterator(Node* node, HT* pht)
        :_node(node)
        ,_pht(pht)
    {}
};

A special situation arises: __HTIterator refers to HashTable, and HashTable also refers to __HTIterator. To resolve the circular reference, forward-declare HashTable before the iterator.

// forward declaration
template<class K, class T, class Key_Of_T, class HashFunc>
class HashTable;
// iterator class
template<class K, class T, class Key_Of_T, class HashFunc = Hash<K>>
struct __HTIterator
{
    typedef HashNode<T> Node;
    typedef __HTIterator<K, T, Key_Of_T, HashFunc> Self;
    typedef HashTable<K, T, Key_Of_T, HashFunc>   HT;
    Node* _node;
    HT* _pht;
    __HTIterator(Node* node, HT* pht)
        :_node(node)
        , _pht(pht)
    {}
  ...
};
// HashTable class
template<class K, class T, class Key_Of_T,class HashFunc =Hash<K>>
class HashTable 
{
    typedef HashNode<T> Node;
    // friend declaration: its parameter names must differ from the enclosing template's
    template<class K1, class T1, class Key_Of_T1, class HashFunc1>
    friend struct __HTIterator;
    typedef __HTIterator<K, T, Key_Of_T, HashFunc> iterator;
    //...
};

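For why the class template friend is needed, see operator++. Because the friend is a class template, the friend declaration must carry that template's parameter list. The list cannot contain class HashFunc = Hash<K>, because a friend class template declaration may not specify default template arguments.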
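The same arrangement can be reduced to a toy pair of templates. Box and BoxCursor are invented for this sketch and have nothing to do with hashing; they only show the forward declaration, the pointer to the not yet defined class, and a friend template declaration with its own parameter name and no default argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <class T>
class Box;   // forward declaration: BoxCursor only needs the name here

template <class T>
struct BoxCursor {
    Box<T>* _owner;        // a pointer to an incomplete type is allowed
    std::size_t _pos;

    BoxCursor(Box<T>* owner, std::size_t pos) : _owner(owner), _pos(pos) {}

    // Touches Box's private member, which is why Box declares us a friend.
    // The body is only instantiated after Box is fully defined.
    T& operator*() { return _owner->_items[_pos]; }
};

template <class T>
class Box {
    // Friend template: a fresh parameter name, and no default argument.
    template <class U>
    friend struct BoxCursor;

public:
    void Add(const T& value) { _items.push_back(value); }
    BoxCursor<T> First() { return BoxCursor<T>(this, 0); }

private:
    std::vector<T> _items;
};

void DemoFriendTemplate() {
    Box<int> box;
    box.Add(7);
    assert(*box.First() == 7);
    *box.First() = 9;              // the cursor writes through to the Box
    assert(*box.First() == 9);
}
```

Reusing the enclosing template's parameter name in the friend declaration is rejected by conforming compilers, although some accept it as an extension.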

2.5.2 begin() and end()

typedef __HTIterator<K, T, Key_Of_T, HashFunc> iterator;

iterator begin()
{
    size_t i = 0;
    while (i<_table.size())
    {
        if(_table[i])
        {
            return iterator(_table[i], this);
        }
        ++i;
    }
    return end();
}

iterator end()
{
    return iterator(nullptr, this);
}

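2.5.3 operator++ and operator--

The hard part of the iterator is operator++ and operator--. After ++, if the current bucket has been exhausted, how do we reach the next bucket?

operator++ has to handle that case: once one hash bucket is finished, move on to the next one. To locate the next bucket we need the current _table[index], and therefore the current HashTable object. So the iterator uses friendship to read the table's private _table and its size, and its constructor takes a pointer to the table object to identify it.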

Self& operator++()
{
    // 1. the current bucket still has nodes: just step forward
    if (_node->_next)
    {
        _node = _node->_next;
    }
    // 2. the current bucket is exhausted
    else
    {
        // move on to the next bucket
        Key_Of_T kot;
        HashFunc hf;
        size_t index = hf(kot(_node->_data)) % _pht->_table.size();
        ++index;
        // find the next non-empty bucket
        while (index < _pht->_table.size())
        {
            if (_pht->_table[index])
            {
                _node = _pht->_table[index];
                return *this;
            }
            else
            {
                ++index;
            }
        }
        _node = nullptr;
    }
    return *this;
}

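Should operator-- be implemented? The library does not provide operator-- either, nor rbegin and rend, which means it has no reverse iterators, so -- is generally not supported.

Supporting operator-- would probably require doubly linked chains.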
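To see the forward traversal run end to end, here is a stripped-down, non-template sketch. It stores plain int values, has no growth and no duplicate check, and all names (iter_sketch, IntBuckets, Collect) are invented for the illustration; only the shape of operator++ mirrors the article's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace iter_sketch {

struct Node {
    int _data;
    Node* _next;
};

class IntBuckets;   // forward declaration, as with HashTable above

struct Iterator {
    Node* _node;
    const IntBuckets* _pht;
    Iterator(Node* node, const IntBuckets* pht) : _node(node), _pht(pht) {}
    int operator*() const { return _node->_data; }
    bool operator!=(const Iterator& other) const { return _node != other._node; }
    Iterator& operator++();   // defined once IntBuckets is complete
};

class IntBuckets {
    friend struct Iterator;   // operator++ reads _table
public:
    explicit IntBuckets(std::size_t bucket_count) : _table(bucket_count, nullptr) {}
    IntBuckets(const IntBuckets&) = delete;
    IntBuckets& operator=(const IntBuckets&) = delete;
    ~IntBuckets() {
        for (Node* head : _table)
            while (head) { Node* next = head->_next; delete head; head = next; }
    }
    void Insert(int value) {   // head insertion into bucket value % size
        std::size_t index = static_cast<std::size_t>(value) % _table.size();
        _table[index] = new Node{value, _table[index]};
    }
    Iterator begin() const {
        for (Node* head : _table)
            if (head) return Iterator(head, this);
        return end();
    }
    Iterator end() const { return Iterator(nullptr, this); }
private:
    std::vector<Node*> _table;
};

inline Iterator& Iterator::operator++() {
    if (_node->_next) {                       // 1. more nodes in this bucket
        _node = _node->_next;
        return *this;
    }
    // 2. bucket exhausted: recompute our bucket index, then scan forward
    std::size_t index = static_cast<std::size_t>(_node->_data) % _pht->_table.size();
    for (++index; index < _pht->_table.size(); ++index) {
        if (_pht->_table[index]) { _node = _pht->_table[index]; return *this; }
    }
    _node = nullptr;                          // no bucket left: becomes end()
    return *this;
}

inline std::vector<int> Collect(const IntBuckets& buckets) {
    std::vector<int> out;
    for (Iterator it = buckets.begin(); it != buckets.end(); ++it) out.push_back(*it);
    return out;
}

inline void Example() {
    IntBuckets buckets(7);
    buckets.Insert(42);   // bucket 0
    buckets.Insert(38);   // bucket 3
    buckets.Insert(52);   // bucket 3, in front of 38
    buckets.Insert(5);    // bucket 5
    std::vector<int> expected = {42, 52, 38, 5};
    assert(Collect(buckets) == expected);
}

}  // namespace iter_sketch
```

The traversal order follows bucket order and then chain order, not insertion order, which is where the "unordered" in the container names comes from.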

2.5.4 Other operators

T& operator*()
{
    return _node->_data;
}

T* operator->()
{
    return &_node->_data;
}

bool operator != (const Self& s) const
{
    return _node != s._node;
}

bool operator == (const Self& s) const
{
    return _node == s._node;
}

2.6 Iterator-based CRUD operations

2.6.1 Insert

pair<iterator,bool> Insert(const T& data)
{
    Key_Of_T kot;
    // duplicate key: return the existing position and false
    auto ret = Find(kot(data));
    if (ret != end())
    {
        return make_pair(ret, false);
    }

    HashFunc hf;
    // grow when the load factor reaches 1
    if (_n == _table.size())
    {
        vector<Node*> newtable;
        newtable.resize(GetNextPrime(_table.size()));
        // recompute each old node's position and move it to the new table
        for (size_t i=0;i<_table.size();++i)
        {
            if (_table[i])
            {
                Node* cur = _table[i];
                while (cur)
                {
                    Node* next = cur->_next;
                    size_t index = hf(kot(cur->_data)) % newtable.size();
                    // head insertion
                    cur->_next = newtable[index];
                    newtable[index] = cur;
                    // advance in the old chain
                    cur = next;
                }
                _table[i] = nullptr;
            }
        }
        _table.swap(newtable);
    }

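    // load factor is below 1 at this point: link the node directly
    size_t index = hf(kot(data)) % _table.size(); 
    Node* newnode = new Node(data);
    // head insertion; an empty bucket needs no special handling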
    newnode->_next = _table[index];
    _table[index] = newnode;
    ++_n;
    return make_pair(iterator(newnode,this), true);
}

2.6.2 Find

iterator Find(const K& key)
{
    if (_table.size() == 0)
    {
        return end();
    }

    Key_Of_T kot;
    HashFunc hf;
    size_t index = hf(key) % _table.size();
    Node* cur = _table[index];
    while (cur)
    {
        if (kot(cur->_data) == key)
        {
            return iterator(cur,this);
        }
        else
        {
            cur = cur->_next;
        }
    }
    return end();
}

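2.7 Default constructor

Normally we would not need to write this ourselves, but once a copy constructor is declared the compiler no longer generates a default constructor, so we must at least declare it explicitly.

HashTable() = default;  // explicitly defaulted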
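The rule can be verified with type traits. OnlyCopy and CopyAndDefault are throwaway types for this sketch.

```cpp
#include <cassert>
#include <type_traits>

// Declaring any constructor, including a copy constructor,
// suppresses the implicitly generated default constructor.
struct OnlyCopy {
    OnlyCopy(const OnlyCopy&) {}
};

// `= default` asks the compiler to generate it again.
struct CopyAndDefault {
    CopyAndDefault() = default;
    CopyAndDefault(const CopyAndDefault&) {}
};

static_assert(!std::is_default_constructible<OnlyCopy>::value,
              "OnlyCopy has no default constructor");
static_assert(std::is_default_constructible<CopyAndDefault>::value,
              "CopyAndDefault has a default constructor again");

void DemoDefaultedConstructor() {
    CopyAndDefault a;        // compiles thanks to `= default`
    CopyAndDefault b(a);     // the user-written copy constructor
    (void)b;
    assert(std::is_default_constructible<CopyAndDefault>::value);
}
```

Without the defaulted constructor, a wrapper such as unordered_map could not even declare its _ht member without initializing it from another table.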

2.8 Destructor

// destructor
~HashTable()
{
    for (size_t i = 0; i < _table.size(); ++i)
    {
        Node* cur = _table[i];
        while (cur)
        {
            Node* next = cur->_next;
            delete cur;
            cur = next;
        }
        _table[i] = nullptr;
    }
}

2.9 Copy constructor

// copy constructor
HashTable(const HashTable& ht)	// inside the class, the template arguments may be omitted
{
    _n = ht._n;
    _table.resize(ht._table.size());
    for (size_t i = 0; i < ht._table.size(); i++)
    {
        Node* cur = ht._table[i];
        while (cur)
        {
            Node* copy = new Node(cur->_data);
            // head insertion into the new table
            copy->_next = _table[i];
            _table[i] = copy;

            cur = cur->_next;
        }
    }
}

2.10 Assignment operator overload

// assignment overload (copy-and-swap)
HashTable& operator=(HashTable ht)
{
    _table.swap(ht._table);
    swap(_n, ht._n);

    return *this;
}

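As a result, the map and set wrappers do not need to write these themselves: their compiler-generated versions work and call the ones defined here.

3. Wrapping the table into the unordered containers

3.1 Modifying HashTable

The wrapping here closely resembles what was done for map and set.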

template<class K, class T, class Key_Of_T,class HashFunc =Hash<K>>
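The point of the extra Key_Of_T parameter is that the table only knows the stored type T and needs a uniform way to pull the key out of it. A small sketch of the idea, using a vector instead of a hash table and invented names (SetKey, MapKey, FindByKey):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// For a set, the stored value is the key itself.
template <class K>
struct SetKey {
    const K& operator()(const K& key) const { return key; }
};

// For a map, the key is the first member of the stored pair.
template <class K, class V>
struct MapKey {
    const K& operator()(const std::pair<K, V>& kv) const { return kv.first; }
};

// Generic lookup that works for both layouts: it never inspects T
// directly, it only asks KeyOfT for the key.
template <class K, class T, class KeyOfT>
const T* FindByKey(const std::vector<T>& items, const K& key) {
    KeyOfT kot;
    for (const T& item : items) {
        if (kot(item) == key) return &item;
    }
    return nullptr;
}

void DemoKeyOfT() {
    std::vector<int> set_items = {1, 2, 3};
    const int* found = FindByKey<int, int, SetKey<int>>(set_items, 2);
    assert(found != nullptr);
    assert(*found == 2);

    std::vector<std::pair<std::string, int>> map_items = {{"a", 1}, {"b", 2}};
    const std::pair<std::string, int>* hit =
        FindByKey<std::string, std::pair<std::string, int>,
                  MapKey<std::string, int>>(map_items, "b");
    assert(hit != nullptr);
    assert(hit->second == 2);
}
```

The same single implementation serves both containers, which is what lets one HashTable back both unordered_map and unordered_set.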

3.2 unordered_map

3.2.1 Basic structure

The functor here is again much like the one used for map.

template<class K,class V>
class unordered_map
{
    struct Map_Key_Of_T 
    {
        const K& operator()(const pair<K, V>& kv)
        {
            return kv.first;
        }
    };
public:
 
private:
    Open_Hash::HashTable<K, pair<K, V>, Map_Key_Of_T> _ht;
};

3.2.2 insert

pair<iterator,bool> insert(const pair<K, V>& kv)
{
    return _ht.Insert(kv);
}

3.2.3 iterator

typedef typename Open_Hash::HashTable<K, pair<K,V>, Map_Key_Of_T>::iterator iterator;
iterator begin()
{
    return _ht.begin();
}

iterator end()
{
    return _ht.end();
}

3.2.4 operator[]

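map has a dedicated operator[]. To implement it here, Insert and related functions must first be changed so that Insert returns pair<iterator, bool>.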

V& operator[](const K& key)
{
    pair<iterator, bool> ret = _ht.Insert(make_pair(key, V()));
    return ret.first->second;
}
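The same insert-then-dereference pattern can be tried against std::unordered_map, whose insert also returns a pair of iterator and bool. AtOrInsert is a name made up for this sketch; it is a stand-in for operator[], not a library function.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Either inserts (key, int()) or finds the existing entry;
// in both cases ret.first points at the element for `key`.
int& AtOrInsert(std::unordered_map<std::string, int>& m, const std::string& key) {
    std::pair<std::unordered_map<std::string, int>::iterator, bool> ret =
        m.insert(std::make_pair(key, int()));
    return ret.first->second;
}

void DemoWordCount() {
    const char* names[] = {"Pikachu", "Charizard", "Pikachu", "Charizard",
                           "Pikachu", "Lucario", "Pikachu"};
    std::unordered_map<std::string, int> counts;
    for (const char* name : names) {
        ++AtOrInsert(counts, name);   // new keys start at 0, then get incremented
    }
    assert(counts.size() == 3);
    assert(counts["Pikachu"] == 4);
    assert(counts["Charizard"] == 2);
    assert(counts["Lucario"] == 1);
}
```

Because a failed insert still returns an iterator to the existing element, one code path covers both the "first time seen" and the "seen before" case.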

3.3 unordered_set

3.3.1 Basic structure

template<class K>
class unordered_set
{
    struct Set_Key_Of_T
    {
        const K& operator()(const K& key)
        {
            return key;
        }
    };
public:
    bool insert(const K& key)
    {
        // Insert returns pair<iterator, bool>; report whether the key was newly inserted
        return _ht.Insert(key).second;
    }
private:
    Open_Hash::HashTable<K, K, Set_Key_Of_T> _ht;
};

3.3.2 insert

bool insert(const K& key)
{
    return _ht.Insert(key).second;
}

3.3.3 iterator

typedef typename Open_Hash::HashTable<K, K, Set_Key_Of_T>::iterator iterator;

iterator begin()
{
    return _ht.begin();
}

iterator end()
{
    return _ht.end();
}