Python dictionary (map) implementation

Reposted from: http://www.laurentluce.com/posts/python-dictionary-implementation/


This post describes how dictionaries are implemented in the Python language.

Dictionaries are indexed by keys and they can be seen as associative arrays. Let’s add 3 key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

The values can be accessed this way:

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

The key ‘d’ does not exist so a KeyError exception is raised.

Hash tables

Python dictionaries are implemented using hash tables. A hash table is an array whose indexes are obtained by applying a hash function to the keys. The goal of a hash function is to distribute the keys evenly in the array. A good hash function minimizes the number of collisions, i.e. different keys mapping to the same slot.

We are going to assume that we are using strings as keys for the rest of this post. The hash function for strings in Python is defined as:

arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

If you run hash(‘a’) in Python 2, it will execute string_hash() and return 12416037344. Here we assume we are using a 64-bit machine.
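The pseudocode can be tried out with a minimal Python sketch. It assumes a 64-bit C long, takes a byte string, and leaves out the hash caching; the names `MASK64` and `string_hash` are chosen here for illustration and this is not the CPython source.

```python
MASK64 = (1 << 64) - 1  # assumption: arithmetic wraps like a 64-bit C long


def string_hash(s: bytes) -> int:
    """Sketch of the string hash described above (caching omitted)."""
    if not s:
        return 0
    x = (s[0] << 7) & MASK64
    for byte in s:
        # Multiply, xor with the current character, keep 64 bits.
        x = ((1000003 * x) ^ byte) & MASK64
    x ^= len(s)
    # Reinterpret the 64-bit pattern as a signed value.
    if x >= 1 << 63:
        x -= 1 << 64
    return x


print(string_hash(b"a"))  # 12416037344
```

The printed value matches the hash quoted for ‘a’ on a 64-bit machine.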

If an array of size x is used to store the key/value pairs then we use a mask equal to x-1 to calculate the slot index of the pair in the array. For example, if the size of the array is 8, the index for ‘a’ will be: hash(‘a’) & 7 = 0. The index for ‘b’ is 3 and the index for ‘c’ is 2. The index for ‘z’ is 3, which is the same as ‘b’: here we have a collision.
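The masking step is small enough to check directly. The sketch below uses the hash values quoted in this post; `HASHES` and `slot_index` are illustrative names, not part of CPython.

```python
# Hash values as quoted in this post (64-bit machine).
HASHES = {
    "a": 12416037344,
    "b": 12544037731,
    "c": 12672038114,
    "z": 15616046971,
}


def slot_index(hash_value: int, table_size: int) -> int:
    """Initial slot for a hash in a table whose size is a power of two."""
    mask = table_size - 1
    return hash_value & mask


indexes = {key: slot_index(h, 8) for key, h in HASHES.items()}
print(indexes)  # {'a': 0, 'b': 3, 'c': 2, 'z': 3}
```

Because the size is a power of two, `hash & mask` gives the same result as `hash % size`, only cheaper.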

[Figure: hash table]

We can see that the Python hash function does a good job when the keys are consecutive, which is good because it is quite common to have this type of data to work with. However, once we add the key ‘z’, there is a collision because ‘z’ is not close enough to the other keys.

We could use a linked list to store the pairs having the same hash but it would increase the lookup time, i.e. not O(1) anymore. The next section describes the collision resolution method used in the case of Python dictionaries.

Open addressing

Open addressing is a method of collision resolution where probing is used. In the case of ‘z’, the slot index 3 is already used in the array so we need to probe for a different index to find one which is not already used. Adding a key/value pair might take more time because of the probing, but the lookup stays O(1) on average, which is the desired behavior.

The probing sequence used to find a free slot is driven by a recurrence on the slot index, perturbed by the upper bits of the hash. The code is the following:

i is the current slot index
set perturb to hash
forever loop:
  set i to (i << 2) + i + perturb + 1
  set slot index to i & mask
  if slot is free:
      return it
  right shift perturb by 5 bits

Let’s see how this probing works for ‘z’ when we start with i = 3:
3 -> 3 -> 3 -> 5 -> 5 -> 6 -> 0…
This is not a really good example because we are using a small table of size 8, and this probing starts showing its advantages on larger tables. In our case, slot 3 comes up again before index 5 does; 5 is free so it is picked for key ‘z’.

Just out of curiosity, let’s look at the probing sequence when the table size is 32, i.e. mask = 31, still starting from i = 3:
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

You can read more about this probing sequence by looking at the source code of dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.
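To experiment with the probing recurrence, here is a small generator sketch. It assumes the perturb value starts as the full hash and is shifted after each probe, as in the pseudocode above; `probe_sequence` and `HASH_Z` are names made up for this example.

```python
from itertools import islice

PERTURB_SHIFT = 5


def probe_sequence(start: int, hash_value: int, mask: int):
    """Yield the starting slot, then every slot visited while probing."""
    i = start
    perturb = hash_value
    yield i & mask
    while True:
        i = (i << 2) + i + perturb + 1
        yield i & mask
        perturb >>= PERTURB_SHIFT


HASH_Z = 15616046971  # hash of 'z' as quoted in this post

print(list(islice(probe_sequence(3, HASH_Z, 7), 7)))
# [3, 3, 3, 5, 5, 6, 0]
print(list(islice(probe_sequence(3, HASH_Z, 31), 11)))
# [3, 11, 19, 29, 5, 6, 16, 31, 28, 13, 2]
```

Once perturb has been shifted down to 0, the recurrence reduces to i = 5 * i + 1, which visits every slot of a power-of-two table, so a free slot is always found eventually.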

[Figure: open addressing]

Now, let’s look at the Python internal code along with an example.

Dictionary C structures

The following C structure is used to store a dictionary entry: key/value pair. The hash, key and value are stored. PyObject is the base class of the Python objects.

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

The following structure represents a dictionary. ma_fill is the number of used slots + dummy slots. A slot is marked dummy when a key/value pair is removed. ma_used is the number of used slots (active). ma_mask is equal to the array’s size minus 1 and is used to calculate the slot index. ma_table is the array and ma_smalltable is the initial array of size 8.

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Dictionary initialization

When you first create a dictionary, the function PyDict_New() is called. I removed some of the lines and converted the C code to pseudocode to concentrate on the key concepts.

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_mask) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object
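As a rough mental model, the two C structures and the initialization step can be mirrored in Python. This is only an illustrative sketch: the classes `DictEntry` and `DictObject` and the function `dict_new` are invented here, and the lookup function pointer and the small-table optimization are left out.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

PyDict_MINSIZE = 8


@dataclass
class DictEntry:
    """Mirrors PyDictEntry: cached hash, key and value."""
    me_hash: int = 0
    me_key: Optional[Any] = None
    me_value: Optional[Any] = None


@dataclass
class DictObject:
    """Mirrors the PyDictObject fields discussed in this post."""
    ma_fill: int = 0                   # active + dummy slots
    ma_used: int = 0                   # active slots
    ma_mask: int = PyDict_MINSIZE - 1  # table size - 1
    ma_table: List[DictEntry] = field(
        default_factory=lambda: [DictEntry() for _ in range(PyDict_MINSIZE)]
    )


def dict_new() -> DictObject:
    """Rough counterpart of PyDict_New(): an empty table of 8 slots."""
    return DictObject()


d = dict_new()
print(d.ma_fill, d.ma_used, d.ma_mask, len(d.ma_table))  # 0 0 7 8
```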

Adding items

When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair. It checks if the key is a string and calculates the hash or reuses the cached one if it exists. insertdict() is called to add the new key/value pair and the dictionary is resized if the number of filled slots (used plus dummy, ma_fill) is greater than 2/3 of the array’s size.
Why 2/3? It is to make sure the probing sequence can find a free slot fast enough. We will look at the resizing function later.

arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    set mp to point to dictionary object
    if key's hash cached:
        use hash
    else:
        calculate hash
    set n_used to dictionary's number of active slots (ma_used)
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table
    return 0
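The 2/3 test can be written with integer arithmetic only. The sketch below assumes the check compares filled slots (active plus dummy) against the table size; `needs_resize` is a hypothetical helper name.

```python
def needs_resize(ma_fill: int, table_size: int) -> bool:
    """True once the filled slots reach 2/3 of the table."""
    # Integer form of ma_fill / table_size >= 2 / 3, no floats needed.
    return ma_fill * 3 >= table_size * 2


print([needs_resize(n, 8) for n in range(1, 7)])
# [False, False, False, False, False, True]
```

For a table of 8 slots the check first succeeds when the sixth slot is filled, which is the point where the walkthrough later in this post triggers a resize.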

insertdict() uses the lookup function to find a free slot. This is the next function we are going to examine. lookdict_string() calculates the slot index using the hash and the mask values. If it cannot find the key in the slot index = hash & mask, it probes using the perturb loop we saw above.

arguments: dictionary object, key, hash
returns: dictionary entry
function lookdict_string:
    calculate slot index i based on hash and mask
    if slot's key matches or slot's key is not set:
        return slot's entry
    if slot's key marked as dummy (was active):
        set freeslot to this slot's entry
    else:
        if slot's hash equals to hash and slot's key equals to key:
            return slot's entry
        set var freeslot to null
    we are here because we couldn't find the key so we start probing
    set perturb to hash
    forever loop:
        set i to (i << 2) + i + perturb + 1
        calculate slot index based on i and mask
        if slot's key is null:
            if freeslot is null:
                return slot's entry
            else:
                return freeslot
        if slot's key is the same object as key, or slot's hash equals to hash
            and slot is not marked as dummy and slot's key equals to key:
            return slot's entry
        if slot marked as dummy and freeslot is null:
            set freeslot to slot's entry
        right shift perturb by 5 bits
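The lookup logic can be condensed into a short Python sketch. It is a simplification under stated assumptions: the table is a plain list of `[hash, key, value]` lists, an unused slot has key `None`, `DUMMY` is a made-up sentinel standing in for the dummy key, and the first check and the probing loop are folded into one loop.

```python
DUMMY = object()  # hypothetical stand-in for the dummy key
PERTURB_SHIFT = 5


def lookdict(table, key, hash_value):
    """Return the index of the slot holding key, or of the slot to insert into."""
    mask = len(table) - 1
    i = hash_value & mask
    perturb = hash_value
    freeslot = None
    while True:
        slot_hash, slot_key, _ = table[i & mask]
        if slot_key is None:
            # End of the probe chain: prefer the first dummy slot seen.
            return (i & mask) if freeslot is None else freeslot
        if slot_key is DUMMY:
            if freeslot is None:
                freeslot = i & mask
        elif slot_key is key or (slot_hash == hash_value and slot_key == key):
            return i & mask
        i = (i << 2) + i + perturb + 1
        perturb >>= PERTURB_SHIFT


table = [[0, None, None] for _ in range(8)]
table[3] = [12544037731, "b", 2]
print(lookdict(table, "b", 12544037731))  # 3
print(lookdict(table, "z", 15616046971))  # 5
```

The loop relies on the table always keeping at least one unused slot, which the 2/3 fill limit guarantees.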

We want to add the following key/value pairs: {‘a’: 1, ‘b’: 2, ‘z’: 26, ‘y’: 25, ‘c’: 3, ‘x’: 24}. This is what happens:

A dictionary structure is allocated with internal table size of 8.

  • PyDict_SetItem: key = ‘a’, value = 1
    • hash = hash(‘a’) = 12416037344
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12416037344 & 7 = 0
        • slot 0 is not used so return it
      • init entry at index 0 with key, value and hash
      • ma_used = 1, ma_fill = 1
  • PyDict_SetItem: key = ‘b’, value = 2
    • hash = hash(‘b’) = 12544037731
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12544037731 & 7 = 3
        • slot 3 is not used so return it
      • init entry at index 3 with key, value and hash
      • ma_used = 2, ma_fill = 2
  • PyDict_SetItem: key = ‘z’, value = 26
    • hash = hash(‘z’) = 15616046971
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15616046971 & 7 = 3
        • slot 3 is used so probe for a different slot: 5 is free
      • init entry at index 5 with key, value and hash
      • ma_used = 3, ma_fill = 3
  • PyDict_SetItem: key = ‘y’, value = 25
    • hash = hash(‘y’) = 15488046584
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15488046584 & 7 = 0
        • slot 0 is used so probe for a different slot: 1 is free
      • init entry at index 1 with key, value and hash
      • ma_used = 4, ma_fill = 4
  • PyDict_SetItem: key = ‘c’, value = 3
    • hash = hash(‘c’) = 12672038114
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12672038114 & 7 = 2
        • slot 2 is free so return it
      • init entry at index 2 with key, value and hash
      • ma_used = 5, ma_fill = 5
  • PyDict_SetItem: key = ‘x’, value = 24
    • hash = hash(‘x’) = 15360046201
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15360046201 & 7 = 1
        • slot 1 is used so probe for a different slot: 7 is free
      • init entry at index 7 with key, value and hash
      • ma_used = 6, ma_fill = 6

This is what we have so far:

[Figure: Python dictionary table after the inserts]
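The table layout at this point can be reproduced with a short simulation. This sketch uses the hash values from the walkthrough and a simplified `find_slot` helper (a name invented here) that ignores deletions.

```python
# (key, value, hash) triples with the hash values quoted in the walkthrough.
PAIRS = [
    ("a", 1, 12416037344),
    ("b", 2, 12544037731),
    ("z", 26, 15616046971),
    ("y", 25, 15488046584),
    ("c", 3, 12672038114),
    ("x", 24, 15360046201),
]


def find_slot(table, key, hash_value):
    """Probe until the key or an unused slot is found (no dummy handling)."""
    mask = len(table) - 1
    i = hash_value & mask
    perturb = hash_value
    while table[i & mask] is not None and table[i & mask][0] != key:
        i = (i << 2) + i + perturb + 1
        perturb >>= 5
    return i & mask


table = [None] * 8
for key, value, hash_value in PAIRS:
    table[find_slot(table, key, hash_value)] = (key, value, hash_value)

for index, entry in enumerate(table):
    print(index, entry[:2] if entry else "<free>")
```

Slots 0, 1, 2, 3, 5 and 7 end up holding ‘a’, ‘y’, ‘c’, ‘b’, ‘z’ and ‘x’, while slots 4 and 6 stay free.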

6 of the 8 slots are used now, so we are over 2/3 of the array’s capacity. dictresize() is called to allocate a larger array. This function also takes care of copying the old table entries to the new table.

dictresize() is called with minused = 24 in our case which is 4 * ma_used. 2 * ma_used is used when the number of used slots is very large (greater than 50000). Why 4 times the number of used slots? It reduces the number of resize steps and it increases sparseness.

The new table size needs to be greater than 24. It is calculated by starting from the minimum size of 8 (which is also the current size here) and shifting it 1 bit left until it is greater than 24. It ends up being 32, i.e. 8 -> 16 -> 32.
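The size calculation is easy to check numerically. The sketch below assumes the doubling starts from the minimum table size of 8; `min_used_target` and `new_table_size` are illustrative names.

```python
PyDict_MINSIZE = 8


def min_used_target(ma_used: int) -> int:
    """Value passed to dictresize(): 4x the used slots, 2x for huge dicts."""
    return (2 if ma_used > 50000 else 4) * ma_used


def new_table_size(minused: int) -> int:
    """Smallest power-of-two size, at least 8, strictly greater than minused."""
    newsize = PyDict_MINSIZE
    while newsize <= minused:
        newsize <<= 1
    return newsize


print(min_used_target(6))  # 24
print(new_table_size(24))  # 32
```

Note that the loop condition is "less than or equal", so a target of exactly 32 would lead to a table of 64 slots.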

arguments: dictionary object, (2 or 4) * active slots
returns: 0 if OK, -1 otherwise
function dictresize:
    calculate new dictionary size:
        set var newsize to minimum dictionary size (8)
        while newsize less or equal than (2 or 4) * active slots:
            set newsize to newsize left shifted by 1 bit
    set oldtable to dictionary's table
    allocate new dictionary table
    set dictionary's mask to newsize - 1
    clear dictionary's table
    set dictionary's active slots (ma_used) to 0
    set var i to dictionary's active + dummy slots (ma_fill)
    set dictionary's active + dummy slots (ma_fill) to 0
    copy oldtable entries to dictionary's table using new mask
    return 0

This is what happens with our table during resizing: a new table of size 32 is allocated. Old table entries are inserted into the new table using the new mask value which is 31. We end up with the following:
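The copy step can be simulated to see where each key lands in the larger table. This is a sketch assuming the 8-slot layout reached in the walkthrough; `OLD_TABLE` and `resize` are names made up for the example.

```python
# Old 8-slot table as (key, value, hash) tuples, None for unused slots.
OLD_TABLE = [
    ("a", 1, 12416037344),
    ("y", 25, 15488046584),
    ("c", 3, 12672038114),
    ("b", 2, 12544037731),
    None,
    ("z", 26, 15616046971),
    None,
    ("x", 24, 15360046201),
]


def resize(old_table, newsize):
    """Reinsert every active entry into a fresh table using the new mask."""
    new_table = [None] * newsize
    mask = newsize - 1
    for entry in old_table:
        if entry is None:
            continue
        hash_value = entry[2]
        i = hash_value & mask
        perturb = hash_value
        while new_table[i & mask] is not None:
            i = (i << 2) + i + perturb + 1
            perturb >>= 5
        new_table[i & mask] = entry
    return new_table


new_table = resize(OLD_TABLE, 32)
print({entry[0]: index for index, entry in enumerate(new_table) if entry})
# {'a': 0, 'c': 2, 'b': 3, 'y': 24, 'x': 25, 'z': 27}
```

With the mask of 31 every key gets its own slot on the first try, so the collisions seen in the 8-slot table disappear.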

[Figure: Python dictionary table after resizing]

Removing items

PyDict_DelItem() is called to remove an entry. The hash for this key is calculated and the lookup function is called to return the entry. The key for this entry is set to the dummy key. Dummy entries are entries which contained a key in the past but have not been reused yet. The probe sequence uses the dummy information in case of collision to know that those entries held an active pair in the past, so it keeps probing instead of stopping there.

arguments: dictionary object, key
returns 0 if OK, -1 otherwise
function PyDict_DelItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    look for key in dictionary using hash
    if slot not found:
        return -1
    set slot's key to dummy
    set slot's value to null
    decrement dictionary active slots
    return 0
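To see why a removed entry becomes a dummy instead of an empty slot, consider the sketch below. It deletes ‘b’ from a small table in which ‘z’ had collided with ‘b’; `DUMMY`, `find` and `delete` are hypothetical names and the table representation is simplified.

```python
DUMMY = object()  # hypothetical marker for a removed entry


def find(table, key, hash_value):
    """Return the slot index of key, or None when it is absent."""
    mask = len(table) - 1
    i = hash_value & mask
    perturb = hash_value
    while table[i & mask] is not None:
        entry = table[i & mask]
        if entry is not DUMMY and entry[0] == key:
            return i & mask
        # Dummy or different key: keep following the probe chain.
        i = (i << 2) + i + perturb + 1
        perturb >>= 5
    return None


def delete(table, key, hash_value):
    """Replace the entry with a dummy marker; 0 if removed, -1 if not found."""
    index = find(table, key, hash_value)
    if index is None:
        return -1
    table[index] = DUMMY
    return 0


table = [None] * 8
table[3] = ("b", 2, 12544037731)
table[5] = ("z", 26, 15616046971)  # 'z' collided with 'b' and landed in slot 5

print(delete(table, "b", 12544037731))  # 0
print(find(table, "z", 15616046971))    # 5
```

If slot 3 were reset to an unused slot instead, the search for ‘z’ would stop there and wrongly report the key as missing.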

We want to remove the key ‘c’ from our dictionary. We end up with the following array:

[Figure: Python dictionary table after deleting a key]

Note that the delete operation doesn’t trigger an array resize, even if the number of used slots is much less than the total number of slots. However, when a key/value pair is added, the new size is computed from the number of used slots, so a resize triggered by an insert can also shrink the array, down to the minimum size of 8.

That’s it for now. I hope you enjoyed the article.
