MIT Introduction to Algorithms Record-8

Latest recommended article published on 2024-11-06 04:29:48

gehancoder


License: CC 4.0 BY-SA

Category: Algorithm. Tags: Algorithm, hash

Article link: https://blog.youkuaiyun.com/myxyhg/article/details/51695097


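This article examines the weakness of hash functions and how to address it: choosing the hash function at random lowers the collision probability. It introduces a method for constructing a universal hash function and shows how a two-level hashing scheme yields perfect hashing, which guarantees worst-case search efficiency.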


Weakness of hashing: For any choice of hash function, there always exists a bad set of keys that all hash to the same slot.
Idea: Choose the hash function at random, independently of the keys.

Universal Hashing

Definition1: Let $U$ be a universe of keys, and let $H$ be a finite collection of hash functions, each mapping $U$ to $\{0,1,...,m-1\}$.

$H$ is universal if, for all $x,y \in U$ with $x \neq y$,

$\vert\{h \in H : h(x) = h(y)\}\vert=\frac{\vert H\vert}{m}$ .

I.e., if $h$ is chosen randomly from $H$, the probability of a collision between $x$ and $y$ is $\frac1m$.

Theorem1:
Choose $h$ randomly from $H$, and suppose we hash $n$ keys into $m$ slots in table $T$. Then for a given key $x$, the expected number of collisions with $x$ is:

$E\left[\#\text{collisions with }x\right] \lt \frac{n}{m}$

Proof:
Let $C_x$ be the random variable denoting the total number of collisions of keys in $T$ with $x$, and let

$c_{xy}=\begin{cases} 1,&\text{if }h(x)=h(y) \\ 0,&\text{otherwise} \end{cases}$

Note: $E[c_{xy}]=\frac1m$ and $C_x=\sum_{y\in T-\{x\}} c_{xy}$, where $y$ ranges over the keys in table $T$ other than $x$.

So, by linearity of expectation:

$E[C_x]=E\left[\sum_{y\in T-\{x\}} c_{xy}\right] =\sum_{y\in T-\{x\}} E[c_{xy}] =\sum_{y\in T-\{x\}}\frac1m=\frac{n-1}m \lt \frac nm$

Constructing a universal hash function

Let $m$ be prime, and decompose key $k$ into $r+1$ digits, so

$k=\langle k_0,k_1,...,k_r\rangle\ \text{where }0\le k_i \le m-1$

Here we treat $k$ as an $(r+1)$-digit base-$m$ number.
Now pick $a$ at random, also viewed as a base-$m$ number: $a=\langle a_0,a_1,...,a_r\rangle$, where each $a_i$ is chosen randomly from $\{0,1,...,m-1\}$. The hash function is then defined as follows.

Definition2:

$h_a(k)=\left(\sum_{i=0}^r a_ik_i\right) \bmod m$
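As a minimal sketch of Definition2, the Python below evaluates $h_a(k)$ for an integer key. The function names (`to_digits`, `h`, `random_a`) are hypothetical, and storing the least significant digit as $k_0$ is an assumption, since the notes do not fix the digit order.

```python
import random

def to_digits(k, m, r):
    """Split integer k (0 <= k < m**(r+1)) into r+1 base-m digits, least significant first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    return digits

def h(a, k, m):
    """h_a(k) = (sum of a_i * k_i) mod m, with the key k given as an integer."""
    r = len(a) - 1
    return sum(ai * ki for ai, ki in zip(a, to_digits(k, m, r))) % m

def random_a(m, r, rng=random):
    """Pick a = <a_0, ..., a_r> with each a_i uniform in {0, ..., m-1}."""
    return [rng.randrange(m) for _ in range(r + 1)]

m, r = 7, 2                      # m must be prime
a = random_a(m, r)
print(a, h(a, 100, m))           # slot of key 100 under the randomly chosen h_a
print(h([3, 1, 4], 100, m))      # digits of 100 are [2, 0, 2], so (6 + 0 + 8) % 7 = 0
```

Choosing `a` once, before any keys are seen, is what makes the choice independent of the keys.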

How big is this set of hash functions, i.e., how many different hash functions does $H$ contain?
Conclusion:

$\vert H\vert = m^{r+1}$

Explanation:
There are $m$ choices for each $a_i$ ($0 \le a_i \le m-1$), and the vector $a=\langle a_0,a_1,...,a_r\rangle$ has $r+1$ elements.

Theorem2: $H$ is universal.

Proof:
Let $x=\langle x_0,x_1,...,x_r\rangle$ and $y=\langle y_0,y_1,...,y_r\rangle$ be distinct keys, so they differ in at least one digit. They could differ in any of the digits; without loss of generality, assume they differ in position 0.

Question: For how many hash functions $h_a \in H$ do $x$ and $y$ collide?

They collide exactly when $h_a(x) = h_a(y)$.

$\Rightarrow \quad \left(\sum_{i=0}^r a_ix_i\right) \bmod m = \left(\sum_{i=0}^r a_iy_i\right) \bmod m$

$\Rightarrow \quad \sum_{i=0}^r a_ix_i \equiv \sum_{i=0}^r a_iy_i \pmod m \quad\text{(reduce modulo $m$ only at the end)}$

$\Rightarrow \quad \sum_{i=0}^r a_i(x_i-y_i) \equiv 0 \pmod m$

$\Rightarrow \quad a_0(x_0-y_0)+\sum_{i=1}^r a_i(x_i-y_i) \equiv 0 \pmod m$

$\Rightarrow \quad a_0(x_0-y_0) \equiv -\sum_{i=1}^r a_i(x_i-y_i) \pmod m$

Since $x_0 \neq y_0$, the Number Theory Fact below guarantees that $(x_0-y_0)^{-1}$ exists, so

$a_0 \equiv \left(-\sum_{i=1}^r a_i(x_i-y_i)\right)(x_0-y_0)^{-1} \pmod m$
Number Theory Fact:

Let $m$ be prime. For any $z \in \mathcal{Z}_m$ (the integers modulo $m$) with $z \not\equiv 0$, there exists a unique inverse $z^{-1} \in \mathcal{Z}_m$ such that multiplying $z$ by its inverse gives something congruent to 1 mod $m$.
I.e.
$z \not\equiv 0 \ \Rightarrow\ \exists\ \text{unique}\ z^{-1} \in \mathcal{Z}_m \text{ such that } zz^{-1} \equiv 1 \pmod m$ .
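To make the Number Theory Fact concrete, here is a small sketch that lists the inverse of every nonzero element for $m=7$. It computes the inverse as $z^{m-2} \bmod m$, which relies on Fermat's little theorem; that theorem is a standard result brought in for illustration and is not part of these notes. The name `mod_inverse` is hypothetical.

```python
def mod_inverse(z, m):
    """Inverse of z modulo a prime m, computed as z^(m-2) mod m (Fermat's little theorem)."""
    if z % m == 0:
        raise ValueError("0 has no inverse modulo m")
    return pow(z, m - 2, m)

m = 7
inverses = {z: mod_inverse(z, m) for z in range(1, m)}
print(inverses)   # {1: 1, 2: 4, 3: 5, 4: 2, 5: 3, 6: 6}

# Each nonzero z has exactly one partner w in {1, ..., m-1} with z * w ≡ 1 (mod m).
for z in range(1, m):
    partners = [w for w in range(1, m) if (z * w) % m == 1]
    assert partners == [inverses[z]]
```

If $m$ is not prime the fact can fail: for example, 2 has no inverse modulo 4.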

Conclusion:
Thus, for any choice of $a_1,a_2,...,a_r$, exactly 1 of the $m$ choices for $a_0$ causes $x$ and $y$ to collide, and the other $m-1$ choices for $a_0$ cause no collision.
So the number of $h_a$'s that cause $x$ and $y$ to collide is:

$m \cdot m \cdots m \cdot 1=m^r=\frac{\vert H \vert}m$

because each of $a_1, a_2, ..., a_r$ has $m$ choices, but $a_0$ has only 1 choice that causes a collision.
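The count $m^r$ can be checked by brute force for small parameters. The sketch below enumerates every vector $a$ for $m=5$, $r=2$ and counts those under which two fixed keys collide. Keys are given directly as digit tuples, and the function names are hypothetical.

```python
from itertools import product

def h(a, key_digits, m):
    """h_a(k) for a key already split into base-m digits."""
    return sum(ai * ki for ai, ki in zip(a, key_digits)) % m

def count_colliding(x, y, m):
    """Count vectors a in {0..m-1}^(r+1) with h_a(x) == h_a(y), by exhaustive enumeration."""
    return sum(1 for a in product(range(m), repeat=len(x)) if h(a, x, m) == h(a, y, m))

m = 5
x, y = (1, 2, 3), (4, 2, 3)      # differ only in position 0
print(count_colliding(x, y, m), m ** len(x) // m)   # both are 25 = m^r
```

The same count comes out when the keys differ in another position, for example `(0, 1, 2)` and `(0, 3, 2)`, which matches the remark that the choice of position 0 loses no generality.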

Perfect Hashing

Situation: Given $n$ keys, construct a static hash table of size $m=O(n)$ such that search takes $O(1)$ time in the worst case.

Idea: Use a 2-level scheme with universal hashing at both levels, arranged so that there are no collisions at level 2; collisions at level 1 are tolerated.
If $n_i$ items hash to level-1 slot $i$, then use $m_i=n_i^2$ slots in the level-2 table $s_i$.
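The 2-level scheme can be sketched as follows. Note the assumptions: the table sizes $n$ and $n_i^2$ are generally not prime, so instead of the dot-product family this sketch swaps in a different well-known family, $h(k)=((ak+b) \bmod p) \bmod m$ with a fixed prime $p$; keys are assumed to be distinct integers in $[0, p)$ and the key set is assumed non-empty. All names are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; keys are assumed to lie in [0, P)

def random_hash(m, rng):
    """Random member of the family k -> ((a*k + b) mod P) mod m."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def place(bucket, rng):
    """Retry random level-2 functions until the bucket fits into n_i^2 slots with no collision."""
    size = len(bucket) ** 2
    while True:
        h2 = random_hash(size, rng) if size else None
        slots = [None] * size
        for k in bucket:
            j = h2(k)
            if slots[j] is not None:
                break                      # collision: try another h2
            slots[j] = k
        else:
            return h2, slots               # every key placed without collision

def build_perfect_table(keys, seed=0):
    rng = random.Random(seed)
    n = len(keys)
    h1 = random_hash(n, rng)               # level 1: m = n slots
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)
    return h1, [place(bucket, rng) for bucket in buckets]

def contains(table, k):
    """Worst-case O(1) search: one level-1 hash, one level-2 hash, one comparison."""
    h1, level2 = table
    h2, slots = level2[h1(k)]
    return bool(slots) and slots[h2(k)] == k

keys = [3, 17, 42, 1000, 65536, 99999, 123456789]
table = build_perfect_table(keys)
print(all(contains(table, k) for k in keys), contains(table, 5))   # True False
```

This sketch does not retry the level-1 function, so it places no bound on total storage; it only illustrates the collision-free second level.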

Level-2 Analysis

Theorem: Hash $n$ keys into $m=n^2$ slots using a random $h$ from a universal $H$. Then

$E[\#\text{collisions}] \lt \frac{1}{2}$

Proof: The probability that 2 given keys collide under $h$ is $\frac1m=\frac1{n^2}$. There are ${n \choose 2}$ pairs of keys, so

$E[\#\text{collisions}]={n \choose 2}\cdot\frac1{n^2}=\frac{n(n-1)}2 \cdot \frac1{n^2} \lt \frac{1}{2}$

Note: ${n \choose 2}=C_n^2=\frac{n(n-1)}{2}$ .
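As a quick arithmetic check of the bound, the sketch below evaluates ${n \choose 2}\cdot\frac{1}{n^2}$ exactly for a few values of $n$. The function name is hypothetical.

```python
from fractions import Fraction

def expected_collisions(n):
    """Exact value of C(n, 2) * 1/n^2 for n keys hashed into m = n^2 slots."""
    pairs = n * (n - 1) // 2
    return Fraction(pairs, n * n)

for n in (2, 10, 100):
    print(n, expected_collisions(n))   # 1/4, 9/20, 99/200
```

The value equals $\frac12 - \frac1{2n}$, so it approaches $\frac12$ from below as $n$ grows but never reaches it.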

Markov inequality

For a random variable $X \ge 0$ and any $t \gt 0$, $\Pr\{X \ge t\} \le \frac{E[X]}{t}$ .

Proof:

$E[X]=\sum_{x=0}^\infty x\cdot\Pr\{X=x\} \ge \sum_{x=t}^\infty x\cdot\Pr\{X=x\}$

$\Rightarrow \quad E[X] \ge \sum_{x=t}^\infty t\cdot\Pr\{X=x\}=t\cdot\Pr\{X \ge t\}$

Dividing both sides by $t$ gives the inequality.
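For a concrete instance of the inequality, the sketch below compares both sides exactly for a fair six-sided die. The die is an illustrative example that does not come from the notes, and `markov_check` is a hypothetical name.

```python
from fractions import Fraction

def markov_check(pmf, t):
    """pmf maps non-negative values to probabilities. Returns (Pr{X >= t}, E[X] / t)."""
    expectation = sum(x * p for x, p in pmf.items())
    tail = sum(p for x, p in pmf.items() if x >= t)
    return tail, expectation / t

die = {face: Fraction(1, 6) for face in range(1, 7)}
tail, bound = markov_check(die, 5)
print(tail, bound)   # 1/3 7/10
```

The bound is loose here (1/3 versus 7/10), which is typical: Markov's inequality uses only the expectation and nothing else about the distribution.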

Corollary:

$\Pr\{\text{no collisions}\} \ge \frac{1}{2}$
We can use the Markov inequality to prove this corollary.

Proof:

$\Pr\{\ge 1\text{ collision}\} \le \frac{E[\#\text{collisions}]}{1} \lt \frac{1}{2}$

$\Rightarrow \quad \Pr\{\text{no collisions}\}=1-\Pr\{\ge 1\text{ collision}\} \ge \frac{1}{2}$

Conclusion: To find a good level-2 hash function, just test a few at random; one will be found quickly, since at least half of them work.
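The "at least half will work" claim can be checked with a rough simulation. This sketch does not sample from a specific universal family; it models an idealized random hash by giving each of the $n$ keys an independent uniform slot among $n^2$, which is an assumption made for simplicity. The function name is hypothetical.

```python
import random

def collision_free_fraction(n, trials, seed=0):
    """Fraction of trials in which n keys land in n distinct slots out of n^2."""
    rng = random.Random(seed)
    m = n * n
    good = 0
    for _ in range(trials):
        slots = [rng.randrange(m) for _ in range(n)]   # idealized random hash values
        if len(set(slots)) == n:                       # all slots distinct: no collision
            good += 1
    return good / trials

print(collision_free_fraction(10, 2000))
```

Under this idealized model the exact probability for $n=10$ is about 0.63, comfortably above the guaranteed $\frac12$, so a handful of random attempts is almost always enough.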

Analysis of storage

For level-1 choose $m=n$ , and let $n_i$ be the random variable for the number of keys that hash to slot $i$ in table T, use $m_i=n_i^2$ slots in each level-2 table $s_i$ , so