Lecture 1: Course Intro and Hashing
Aim: how to design and analyze a variety of algorithms.
Graduate algorithms is largely about algorithms discovered since 1990. There has been a gradual shift in the emphasis and goals of CS as it matured as a field: its horizon broadened, and it started looking at new problems like big data, e-commerce, and bioinformatics.
Many more realistic concerns now enter the picture:
- Changing graphs: even formulating the problem is not easy.
- Changing data structures: data come from sources we do not control, e.g., noisy or inexact data.
- Changing notion of I/O: data may come from data streams, online sources, social network graphs, etc. It is hard to pin down the appropriate output.
- Analysis: from exact algorithms that work on all inputs to approximation.
Hashing
Preliminaries
We want to store a subset $S$, with $|S| = m$, of a huge universe $U$ (e.g., $|U| = 2^{32}$), and support three operations on $S$: insert, delete, and query. A hash table is exactly what we need.

Define a hash function $h: U \rightarrow [n]$. We allocate $n$ slots to store the elements hashed to values $0 \sim n-1$, and chain colliding elements in a linked list.

Two possible assumptions: 1. the input is random; 2. the input is fixed, but the hash function is random.
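The chaining scheme above can be sketched as follows (a minimal illustration; the class and method names are not from the lecture):

```python
# Hash table with chaining: n slots, colliding elements share a list.
class ChainedHashTable:
    def __init__(self, n, h):
        self.n = n          # number of slots
        self.h = h          # hash function U -> [n]
        self.slots = [[] for _ in range(n)]

    def insert(self, x):
        bucket = self.slots[self.h(x)]
        if x not in bucket:         # S is a set, avoid duplicates
            bucket.append(x)

    def delete(self, x):
        bucket = self.slots[self.h(x)]
        if x in bucket:
            bucket.remove(x)

    def query(self, x):
        return x in self.slots[self.h(x)]

table = ChainedHashTable(8, lambda x: x % 8)
table.insert(5)
table.insert(13)                    # 5 and 13 collide in slot 5, chained together
print(table.query(13), table.query(21))  # -> True False
```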
Hash Functions
How should we define what it means for a hash function to be random?
Ideally, for given $x_1,\dots,x_m \in S$ and arbitrary $a_1,\dots,a_m \in [n]$, a random family $\mathcal{H}$ should satisfy:
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1\right]=\frac{1}{n}$.
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1 \wedge h(x_2)=a_2\right]=\frac{1}{n^2}$. Pairwise independence.
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1 \wedge h(x_2)=a_2 \wedge \cdots \wedge h(x_k)=a_k\right]=\frac{1}{n^k}$. $k$-wise independence.
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1 \wedge h(x_2)=a_2 \wedge \cdots \wedge h(x_m)=a_m\right]=\frac{1}{n^m}$. Full independence (note that $|U|=m$ here). In this case we have $n^m$ possible $h$ (we store $h(x)$ for each $x \in U$), so we need $m \log n$ bits to represent each hash function. Since $m$ is usually very large, this is not practical.
For any $x$, let $L_x$ be the length of the linked list containing $x$; then $L_x$ is just the number of elements with the same hash value as $x$. Let the random variable
$$I_y= \begin{cases}1 & \text{if } h(y)=h(x) \\ 0 & \text{otherwise}\end{cases}$$
So $L_x=1+\sum_{y \neq x} I_y$, and
$$E\left[L_x\right]=1+\sum_{y \neq x} E\left[I_y\right]=1+\frac{m-1}{n}$$
Note that we don’t need full independence to prove this property, and pairwise independence would actually suffice.
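A quick simulation can check the $E[L_x] = 1 + \frac{m-1}{n}$ formula empirically (a sketch; the function name and parameter choices are illustrative):

```python
import random

# For a truly random h on m keys, estimate E[L_x] for a fixed key x
# and compare against 1 + (m-1)/n.
def avg_chain_length(m, n, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        h = [rng.randrange(n) for _ in range(m)]   # fresh random h each trial
        # L_x for x = key 0: count keys sharing its hash value (including itself)
        total += sum(1 for v in h if v == h[0])
    return total / trials

m, n = 100, 50
print(avg_chain_length(m, n), 1 + (m - 1) / n)     # empirical vs. exact
```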
2-Universal Hash
A family $\mathcal{H}$ is 2-universal if for any $x \neq y \in U$, $\Pr_{h \in \mathcal{H}}[h(x)=h(y)] \leq \frac{1}{n}$. (This is weaker than 2-independence.)
Construction: pick a prime $p \in [|U|, 2|U|]$ and let $f_{a,b}(x)=ax+b \bmod p$ (with $a \neq 0$); the hash function is then $h_{a,b}(x) = f_{a,b}(x) \bmod n$.
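This construction can be sketched in a few lines (illustrative names; `make_hash` is not from the notes):

```python
import random

# h_{a,b}(x) = ((a*x + b) mod p) mod n, with p prime and a != 0.
def make_hash(p, n, rng):
    a = rng.randrange(1, p)   # a in {1, ..., p-1}
    b = rng.randrange(p)      # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % n

rng = random.Random(42)
p, n = 101, 10                # 101 is prime and exceeds a universe of size 100
h = make_hash(p, n, rng)
print(h(3), h(77))            # two hash values in [0, n)
```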
Since $[p]$ constitutes a finite field, for any fixed $s \neq t$ the equations $f_{a,b}(x_1)=s$ and $f_{a,b}(x_2)=t$ have the unique solution $a=\left(x_1-x_2\right)^{-1}(s-t)$ and $b=s-a x_1$. Since we have $p(p-1)$ different hash functions in $\mathcal{H}$ in this case,
$$\Pr_{h \in \mathcal{H}}\left[f_{a,b}\left(x_1\right)=s \wedge f_{a,b}\left(x_2\right)=t\right]=\frac{1}{p(p-1)}$$
Theorem: $\mathcal{H}=\left\{h_{a, b}: a, b \in[p] \wedge a \neq 0\right\}$ is 2-universal.
Proof: For any $x_1 \neq x_2$,
$$\begin{aligned}
& \operatorname{Pr}\left[h_{a, b}\left(x_1\right)=h_{a, b}\left(x_2\right)\right] \\
= & \sum_{s, t \in[p], s \neq t} \delta_{(s=t \bmod n)} \operatorname{Pr}\left[f_{a, b}\left(x_1\right)=s \wedge f_{a, b}\left(x_2\right)=t\right] \\
= & \frac{1}{p(p-1)} \sum_{s, t \in[p], s \neq t} \delta_{(s=t \bmod n)} \\
\leq & \frac{1}{p(p-1)} \cdot \frac{p(p-1)}{n} \\
= & \frac{1}{n}
\end{aligned}$$
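For a small prime, the 2-universality bound can be verified exhaustively by enumerating all $(a,b)$ (an illustrative check, not part of the proof):

```python
from fractions import Fraction

# Enumerate all hash functions h_{a,b} for p = 7, n = 3 and count
# collisions for one fixed pair x1 != x2.
p, n = 7, 3
x1, x2 = 2, 5
collisions = 0
total = 0
for a in range(1, p):          # a != 0
    for b in range(p):
        total += 1
        if ((a * x1 + b) % p) % n == ((a * x2 + b) % p) % n:
            collisions += 1
prob = Fraction(collisions, total)
print(prob, prob <= Fraction(1, n))   # the collision probability is at most 1/n
```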
Note that if we choose $n$ large enough that $n \geq m^2$, the expected number of collisions in the hash table is
$$E\Big[\sum_{x_1 \neq x_2} \mathbb{1}[h(x_1)=h(x_2)]\Big] \leq \binom{m}{2}\frac{1}{n} \leq \frac{1}{2}.$$
However, there is a more space-efficient approach: two-level hashing. If slot $i$ receives $s_i$ colliding elements, we build for it a secondary hash table of size $s_i^2$, which is large enough to accommodate all of them without collisions (by the argument above with $m = s_i$).

Note that $E(\sum_i s_i^2)=E(\sum_i s_i(s_i-1))+E(\sum_i s_i) =\frac{m(m-1)}{n}+m\leq 2m$ (taking $n = m$), so the total space is linear in expectation.
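A rough sketch of this two-level scheme, assuming each secondary table simply retries its hash function until it is collision-free (all names and the prime choice are illustrative):

```python
import random

# Two-level hashing: m top-level slots, then a size-s_i^2 collision-free
# secondary table for each slot i.
def build_two_level(keys, rng):
    m = len(keys)
    p = 10**9 + 7                       # a prime larger than the universe used here
    def fresh(n):
        a, b = rng.randrange(1, p), rng.randrange(p)
        return lambda x: ((a * x + b) % p) % n

    top = fresh(m)
    slots = [[] for _ in range(m)]
    for x in keys:
        slots[top(x)].append(x)

    tables = []
    for bucket in slots:
        size = len(bucket) ** 2
        while True:                     # retry until the secondary table is collision-free
            g = fresh(size) if size else (lambda x: 0)
            cells = [None] * size
            ok = True
            for x in bucket:
                i = g(x)
                if cells[i] is not None:
                    ok = False
                    break
                cells[i] = x
            if ok:
                tables.append((g, cells))
                break
    total = sum(len(c) for _, c in tables)
    return top, tables, total           # total space is O(m) in expectation

rng = random.Random(1)
keys = rng.sample(range(10**6), 200)
top, tables, total = build_two_level(keys, rng)

def query(x):
    g, cells = tables[top(x)]
    return bool(cells) and cells[g(x)] == x

print(all(query(x) for x in keys), total)
```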
Load Balancing
In the load-balancing problem, imagine throwing balls into bins. If we have $n$ balls and $n$ bins, each ball placed uniformly at random, then the probability that bin $i$ receives at least $k$ balls is $\leq \binom{n}{k}\frac{1}{n^k}$ (choose which $k$ balls land there, treating the other bins as a single bin) $\leq \frac{1}{k!}$.
By Stirling's formula, choosing $k=O(\frac{\log n}{\log \log n})$ gives $\frac{1}{k!}\leq \frac{1}{n^2}$. Hence, by a union bound over the $n$ bins, the probability that some bin has at least $k$ balls is $\leq \frac{1}{n}$.
Therefore, with probability at least $1-\frac{1}{n}$, the maximum load is $O(\frac{\log n}{\log \log n})$.
Improvement: when a ball arrives, pick two bins at random and place it in the one with fewer balls. Then the maximum load is $O(\log \log n)$ with high probability, a huge improvement!
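Both placement rules are easy to simulate and compare (a sketch; function names are illustrative):

```python
import random

# n balls into n bins: one uniform choice per ball.
def max_load_one_choice(n, rng):
    load = [0] * n
    for _ in range(n):
        load[rng.randrange(n)] += 1
    return max(load)

# "Power of two choices": pick two bins, use the less-loaded one.
def max_load_two_choices(n, rng):
    load = [0] * n
    for _ in range(n):
        i, j = rng.randrange(n), rng.randrange(n)
        load[i if load[i] <= load[j] else j] += 1
    return max(load)

rng = random.Random(0)
n = 10**5
m1 = max_load_one_choice(n, rng)
m2 = max_load_two_choices(n, rng)
print(m1, m2)   # the two-choice max load is typically much smaller
```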
