Lecture 1: Course Intro and Hashing
Aim: how to design and analyze a variety of algorithms.
Graduate algorithms is largely about algorithms discovered since 1990. There has been a gradual shift in the emphasis and goals of CS as it matured as a field: its horizon broadened, and it started looking at new problems like big data, e-commerce, and bioinformatics.
Many more realistic concerns now enter the picture:
- Changing graphs: even formulating the problem is not easy.
- Changing data structures: data come from sources we do not control, e.g., noisy or inexact data.
- Changing notion of I/O: data may come from data streams, online sources, social network graphs, etc. It is hard to pin down the appropriate output.
- Analysis: from exact algorithms that work on all inputs to approximation.
Hashing
Preliminaries
We want to store a subset $S$, with $|S| = m$, of a huge universe $U$ (e.g., $|U| = 2^{32}$), and support three operations on $S$: insert, delete, and query. A hash table is exactly what we need.

Define a hash function $h: U \rightarrow [n]$. We allocate $n$ slots to store the elements hashed to values $0 \sim n-1$, and chain colliding elements in a linked list.

Two possible assumptions: 1. the input is random; 2. the input is fixed, but the hash function is random.
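The chaining scheme above can be sketched as follows (a minimal illustration; the class and method names are not from the lecture):

```python
# Hash table with chaining: n slots, colliding elements share a list.
class ChainedHashTable:
    def __init__(self, n, h):
        self.n = n          # number of slots
        self.h = h          # hash function U -> [n]
        self.slots = [[] for _ in range(n)]

    def insert(self, x):
        bucket = self.slots[self.h(x)]
        if x not in bucket:         # S is a set, avoid duplicates
            bucket.append(x)

    def delete(self, x):
        bucket = self.slots[self.h(x)]
        if x in bucket:
            bucket.remove(x)

    def query(self, x):
        return x in self.slots[self.h(x)]

table = ChainedHashTable(8, lambda x: x % 8)
table.insert(5)
table.insert(13)                    # 5 and 13 collide in slot 5, chained together
print(table.query(13), table.query(21))  # -> True False
```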
Hash Functions
How should we define what it means for a hash function to be random?
Ideally, for given $x_1,\dots,x_m \in S$ and arbitrary $a_1,\dots,a_m \in [n]$, a random family $\mathcal{H}$ should satisfy:
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1\right]=\frac{1}{n}$.
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1 \wedge h(x_2)=a_2\right]=\frac{1}{n^2}$. Pairwise independence.
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1 \wedge h(x_2)=a_2 \wedge \cdots \wedge h(x_k)=a_k\right]=\frac{1}{n^k}$. $k$-wise independence.
- $\Pr_{h \in \mathcal{H}}\left[h(x_1)=a_1 \wedge h(x_2)=a_2 \wedge \cdots \wedge h(x_m)=a_m\right]=\frac{1}{n^m}$. Full independence (note that $|U|=m$ here). In this case we have $n^m$ possible $h$ (we store $h(x)$ for each $x \in U$), so we need $m \log n$ bits to represent each hash function. Since $m$ is usually very large, this is not practical.
For any $x$, let $L_x$ be the length of the linked list containing $x$; then $L_x$ is just the number of elements with the same hash value as $x$. Let the random variable
$$I_y= \begin{cases}1 & \text{if } h(y)=h(x) \\ 0 & \text{otherwise}\end{cases}$$
So $L_x=1+\sum_{y \neq x} I_y$, and
$$E\left[L_x\right]=1+\sum_{y \neq x} E\left[I_y\right]=1+\frac{m-1}{n}$$
Note that we don’t need full independence to prove this property, and pairwise independence would actually suffice.
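A quick simulation can check the $E[L_x] = 1 + \frac{m-1}{n}$ formula empirically (a sketch; the function name and parameter choices are illustrative):

```python
import random

# For a truly random h on m keys, estimate E[L_x] for a fixed key x
# and compare against 1 + (m-1)/n.
def avg_chain_length(m, n, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        h = [rng.randrange(n) for _ in range(m)]   # fresh random h each trial
        # L_x for x = key 0: count keys sharing its hash value (including itself)
        total += sum(1 for v in h if v == h[0])
    return total / trials

m, n = 100, 50
print(avg_chain_length(m, n), 1 + (m - 1) / n)     # empirical vs. exact
```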
2-Universal Hash
A family $\mathcal{H}$ is 2-universal if for any $x \neq y \in U$, $\Pr_{h \in \mathcal{H}}[h(x)=h(y)] \leq \frac{1}{n}$. (This is weaker than 2-independence.)
Construction: pick a prime $p \in [|U|, 2|U|]$ and let $f_{a,b}(x)=ax+b \bmod p$ (with $a \neq 0$); the hash function is then $h_{a,b}(x) = f_{a,b}(x) \bmod n$.
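This construction can be sketched in a few lines (illustrative names; `make_hash` is not from the notes):

```python
import random

# h_{a,b}(x) = ((a*x + b) mod p) mod n, with p prime and a != 0.
def make_hash(p, n, rng):
    a = rng.randrange(1, p)   # a in {1, ..., p-1}
    b = rng.randrange(p)      # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % n

rng = random.Random(42)
p, n = 101, 10                # 101 is prime and exceeds a universe of size 100
h = make_hash(p, n, rng)
print(h(3), h(77))            # two hash values in [0, n)
```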
Since $[p]$ constitutes a finite field, for any fixed $s \neq t$ the equations $f_{a,b}(x_1)=s$ and $f_{a,b}(x_2)=t$ have the unique solution $a=\left(x_1-x_2\right)^{-1}(s-t)$ and $b=s-a x_1$. Since we have $p(p-1)$ different hash functions in $\mathcal{H}$ in this case,
$$\Pr_{h \in \mathcal{H}}\left[f_{a,b}\left(x_1\right)=s \wedge f_{a,b}\left(x_2\right)=t\right]=\frac{1}{p(p-1)}$$
Theorem: $\mathcal{H}=\left\{h_{a, b}: a, b \in[p] \wedge a \neq 0\right\}$ is 2-universal.
Proof: For any $x_1 \neq x_2$,
$$\begin{aligned}
& \operatorname{Pr}\left[h_{a, b}\left(x_1\right)=h_{a, b}\left(x_2\right)\right] \\
= & \sum_{s, t \in[p], s \neq t} \delta_{(s=t \bmod n)} \operatorname{Pr}\left[f_{a, b}\left(x_1\right)=s \wedge f_{a, b}\left(x_2\right)=t\right] \\
= & \frac{1}{p(p-1)} \sum_{s, t \in[p], s \neq t} \delta_{(s=t \bmod n)} \\
\leq & \frac{1}{p(p-1)} \cdot \frac{p(p-1)}{n} \\
= & \frac{1}{n}
\end{aligned}$$
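For a small prime, the 2-universality bound can be verified exhaustively by enumerating all $(a,b)$ (an illustrative check, not part of the proof):

```python
from fractions import Fraction

# Enumerate all hash functions h_{a,b} for p = 7, n = 3 and count
# collisions for one fixed pair x1 != x2.
p, n = 7, 3
x1, x2 = 2, 5
collisions = 0
total = 0
for a in range(1, p):          # a != 0
    for b in range(p):
        total += 1
        if ((a * x1 + b) % p) % n == ((a * x2 + b) % p) % n:
            collisions += 1
prob = Fraction(collisions, total)
print(prob, prob <= Fraction(1, n))   # the collision probability is at most 1/n
```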
Note that if we choose $n$ large enough that $n \geq m^2$, the expected number of collisions in the hash table is
$$E\Big[\sum_{x_1 \neq x_2} \mathbb{1}[h(x_1)=h(x_2)]\Big] \leq \binom{m}{2}\frac{1}{n} \leq \frac{1}{2}.$$
However, there is a more space-efficient approach: two-level hashing. If slot $i$ receives $s_i$ colliding elements, we build for it a secondary hash table of size $s_i^2$, which is large enough to accommodate all of them without collisions (by the argument above with $m = s_i$).

Note that $E(\sum_i s_i^2)=E(\sum_i s_i(s_i-1))+E(\sum_i s_i) =\frac{m(m-1)}{n}+m\leq 2m$ (taking $n = m$), so the total space is linear in expectation.
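A rough sketch of this two-level scheme, assuming each secondary table simply retries its hash function until it is collision-free (all names and the prime choice are illustrative):

```python
import random

# Two-level hashing: m top-level slots, then a size-s_i^2 collision-free
# secondary table for each slot i.
def build_two_level(keys, rng):
    m = len(keys)
    p = 10**9 + 7                       # a prime larger than the universe used here
    def fresh(n):
        a, b = rng.randrange(1, p), rng.randrange(p)
        return lambda x: ((a * x + b) % p) % n

    top = fresh(m)
    slots = [[] for _ in range(m)]
    for x in keys:
        slots[top(x)].append(x)

    tables = []
    for bucket in slots:
        size = len(bucket) ** 2
        while True:                     # retry until the secondary table is collision-free
            g = fresh(size) if size else (lambda x: 0)
            cells = [None] * size
            ok = True
            for x in bucket:
                i = g(x)
                if cells[i] is not None:
                    ok = False
                    break
                cells[i] = x
            if ok:
                tables.append((g, cells))
                break
    total = sum(len(c) for _, c in tables)
    return top, tables, total           # total space is O(m) in expectation

rng = random.Random(1)
keys = rng.sample(range(10**6), 200)
top, tables, total = build_two_level(keys, rng)

def query(x):
    g, cells = tables[top(x)]
    return bool(cells) and cells[g(x)] == x

print(all(query(x) for x in keys), total)
```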
Load Balancing
In the load-balancing problem, imagine throwing balls into bins. If we have $n$ balls and $n$ bins, each ball placed uniformly at random, then the probability that bin $i$ receives at least $k$ balls is $\leq \binom{n}{k}\frac{1}{n^k}$ (choose which $k$ balls land there, treating the other bins as a single bin) $\leq \frac{1}{k!}$.
By Stirling's formula, choosing $k=O(\frac{\log n}{\log \log n})$ gives $\frac{1}{k!}\leq \frac{1}{n^2}$. Hence, by a union bound over the $n$ bins, the probability that some bin has at least $k$ balls is $\leq \frac{1}{n}$.
Therefore, with probability at least $1-\frac{1}{n}$, the maximum load is $O(\frac{\log n}{\log \log n})$.
Improvement: when a ball arrives, pick two bins at random and place it in the one with fewer balls. Then the maximum load is $O(\log \log n)$ with high probability, a huge improvement!
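Both placement rules are easy to simulate and compare (a sketch; function names are illustrative):

```python
import random

# n balls into n bins: one uniform choice per ball.
def max_load_one_choice(n, rng):
    load = [0] * n
    for _ in range(n):
        load[rng.randrange(n)] += 1
    return max(load)

# "Power of two choices": pick two bins, use the less-loaded one.
def max_load_two_choices(n, rng):
    load = [0] * n
    for _ in range(n):
        i, j = rng.randrange(n), rng.randrange(n)
        load[i if load[i] <= load[j] else j] += 1
    return max(load)

rng = random.Random(0)
n = 10**5
m1 = max_load_one_choice(n, rng)
m2 = max_load_two_choices(n, rng)
print(m1, m2)   # the two-choice max load is typically much smaller
```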
