Data Mining and Analysis Course Notes
- Reference textbook: Data Mining and Analysis, by MOHAMMED J. ZAKI and WAGNER MEIRA JR.
Contents
Chapter 1: Preliminaries
1.1 The Data Matrix
Def.1. A data matrix is an $(n\times d)$ matrix
$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1 d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2 d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n 1} & x_{n 2} & \cdots & x_{n d} \end{array}\right)$$
Rows: entities (points); columns: attributes.
Ex. The Iris data matrix:
$$\left(\begin{array}{c|ccccc} & \text{sepal length} & \text{sepal width} & \text{petal length} & \text{petal width} & \text{class} \\ & X_{1} & X_{2} & X_{3} & X_{4} & X_{5} \\ \hline \mathbf{x}_{1} & 5.9 & 3.0 & 4.2 & 1.5 & \text{Versicolor} \\ \end{array}\right)$$
1.2 Attributes
Def.2.
- A numeric attribute is one that takes real (or integer) values.
- If the range of a numeric attribute is a finite or countably infinite set, it is a discrete numeric attribute. If it takes only two values, it is a binary attribute.
- If the range of a numeric attribute is not discrete, it is a continuous numeric attribute.
Def.3. A categorical attribute is one whose values are symbols.
1.3 Algebraic and Geometric View
Assume all attributes in $\mathbf{D}$ are numeric, so that the rows are
$$\mathbf{x}_{i}=\left(x_{i 1}, x_{i 2}, \ldots, x_{i d}\right)^{T} \in \mathbb{R}^{d},\quad i=1,\cdots,n$$
and the columns are
$$X_{j}=\left(x_{1 j}, x_{2 j}, \ldots, x_{n j}\right)^{T} \in \mathbb{R}^{n},\quad j=1,\cdots,d$$
☆ By default, all vectors are column vectors.
1.3.1 Distance and Angle
Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{d}$.
- Dot product: $\mathbf{a}^{T}\mathbf{b}=\sum\limits_{i=1}^{d} a_ib_i$
- Length (Euclidean norm): $\|\mathbf{a}\| =\sqrt{\mathbf{a}^{T}\mathbf{a}} =\sqrt{\sum\limits_{i=1}^{d} a_i^2}$; normalization: $\frac{\mathbf{a}}{\|\mathbf{a}\|}$
- Distance: $\delta(\mathbf{a},\mathbf{b})=\|\mathbf{a}-\mathbf{b}\|=\sqrt{\sum\limits_{i=1}^{d}(a_i-b_i)^2}$
- Angle: $\cos \theta =\left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^{T}\left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right)$, i.e., the dot product of the normalized vectors
- Orthogonality: $\mathbf{a}$ and $\mathbf{b}$ are orthogonal if $\mathbf{a}^{T}\mathbf{b}=0$
1.3.2 Mean and Total Variance
Def.4.
- Mean: $mean(\mathbf{D})=\hat{\boldsymbol{\mu}}=\frac{1}{n} \sum\limits_{i=1}^n\mathbf{x}_i \in \mathbb{R}^{d}$
- Total variance: $var(\mathbf{D})=\frac{1}{n} \sum\limits_{i=1}^{n} \delta\left(\mathbf{x}_{i}, \hat{\boldsymbol{\mu}}\right)^{2}$
Verify yourself: $var(\mathbf{D})=\frac{1}{n} \sum\limits_{i=1}^{n}\|\mathbf{x}_{i}- \hat{\boldsymbol{\mu}}\|^2=\frac{1}{n} \sum\limits_{i=1}^{n}\|\mathbf{x}_{i}\|^2-\|\hat{\boldsymbol{\mu}}\|^2$
- Centered data matrix: $center(\mathbf{D})=\begin{pmatrix} \mathbf{x}_{1}^T - \hat{\boldsymbol{\mu}}^T\\ \vdots \\ \mathbf{x}_{n}^T - \hat{\boldsymbol{\mu}}^T \end{pmatrix}$
Clearly the mean of $center(\mathbf{D})$ is $\mathbf{0}\in \mathbb{R}^{d}$; a quick numeric check follows below.
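A minimal numpy sketch of these definitions (the data values are made up for illustration):

```python
import numpy as np

# Toy data matrix D: n = 4 points, d = 2 attributes
D = np.array([[5.9, 3.0],
              [6.1, 2.8],
              [5.5, 2.4],
              [6.0, 3.4]])
n = D.shape[0]

mu = D.mean(axis=0)                    # mean(D): component-wise sample mean
Z = D - mu                             # center(D): subtract the mean from every row
total_var = np.sum((D - mu) ** 2) / n  # var(D): average squared distance to the mean

# Equivalent form: mean of squared norms minus squared norm of the mean
assert np.isclose(total_var, (D ** 2).sum() / n - mu @ mu)
# The centered matrix has (numerically) zero mean
assert np.allclose(Z.mean(axis=0), 0.0)
```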
1.3.3 Orthogonal Projection
Def.5. For $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{d}$, the orthogonal decomposition of $\mathbf{b}$ along $\mathbf{a}$ writes $\mathbf{b}$ as $\mathbf{b}= \mathbf{p}+ \mathbf{r}$, where $\mathbf{p}$ is the orthogonal projection of $\mathbf{b}$ onto $\mathbf{a}$, and $\mathbf{r}$ is the residual component of $\mathbf{b}$, perpendicular to $\mathbf{a}$.
Assume $\mathbf{a}\ne\mathbf{0},\ \mathbf{b}\ne\mathbf{0}$.
Write $\mathbf{p}=c\cdot\mathbf{a}$ with $c \in \mathbb{R}$; then $\mathbf{r}=\mathbf{b}-\mathbf{p}=\mathbf{b}-c\mathbf{a}$.
Requiring $0 = \mathbf{p}^T\mathbf{r} = (c\cdot\mathbf{a})^T(\mathbf{b}-c\mathbf{a})=c\cdot(\mathbf{a}^T\mathbf{b}-c\cdot\mathbf{a}^T\mathbf{a})$ gives
$$c= \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}},\qquad \mathbf{p}=\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\cdot\mathbf{a}$$
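A small numpy sketch of the decomposition (vectors chosen arbitrarily):

```python
import numpy as np

a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 3.0, 2.0])

c = (a @ b) / (a @ a)   # projection coefficient c = a^T b / a^T a
p = c * a               # orthogonal projection of b onto a
r = b - p               # residual, perpendicular to a

assert np.isclose(p @ r, 0.0)   # p ⟂ r
assert np.allclose(p + r, b)    # b = p + r
```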
1.3.4 Linear Independence and Dimension
Same as in linear algebra; read on your own.
1.4 Probabilistic View
Each numeric attribute $X$ is viewed as a random variable, i.e., $X:\mathcal{O}\rightarrow \mathbb{R}$,
where $\mathcal{O}$, the domain of $X$, is the set of all possible experimental outcomes (the sample space), and $\mathbb{R}$, the range of $X$, is the set of all real numbers.
☆ Notes:
- A random variable is a function.
- If $\mathcal{O}$ is itself numeric (i.e., $\mathcal{O}\subseteq \mathbb{R}$), then $X$ is the identity function, $X(v)=v$.
- If the range of $X$ is a finite or countably infinite set, $X$ is called a discrete random variable; otherwise it is a continuous random variable.
Def.6. If $X$ is discrete, the probability mass function (PMF) of $X$ is
$$\forall x \in \mathbb{R},\quad f(x)=P(X=x)$$
Note: $f(x)\ge0$ and $\sum\limits_xf(x)=1$; also $f(x)=0$ if $x$ is not in the range of $X$.
Def.7. If $X$ is continuous, the probability density function (PDF) of $X$ satisfies
$$P(X\in [a,b])=\int_{a}^{b} f(x)\,dx$$
Note: $f(x)\ge0$ and $\int_{-\infty}^{+\infty}f(x)\,dx=1$.
Def.8. For any random variable $X$, the cumulative distribution function (CDF) is defined as
$$F:\mathbb{R}\to[0,1],\quad \forall x\in \mathbb{R},\ F(x)=P(X\le x)$$
If $X$ is discrete, $F(x)=\sum\limits_{u\le x}f(u)$.
If $X$ is continuous, $F(x)=\int_{-\infty}^xf(u)\,du$.
1.4.1 Bivariate Random Variables
$\mathbf{X}=\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, $\mathbf{X}:\mathcal{O}\to\mathbb{R}^2$, where $X_1$ and $X_2$ are two random variables.
Many concepts were skipped in class; they are filled in here.
Def.9. If $X_1$ and $X_2$ are both discrete, the joint probability mass function of $\mathbf{X}$ is defined as
$$f(\mathbf{x})=f(x_1,x_2)=P(X_1=x_1,X_2=x_2)=P(\mathbf{X}=\mathbf{x})$$
Note: $f(\mathbf{x})\ge0$ and $\sum\limits_{x_1}\sum\limits_{x_2}f(x_1,x_2)=1$.
Def.10. If $X_1$ and $X_2$ are both continuous, the joint probability density function of $\mathbf{X}$ satisfies
$$P(\mathbf{X} \in W)=\iint\limits_{\mathbf{x} \in W} f(\mathbf{x})\, d \mathbf{x}=\iint\limits_{(x_{1}, x_{2})^T \in W} f\left(x_{1}, x_{2}\right)\, d x_{1}\, d x_{2}$$
where $W \subset \mathbb{R}^2$, $f(\mathbf{x})\ge0$, and $\iint\limits_{\mathbf{x}\in\mathbb{R}^2}f(\mathbf{x})\,d\mathbf{x}=1$.
Def.11. The joint cumulative distribution function $F$ of $\mathbf{X}$ is
$$F(x_1,x_2)=P(X_1\le x_1 \text{ and } X_2\le x_2)=P(\mathbf{X}\le\mathbf{x})$$
Def.12. $X_1$ and $X_2$ are independent if for all $W_1\subset \mathbb{R}$ and $W_2\subset \mathbb{R}$,
$$P(X_1\in W_1 \text{ and } X_2\in W_2)=P(X_1\in W_1)\cdot P(X_2\in W_2)$$
Prop. If $X_1$ and $X_2$ are independent, then
$$F(x_1,x_2)=F_1(x_1)\cdot F_2(x_2),\qquad f(x_1,x_2)=f_1(x_1)\cdot f_2(x_2)$$
where $F_i$ is the CDF of $X_i$ and $f_i$ is the PMF or PDF of $X_i$.
1.4.2 Multivariate Random Variables
All definitions of Section 1.4.1 generalize in the obvious way.
1.4.3 Random Samples and Statistics
Def.13. Given a random variable $X$, a random sample of size $n$ from $X$ is a collection of $n$ independent and identically distributed random variables $S_1,S_2,\cdots,S_n$ (each with the same PMF or PDF as $X$).
Def.14. A statistic $\hat{\theta}$ is a function of a random sample, $\hat{\theta}:(S_1,S_2,\cdots,S_n)\to \mathbb{R}$.
Note: $\hat{\theta}$ is itself a random variable.
Chapter 2: Numeric Attributes
We focus on the algebraic, geometric, and statistical viewpoints.
2.1 Univariate Analysis
Consider a single attribute, $\mathbf{D}=\left(\begin{array}{c} X \\ \hline x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}\right),\ x_i\in\mathbb{R}$
Statistical view: $X$ is treated as a random variable, each observed value $x_i$ as an identity random variable, and $x_1,\cdots,x_n$ as a random sample of size $n$ drawn from $X$.
Def.1. The empirical cumulative distribution function.
Def.2. The inverse cumulative distribution function.
Def.3. The empirical probability mass function of a random variable $X$ is
$$\hat{f}(x)=\frac{1}{n} \sum_{i=1}^{n} I\left(x_{i} = x\right),\ \forall x \in \mathbb{R},\qquad I\left(x_{i} = x\right)=\left\{\begin{matrix} 1, & x_i=x\\ 0, & x_i\ne x \end{matrix}\right.$$
2.1.1 Measures of Central Tendency
Def.4. The expectation of a discrete random variable $X$ is $\mu:=E(X) = \sum\limits_{x} xf(x)$, where $f(x)$ is the PMF of $X$.
The expectation of a continuous random variable $X$ is $\mu:=E(X) = \int_{-\infty}^{+\infty} xf(x)\,dx$, where $f(x)$ is the PDF of $X$.
Note: $E(aX+bY)=aE(X)+bE(Y)$
Def.5. The sample mean of $X$ is $\hat{\mu}=\frac{1}{n} \sum\limits_{i=1}^{n}x_i$; note that $\hat{\mu}$ is an estimator of $\mu$.
Def.6. An estimator (statistic) $\hat{\theta}$ is called an unbiased estimator of a parameter $\theta$ if $E(\hat{\theta})=\theta$.
Prove yourself: the sample mean $\hat{\mu}$ is an unbiased estimator of the expectation $\mu$, using $E(x_i)=\mu$ for all $x_i$.
Def.7. An estimator is robust if it is not affected by extreme values in the sample. (The sample mean is not robust.)
Def.8. The median of a random variable $X$
Def.9. The sample median of a random variable $X$
Def.10. The mode and the sample mode of a random variable $X$
2.1.2 Measures of Dispersion
Def.11. The range and the sample range of a random variable $X$
Def.12. The interquartile range of a random variable $X$, and the sample interquartile range
Def.13. The variance of a random variable $X$ is
$$\sigma^{2}=\operatorname{var}(X)=E\left[(X-\mu)^{2}\right]=\left\{\begin{array}{ll} \sum_{x}(x-\mu)^{2} f(x) & \text{ if } X \text{ is discrete } \\ \\ \int_{-\infty}^{\infty}(x-\mu)^{2} f(x)\, d x & \text{ if } X \text{ is continuous } \end{array}\right.$$
The standard deviation $\sigma$ is the positive square root of $\sigma^2$.
Note: the variance is the second moment about the mean; the $r$-th moment about the mean is $E[(X-\mu)^r]$.
Properties:
- $\sigma^2=E(X^2)-\mu^2=E(X^2)-[E(X)]^2$
- $var(X_1+X_2)=var(X_1)+var(X_2)$ when $X_1,X_2$ are independent
Def.14. The sample variance is $\hat{\sigma}^{2}=\frac{1}{n} \sum\limits_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{2}$; note the denominator is $n$, not $n-1$.
Geometric meaning of the sample variance: consider the centered data vector
$$C:=\left(\begin{array}{c} x_{1}-\hat{\mu} \\ x_{2}-\hat{\mu} \\ \vdots \\ x_{n}-\hat{\mu} \end{array}\right),\qquad n\cdot \hat{\sigma}^2=\sum\limits_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{2}=\|C\|^2$$
Question: what are the expectation and variance of the sample mean of $X$?
$$E(\hat{\mu})=E\left(\frac{1}{n} \sum\limits_{i=1}^{n}x_i\right)=\frac{1}{n} \sum\limits_{i=1}^{n} E(x_i)=\frac{1}{n}\sum\limits_{i=1}^{n}\mu=\mu$$
For the variance there are two approaches: expand directly, or use the fact that $x_1,\cdots,x_n$ are independent and identically distributed:
$$var\left(\sum\limits_{i=1}^{n}x_i\right)=\sum\limits_{i=1}^{n}var(x_i)=n\cdot \sigma^2\Longrightarrow var(\hat{\mu})=\frac{\sigma^2}{n}$$
Note: the sample variance is a biased estimator, since
$$E(\hat{\sigma}^2)=\left(\frac{n-1}{n}\right)\sigma^2\xrightarrow{n\to +\infty}\sigma^2$$
so it is only asymptotically unbiased. A quick simulation follows below.
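A small simulation sketch (sample size and distribution chosen arbitrarily) illustrating that the $1/n$ sample variance underestimates $\sigma^2$ by the factor $(n-1)/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma2 = 5, 200_000, 4.0   # a small n makes the bias visible

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)
var_hat = ((samples - mu_hat) ** 2).mean(axis=1)   # 1/n sample variance

print(var_hat.mean())          # ≈ (n-1)/n * sigma2 = 3.2
print((n - 1) / n * sigma2)    # 3.2
```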
2.2 Bivariate Analysis
Omitted.
2.3 Multivariate Analysis
$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1 d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2 d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n 1} & x_{n 2} & \cdots & x_{n d} \end{array}\right)$$
This can be viewed as $\mathbf{X}=(X_1,\cdots,X_d)^T$.
Def.15. For a random vector $\mathbf{X}$, the mean vector is $E[\mathbf{X}]=\left(\begin{array}{c} E\left[X_{1}\right] \\ E\left[X_{2}\right] \\ \vdots \\ E\left[X_{d}\right] \end{array}\right)$
The sample mean is $\hat{\boldsymbol{\mu}}=\frac{1}{n} \sum\limits_{i=1}^{n} \mathbf{x}_{i}\ (=mean(\mathbf{D})) \in \mathbb{R}^{d}$
Def.16. For $X_1,X_2$, the covariance is defined as $\sigma_{12}=E[(X_1-E(X_1))(X_2-E(X_2))]=E(X_1X_2)-E(X_1)E(X_2)$
Remark:
- $\sigma_{12}=\sigma_{21}$
- If the two are independent, then $\sigma_{12}=0$
Def.17. For a random vector $\mathbf{X}=(X_1,\cdots,X_d)^T$, the covariance matrix is defined as
$$\boldsymbol{\Sigma}=E\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^{T}\right]=\left(\begin{array}{cccc} \sigma_{1}^{2} & \sigma_{12} & \cdots & \sigma_{1 d} \\ \sigma_{21} & \sigma_{2}^{2} & \cdots & \sigma_{2 d} \\ \cdots & \cdots & \cdots & \cdots \\ \sigma_{d 1} & \sigma_{d 2} & \cdots & \sigma_{d}^{2} \end{array}\right)_{d\times d}$$
It is symmetric; the generalized variance of $\mathbf{X}$ is defined as $det(\boldsymbol{\Sigma})$.
Notes:
- $\boldsymbol{\Sigma}$ is real symmetric and positive semidefinite, i.e., all its eigenvalues are nonnegative: $\lambda_1\ge \lambda_2 \ge \cdots \ge\lambda_d \ge 0$
- $var(\mathbf{D})=tr(\boldsymbol{\Sigma})=\sigma_1^2+\cdots+\sigma_d^2$
Def.18. For $\mathbf{X}=(X_1,\cdots,X_d)^T$, the sample covariance matrix is defined as
$$\hat{\boldsymbol{\Sigma}}=\frac{1}{n}\left(\mathbf{Z}^{T} \mathbf{Z}\right)=\frac{1}{n}\left(\begin{array}{cccc} Z_{1}^{T} Z_{1} & Z_{1}^{T} Z_{2} & \cdots & Z_{1}^{T} Z_{d} \\ Z_{2}^{T} Z_{1} & Z_{2}^{T} Z_{2} & \cdots & Z_{2}^{T} Z_{d} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{d}^{T} Z_{1} & Z_{d}^{T} Z_{2} & \cdots & Z_{d}^{T} Z_{d} \end{array}\right)_{d\times d}$$
where $\mathbf{Z}$ is the centered data matrix, with rows $\mathbf{z}_{i}^{T}$ (the centered points) and columns $Z_{j}$ (the centered attributes):
$$\mathbf{Z}=\mathbf{D}-\mathbf{1} \cdot \hat{\boldsymbol{\mu}}^{T}=\left(\begin{array}{c} \mathbf{x}_{1}^{T}-\hat{\boldsymbol{\mu}}^{T} \\ \mathbf{x}_{2}^{T}-\hat{\boldsymbol{\mu}}^{T} \\ \vdots \\ \mathbf{x}_{n}^{T}-\hat{\boldsymbol{\mu}}^{T} \end{array}\right)=\left(\begin{array}{ccc} -& \mathbf{z}_{1}^{T} & - \\ -& \mathbf{z}_{2}^{T} & - \\ & \vdots \\ -& \mathbf{z}_{n}^{T} & - \end{array}\right)=\left(\begin{array}{cccc} \mid & \mid & & \mid \\ Z_{1} & Z_{2} & \cdots & Z_{d} \\ \mid & \mid & & \mid \end{array}\right)$$
The sample total variance is $tr(\hat{\boldsymbol{\Sigma}})$ and the generalized sample variance is $det(\hat{\boldsymbol{\Sigma}})\ge0$.
Equivalently, in terms of the centered points, $\hat{\boldsymbol{\Sigma}}=\frac{1}{n}\sum\limits_{i=1}^n\mathbf{z}_{i}\mathbf{z}_{i}^T$.
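A brief numpy sketch of Def.18 (reusing the toy data from Section 1.3.2):

```python
import numpy as np

D = np.array([[5.9, 3.0],
              [6.1, 2.8],
              [5.5, 2.4],
              [6.0, 3.4]])
n = D.shape[0]

mu = D.mean(axis=0)
Z = D - mu                     # Z = D - 1·mu^T, the centered data matrix
Sigma_hat = (Z.T @ Z) / n      # sample covariance matrix (1/n convention)

# Outer-product form: (1/n) * sum_i z_i z_i^T
Sigma_outer = sum(np.outer(z, z) for z in Z) / n
assert np.allclose(Sigma_hat, Sigma_outer)

# Sample total variance = trace of Sigma_hat
total_var = np.sum((D - mu) ** 2) / n
assert np.isclose(total_var, np.trace(Sigma_hat))
```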
Chapter 5: Kernel Methods
Example 5.1 omitted; $\phi$ (the kernel map) maps $\Sigma^*$ (the input space) to $\mathbb{R}^4$ (the feature space).
Def.1. Given a kernel map $\phi:\mathcal{I}\to \mathcal{F}$, the kernel function of $\phi$ is the function $K:\mathcal{I}\times\mathcal{I}\to \mathbb{R}$ such that $\forall (\mathbf{x}_i,\mathbf{x}_j)\in \mathcal{I}\times\mathcal{I},\ K(\mathbf{x}_i,\mathbf{x}_j)=\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)$
Example 5.2 Let $\phi:\mathbb{R}^2\to \mathbb{R}^3$ be such that $\forall \mathbf{a}=(a_1,a_2)^T,\ \phi(\mathbf{a})=(a_1^2,a_2^2,\sqrt2\, a_1a_2)^T$
Note that $K(\mathbf{a},\mathbf{b})=\phi(\mathbf{a})^T\phi(\mathbf{b})=a_1^2b_1^2+a_2^2b_2^2+2a_1a_2b_1b_2=(\mathbf{a}^T\mathbf{b})^2$, so $K:\mathbb{R}^2\times\mathbb{R}^2\to \mathbb{R}$ can be evaluated without computing $\phi$.
Remark:
- Kernels allow the analysis of complex data.
- Kernels capture nonlinear features (for intuition, search Zhihu for what kernel functions are for).
Goal: analyze the structure of the feature space $\mathcal{F}$ through $K$ alone, without knowing $\phi$ explicitly.
5.1 The Kernel Matrix
Let $\mathbf{D}=\left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{n}\right\} \subset \mathcal{I}$; its kernel matrix is defined as $\mathbf{K}=[K(\mathbf{x}_{i},\mathbf{x}_{j})]_{n\times n}$
Prop. The kernel matrix $\mathbf{K}$ is symmetric and positive semidefinite.
Proof. $K(\mathbf{x}_{i},\mathbf{x}_{j})=\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)=\phi^T(\mathbf{x}_j)\phi(\mathbf{x}_i)=K(\mathbf{x}_{j},\mathbf{x}_{i})$, so it is symmetric.
For any $\mathbf{a}\in \mathbb{R}^n$,
$$\begin{aligned} \mathbf{a}^{T} \mathbf{K a} &=\sum_{i=1}^{n} \sum_{j=1}^{n} a_{i} a_{j} K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) \\ &=\sum_{i=1}^{n} \sum_{j=1}^{n} a_{i} a_{j} \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) \\ &=\left(\sum_{i=1}^{n} a_{i} \phi\left(\mathbf{x}_{i}\right)\right)^{T}\left(\sum_{j=1}^{n} a_{j} \phi\left(\mathbf{x}_{j}\right)\right) \\ &=\left\|\sum_{i=1}^{n} a_{i} \phi\left(\mathbf{x}_{i}\right)\right\|^2 \geq 0 \end{aligned}$$
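A short numpy check of this proposition, using the homogeneous quadratic kernel from Example 5.2 (data values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))       # 6 points in R^2

K = (X @ X.T) ** 2                # K(a, b) = (a^T b)^2, evaluated without phi

assert np.allclose(K, K.T)        # symmetric
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10     # all eigenvalues nonnegative (up to round-off)
```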
5.1.1 Reconstructing the Kernel Map
"Empirical kernel map"
Given: $\mathbf{D}=\left\{\mathbf{x}_{i}\right\}_{i=1}^{n} \subset \mathcal{I}$ and the kernel matrix $\mathbf{K}$.
Goal: find $\phi:\mathcal{I} \to \mathcal{F} \subset \mathbb{R}^n$.
First attempt: $\forall \mathbf{x} \in \mathcal{I},\ \phi(\mathbf{x})=\left(K\left(\mathbf{x}_{1}, \mathbf{x}\right), K\left(\mathbf{x}_{2}, \mathbf{x}\right), \ldots, K\left(\mathbf{x}_{n}, \mathbf{x}\right)\right)^{T} \in \mathbb{R}^{n}$
Check: does $\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j) \overset{?}{=} K(\mathbf{x}_{i},\mathbf{x}_{j})$ hold?
The left side is $\phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right)=\sum\limits_{k=1}^{n} K\left(\mathbf{x}_{k}, \mathbf{x}_{i}\right) K\left(\mathbf{x}_{k}, \mathbf{x}_{j}\right)=\mathbf{K}_{i}^{T} \mathbf{K}_{j}$, where $\mathbf{K}_{i}$ denotes the $i$-th row (equivalently column) of $\mathbf{K}$; requiring this to equal $K(\mathbf{x}_i,\mathbf{x}_j)$ is too strong.
Improvement: find a matrix $\mathbf{A}$ such that $\mathbf{K}_{i}^{T} \mathbf{A} \mathbf{K}_{j}=K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)$, i.e., $\mathbf{K}^{T} \mathbf{A} \mathbf{K}=\mathbf{K}$
So it suffices to take $\mathbf{A}=\mathbf{K}^{-1}$ (assuming $\mathbf{K}$ is invertible).
If $\mathbf{K}$ is positive definite, $\mathbf{K}^{-1}$ is also positive definite, so there exists a real matrix $\mathbf{B}$ with $\mathbf{K}^{-1}=\mathbf{B}^{T}\mathbf{B}$
The empirical kernel map can therefore be defined as
$$\phi(\mathbf{x})=\mathbf{B}\cdot\left(K\left(\mathbf{x}_{1}, \mathbf{x}\right), K\left(\mathbf{x}_{2}, \mathbf{x}\right), \ldots, K\left(\mathbf{x}_{n}, \mathbf{x}\right)\right)^{T}$$
Check:
$$\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)=(\mathbf{B}\mathbf{K}_i)^T(\mathbf{B}\mathbf{K}_j)=\mathbf{K}_i^T\mathbf{K}^{-1}\mathbf{K}_j=(\mathbf{K}^T\mathbf{K}^{-1}\mathbf{K})_{i,j}=K(\mathbf{x}_{i},\mathbf{x}_{j})$$
5.1.2 Data-Specific Mercer Kernel Map
For a symmetric positive semidefinite matrix $\mathbf{K}_{n\times n}$ there exists the decomposition
$$\mathbf{K}=\mathbf{U}\left(\begin{array}{cccc} \lambda_{1} & 0 & \cdots & 0 \\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{n} \end{array}\right)\mathbf{U}^{T}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}$$
where the $\lambda_{i}$ are the eigenvalues, $\mathbf{U}=\left(\begin{array}{cccc} \mid & \mid & & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} & \cdots & \mathbf{u}_{n} \\ \mid & \mid & & \mid \end{array}\right)$ is an orthonormal matrix, and $\mathbf{u}_{i}=\left(u_{i 1}, u_{i 2}, \ldots, u_{i n}\right)^{T} \in \mathbb{R}^{n}$ are the corresponding eigenvectors. Equivalently,
$$\mathbf{K}=\lambda_{1} \mathbf{u}_{1} \mathbf{u}_{1}^{T}+\lambda_{2} \mathbf{u}_{2} \mathbf{u}_{2}^{T}+\cdots+\lambda_{n} \mathbf{u}_{n} \mathbf{u}_{n}^{T}$$
$$K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) =\lambda_{1} u_{1 i} u_{1 j}+\lambda_{2} u_{2 i} u_{2 j}+\cdots+\lambda_{n} u_{n i} u_{n j} =\sum_{k=1}^{n} \lambda_{k} u_{k i} u_{k j}$$
Define the Mercer map:
$$\forall \mathbf{x}_i \in \mathbf {D},\quad \phi\left(\mathbf{x}_{i}\right)=\left(\sqrt{\lambda_{1}} u_{1 i}, \sqrt{\lambda_{2}} u_{2 i}, \ldots, \sqrt{\lambda_{n}} u_{n i}\right)^{T}$$
Check:
$$\begin{aligned} \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) &=\left(\sqrt{\lambda_{1}} u_{1 i}, \ldots, \sqrt{\lambda_{n}} u_{n i}\right)\left(\sqrt{\lambda_{1}} u_{1 j}, \ldots, \sqrt{\lambda_{n}} u_{n j}\right)^{T} \\ &=\lambda_{1} u_{1 i} u_{1 j}+\cdots+\lambda_{n} u_{n i} u_{n j}=K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) \end{aligned}$$
Note: the Mercer map is defined only for the points $\mathbf{x}_i$ in $\mathbf{D}$. A small numeric sketch follows below.
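A compact numpy sketch of the Mercer map construction (reusing the quadratic-kernel matrix from the earlier check):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = (X @ X.T) ** 2                   # a PSD kernel matrix

lam, U = np.linalg.eigh(K)           # K = U diag(lam) U^T
lam = np.clip(lam, 0.0, None)        # clip tiny negative round-off

Phi = U * np.sqrt(lam)               # row i is phi(x_i) = (sqrt(lam_k) * u_{ki})_k
assert np.allclose(Phi @ Phi.T, K)   # phi(x_i)^T phi(x_j) = K(x_i, x_j)
```

Note that `eigh` returns the eigenvectors as columns of `U`, so `U[i, k]` is the $i$-th component of $\mathbf{u}_k$, matching $u_{ki}$ in the formula above.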
5.2 Vector Kernels
These are kernels $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$.
A typical vector kernel is the polynomial kernel:
$$\forall \mathbf{x},\mathbf{y} \in \mathbb {R}^d,\quad K_{q}(\mathbf{x}, \mathbf{y})=\phi(\mathbf{x})^{T} \phi(\mathbf{y})=\left(\mathbf{x}^{T} \mathbf{y} + c \right)^{q},\quad c\ge 0$$
If $c=0$ the kernel is homogeneous, otherwise inhomogeneous.
Problem: construct a kernel map $\phi:\mathbb{R}^d \to \mathcal{F}$ such that $K_{q}(\mathbf{x}, \mathbf{y})=\phi(\mathbf{x})^{T} \phi(\mathbf{y})$
Note: for $q=1,\ c=0$, simply $\phi (\mathbf{x})=\mathbf{x}$.
Worked case: $q=2,\ d=2$ (see the expansion below).
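Expanding the homogeneous case ($c=0$) directly:

$$
K_2(\mathbf{x},\mathbf{y})=(\mathbf{x}^T\mathbf{y})^2=(x_1y_1+x_2y_2)^2
=x_1^2y_1^2+x_2^2y_2^2+2x_1x_2\,y_1y_2
=\phi(\mathbf{x})^T\phi(\mathbf{y}),
\qquad
\phi(\mathbf{x})=\left(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1x_2\right)^T
$$

which recovers the map of Example 5.2; for $c>0$ the same expansion adds the features $\sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2$ and the constant $\,c$ to the feature vector.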
The Gaussian kernel: read on your own.
5.3 Basic Kernel Operations in Feature Space
Given $\phi:\mathcal{I} \to \mathcal{F}$ and $K:\mathcal{I} \times \mathcal{I}\to \mathbb{R}$:
- Vector length: $\|\phi(\mathbf{x})\|^{2}=\phi(\mathbf{x})^{T} \phi(\mathbf{x})=K(\mathbf{x}, \mathbf{x})$
- Distance:
$$\begin{aligned} \left\|\phi\left(\mathbf{x}_{i}\right)-\phi\left(\mathbf{x}_{j}\right)\right\|^{2} &=\left\|\phi\left(\mathbf{x}_{i}\right)\right\|^{2}+\left\|\phi\left(\mathbf{x}_{j}\right)\right\|^{2}-2 \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) \\ &=K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)+K\left(\mathbf{x}_{j}, \mathbf{x}_{j}\right)-2 K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) \end{aligned}$$
Rearranged, $2 K\left(\mathbf{x}, \mathbf{y}\right)=\left\|\phi\left(\mathbf{x}\right)\right\|^{2}+\left\|\phi\left(\mathbf{y}\right)\right\|^{2}-\left\|\phi\left(\mathbf{x}\right)-\phi\left(\mathbf{y}\right)\right\|^{2}$, so $K$ expresses the similarity between $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$.
- Mean: $\boldsymbol{\mu}_{\phi}=\frac{1}{n} \sum\limits_{i=1}^{n} \phi\left(\mathbf{x}_{i}\right)$, whose squared norm is computable from $K$ alone:
$$\begin{aligned} \left\|\boldsymbol{\mu}_{\phi}\right\|^{2} &=\boldsymbol{\mu}_{\phi}^{T} \boldsymbol{\mu}_{\phi} =\left(\frac{1}{n} \sum\limits_{i=1}^{n} \phi\left(\mathbf{x}_{i}\right)\right)^{T}\left(\frac{1}{n} \sum\limits_{j=1}^{n} \phi\left(\mathbf{x}_{j}\right)\right) =\frac{1}{n^{2}} \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) =\frac{1}{n^{2}} \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) \end{aligned}$$
- Total variance: $\sigma_{\phi}^{2}=\frac{1}{n} \sum\limits_{i=1}^{n}\left\|\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}\right\|^{2}$. For each $\mathbf{x}_{i}$,
$$\begin{aligned} \left\|\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}\right\|^{2} &=\left\|\phi\left(\mathbf{x}_{i}\right)\right\|^{2}-2 \phi\left(\mathbf{x}_{i}\right)^{T} \boldsymbol{\mu}_{\phi}+\left\|\boldsymbol{\mu}_{\phi}\right\|^{2} \\ &=K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)-\frac{2}{n} \sum_{j=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)+\frac{1}{n^{2}} \sum_{s=1}^{n} \sum_{t=1}^{n} K\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right) \end{aligned}$$
so
$$\begin{aligned} \sigma_{\phi}^{2} &=\frac{1}{n} \sum\limits_{i=1}^{n}\left(K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)-\frac{2}{n} \sum\limits_{j=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)+\frac{1}{n^{2}} \sum\limits_{s=1}^{n} \sum\limits_{t=1}^{n} K\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right)\right)\\ &=\frac{1}{n} \sum\limits_{i=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)-\frac{2}{n^{2}} \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)+\frac{1}{n^{2}} \sum\limits_{s=1}^{n} \sum\limits_{t=1}^{n} K\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right)\\ &=\frac{1}{n} \sum\limits_{i=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)-\frac{1}{n^{2}} \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) \end{aligned}$$
Here $\frac{1}{n} \sum\limits_{i=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)$ is the average of the diagonal of $\mathbf{K}$, and $\frac{1}{n^{2}} \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)$ is the average over all entries of $\mathbf{K}$.
- Centered kernel matrix: let $\hat{\phi}\left(\mathbf{x}_{i}\right)=\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}$ and define the centered kernel function $\hat{K}\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) =\hat{\phi}\left(\mathbf{x}_{i}\right)^{T} \hat{\phi}\left(\mathbf{x}_{j}\right)$. Then
$$\begin{aligned} \hat{K}\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right) &=\left(\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}\right)^{T}\left(\phi\left(\mathbf{x}_{j}\right)-\boldsymbol{\mu}_{\phi}\right) \\ &=\phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right)-\phi\left(\mathbf{x}_{i}\right)^{T} \boldsymbol{\mu}_{\phi}-\phi\left(\mathbf{x}_{j}\right)^{T} \boldsymbol{\mu}_{\phi}+\boldsymbol{\mu}_{\phi}^{T} \boldsymbol{\mu}_{\phi}\\ &=K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)-\frac{1}{n} \sum_{k=1}^{n} K\left(\mathbf{x}_{i}, \mathbf{x}_{k}\right)-\frac{1}{n} \sum_{k=1}^{n} K\left(\mathbf{x}_{j}, \mathbf{x}_{k}\right)+\frac{1}{n^{2}} \sum_{s=1}^{n} \sum_{t=1}^{n} K\left(\mathbf{x}_{s}, \mathbf{x}_{t}\right) \end{aligned}$$
In matrix form,
$$\begin{aligned} \hat{\mathbf{K}} &=\mathbf{K}-\frac{1}{n} \mathbf{1}_{n \times n} \mathbf{K}-\frac{1}{n} \mathbf{K} \mathbf{1}_{n \times n}+\frac{1}{n^{2}} \mathbf{1}_{n \times n} \mathbf{K} \mathbf{1}_{n \times n} \\ &=\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) \mathbf{K}\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) \end{aligned}$$
Note: $\mathbf{1}_{n \times n}$ is the all-ones matrix; left-multiplying by it replaces each entry with its column sum, right-multiplying replaces each entry with its row sum, and doing both replaces each entry with the sum of all entries.
- Normalized kernel matrix:
$$\mathbf{K}_{n}\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)=\frac{\phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right)}{\left\|\phi\left(\mathbf{x}_{i}\right)\right\| \cdot\left\|\phi\left(\mathbf{x}_{j}\right)\right\|}=\frac{K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)}{\sqrt{K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right) \cdot K\left(\mathbf{x}_{j}, \mathbf{x}_{j}\right)}}$$
Let $\mathbf{W}=\operatorname{diag}(\mathbf{K})$ be the diagonal matrix with entries $K(\mathbf{x}_i,\mathbf{x}_i)$; then $\mathbf{W}^{-1/2}$ has diagonal entries $\frac{1}{\sqrt{K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)}}$ and
$$\mathbf{K}_{n}=\mathbf{W}^{-1 / 2} \cdot \mathbf{K} \cdot \mathbf{W}^{-1 / 2}$$
This uses the usual effect of multiplying a matrix by diagonal matrices on the left and right.
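A numpy sketch of centering and normalizing a kernel matrix (reusing the quadratic kernel as the running example, where $\phi$ is known and can be used as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
K = (X @ X.T) ** 2
n = K.shape[0]

# Centered kernel matrix: (I - J/n) K (I - J/n), J = all-ones matrix
C = np.eye(n) - np.ones((n, n)) / n
K_hat = C @ K @ C
# Cross-check against explicit feature-space centering (phi is known here)
Phi = np.stack([np.array([x1*x1, x2*x2, np.sqrt(2)*x1*x2]) for x1, x2 in X])
Phi_c = Phi - Phi.mean(axis=0)
assert np.allclose(K_hat, Phi_c @ Phi_c.T)

# Normalized kernel matrix: K_n = W^{-1/2} K W^{-1/2}
w = 1.0 / np.sqrt(np.diag(K))
K_norm = K * np.outer(w, w)
assert np.allclose(np.diag(K_norm), 1.0)   # unit self-similarity on the diagonal
```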
5.4 Kernels for Complex Objects
5.4.1 Spectrum Kernel for Strings
Consider a (finite) alphabet $\Sigma$, and define the $l$-spectrum feature map
$$\phi_l: \Sigma^{*} \rightarrow \mathbb{R}^{|\Sigma|^l},\quad \forall \mathbf{x} \in \Sigma^{*},\ \phi_l(\mathbf{x})=\left ( \cdots,\#(\alpha),\cdots \right )^T$$
with one coordinate per string $\alpha$ of length $l$, where $\#(\alpha)$ denotes the number of occurrences of $\alpha$ as a substring of $\mathbf{x}$.
The $l$-spectrum kernel is $K_l:\Sigma^{*} \times \Sigma^{*} \to \mathbb{R}$, $K_l (\mathbf{x},\mathbf{y})=\phi_l(\mathbf{x})^{T} \phi_l(\mathbf{y})$
The full spectrum kernel sums the $l$-spectrum contributions over all $l$ from $0$ to $\infty$.
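A short Python sketch of the $l$-spectrum kernel via substring counts (example strings are arbitrary):

```python
from collections import Counter

def spectrum_counts(x: str, l: int) -> Counter:
    """Count every length-l substring of x (the l-spectrum feature map, stored sparsely)."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))

def spectrum_kernel(x: str, y: str, l: int) -> int:
    """K_l(x, y) = dot product of the two (sparse) count vectors."""
    cx, cy = spectrum_counts(x, l), spectrum_counts(y, l)
    return sum(cx[a] * cy[a] for a in cx.keys() & cy.keys())

print(spectrum_kernel("ababc", "abcab", 2))   # 2*2 (ab) + 1*1 (bc) = 5
```

The sparse `Counter` representation avoids materializing the full $|\Sigma|^l$-dimensional feature vector.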
5.4.2 Diffusion Kernels on Graph Vertices
- Graph: a graph $G=(V,E)$ is a pair of sets, where $V=\{v_1,\cdots,v_n\}$ is the vertex set and $E=\{(v_i,v_j)\}$ is the edge set. We consider only undirected simple graphs (no self-loops).
- Adjacency matrix: $A(G):=[A_{ij}]_{n\times n}$, where $A_{ij}=\left\{\begin{matrix} 1, & (v_i,v_j)\in E\\ 0, & (v_i,v_j) \notin E \end{matrix}\right.$
- Degree matrix: $\Delta (G):=diag(d_1,\cdots,d_n)$, where $d_i$ is the degree of vertex $v_i$, i.e., the number of edges incident to $v_i$.
- Laplacian matrix: $L(G):=\Delta(G)-A(G)$; the negated Laplacian is $-L(G)=A(G)-\Delta(G)$.
Both are real symmetric matrices.
The symmetric similarity matrix $\mathbf{S}$ of a graph is commonly taken to be $A(G)$, $L(G)$, or $-L(G)$.
Question: how can a kernel be defined on the vertices of a graph? (Note that $\mathbf{S}$ is not necessarily positive semidefinite.)
- Power kernel
Take $\mathbf{S}^t$ as the kernel matrix, where $\mathbf{S}$ is symmetric and $t$ is a positive integer.
Consider $\mathbf{S}^2$: $(\mathbf{S}^2)_{ij}=\sum\limits_{k=1}^{n}S_{ik}S_{kj}$
This formula reveals the combinatorial meaning of $\mathbf{S}^2$ (and of $\mathbf{S}^l$): it aggregates walks of length $2$ (resp. $l$) between vertices, and thereby describes vertex similarity.
Consider the eigenvalues of $\mathbf{S}^l$: let the eigenvalues of $\mathbf{S}$ be $\lambda_1,\cdots,\lambda_n \in \mathbb{R}$; then
$$\mathbf{S}=\mathbf{U}\left(\begin{array}{cccc} \lambda_{1} & 0 & \cdots & 0 \\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{n} \end{array}\right)\mathbf{U}^{T}=\sum_{i=1}^{n} \lambda_{i}\mathbf{u}_{i} \mathbf{u}_{i}^{T}$$
where $\mathbf{U}=\left(\begin{array}{cccc} \mid & \mid & & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} & \cdots & \mathbf{u}_{n} \\ \mid & \mid & & \mid \end{array}\right)$ is the orthogonal matrix whose columns are the corresponding eigenvectors, and
$$\mathbf{S}^l=\mathbf{U}\left(\begin{array}{cccc} \lambda_{1}^l & 0 & \cdots & 0 \\ 0 & \lambda_{2}^l & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{n}^l \end{array}\right)\mathbf{U}^{T}$$
so $\lambda_1^l,\cdots,\lambda_n^l$ are the eigenvalues of $\mathbf{S}^l$.
Hence if $l$ is even, $\mathbf{S}^l$ is positive semidefinite.
- Exponential diffusion kernel
Take $\mathbf{K}:=e^{\beta \mathbf{S}}$ as the kernel matrix, where $\beta >0$ is a damping coefficient. (Taylor expansion: $e^{\beta x}=\sum\limits_{l=0}^{\infty}\frac{1}{l!}\beta^{l}x^l$.)
$$\begin{aligned} e^{\beta \mathbf{S}} &=\sum_{l=0}^{\infty} \frac{1}{l !} \beta^{l} \mathbf{S}^{l} \\ &=\mathbf{I}+\beta \mathbf{S}+\frac{1}{2 !} \beta^{2} \mathbf{S}^{2}+\frac{1}{3 !} \beta^{3} \mathbf{S}^{3}+\cdots\\ &=\left(\sum_{i=1}^{n} \mathbf{u}_{i} \mathbf{u}_{i}^{T}\right)+\left(\sum_{i=1}^{n} \mathbf{u}_{i} \beta \lambda_{i} \mathbf{u}_{i}^{T}\right)+\left(\sum_{i=1}^{n} \mathbf{u}_{i} \frac{1}{2 !} \beta^{2} \lambda_{i}^{2} \mathbf{u}_{i}^{T}\right)+\cdots \\ &=\sum_{i=1}^{n} \mathbf{u}_{i}\left(1+\beta \lambda_{i}+\frac{1}{2 !} \beta^{2} \lambda_{i}^{2}+\cdots\right) \mathbf{u}_{i}^{T} \\ &=\sum_{i=1}^{n} \mathbf{u}_{i} e ^{\beta \lambda_{i}} \mathbf{u}_{i}^{T} \\ &=\mathbf{U}\left(\begin{array}{cccc} e ^{\beta \lambda_{1}} & 0 & \cdots & 0 \\ 0 & e ^{\beta \lambda_{2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e ^{\beta \lambda_{n}} \end{array}\right) \mathbf{U}^{T} \end{aligned}$$
Hence the eigenvalues of $\mathbf{K}$, namely $e ^{\beta \lambda_{1}},\cdots,e ^{\beta \lambda_{n}}$, are all positive, so $\mathbf{K}$ is positive semidefinite.
- Von Neumann diffusion kernel
Take $\mathbf{K}=\sum\limits_{l=0}^{\infty} \beta^l \mathbf{S}^l$ as the kernel matrix. Note that
$$\begin{aligned} \mathbf{K} &=\mathbf{I}+\beta \mathbf{S}+\beta^{2} \mathbf{S}^{2}+\beta^{3} \mathbf{S}^{3}+\cdots \\ &=\mathbf{I}+\beta \mathbf{S}\left(\mathbf{I}+\beta \mathbf{S}+\beta^{2} \mathbf{S}^{2}+\cdots\right) \\ &=\mathbf{I}+\beta \mathbf{S} \mathbf{K} \end{aligned}$$
so
$$\begin{aligned} \mathbf{K}-\beta \mathbf{S} \mathbf{K} &=\mathbf{I} \\ (\mathbf{I}-\beta \mathbf{S}) \mathbf{K} &=\mathbf{I} \\ \mathbf{K} &=(\mathbf{I}-\beta \mathbf{S})^{-1} \end{aligned}$$
provided $\mathbf{I}-\beta \mathbf{S}$ is invertible. Then
$$\begin{aligned} \mathbf{K} &=\left(\mathbf{U} \mathbf{U}^{T}-\mathbf{U}(\beta \mathbf{\Lambda}) \mathbf{U}^{T}\right)^{-1} \\ &=\left(\mathbf{U}(\mathbf{I}-\beta \mathbf{\Lambda}) \mathbf{U}^{T}\right)^{-1} \\ &=\mathbf{U}(\mathbf{I}-\beta \mathbf{\Lambda})^{-1} \mathbf{U}^{T}\\ &=\mathbf{U}\left(\begin{array}{cccc} \frac{1}{1-\beta\lambda_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{1-\beta\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{1-\beta\lambda_n} \end{array}\right) \mathbf{U}^{T} \end{aligned}$$
For $\mathbf{K}$ to be positive semidefinite we therefore need, for every $i$,
$$\begin{aligned} \left(1-\beta \lambda_{i}\right)^{-1} & \geq 0 \\ 1-\beta \lambda_{i} & \geq 0 \\ \beta \lambda_{i} & \leq 1 \end{aligned}$$
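A numpy sketch computing both diffusion kernels from the eigendecomposition of a small graph's adjacency matrix (the graph is made up; $\beta$ is chosen small enough that $\mathbf{I}-\beta\mathbf{S}$ is invertible and $\beta\lambda_i<1$):

```python
import numpy as np

# Adjacency matrix of a 4-vertex path graph v1 - v2 - v3 - v4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = A                                    # use A as the similarity matrix
beta = 0.2

lam, U = np.linalg.eigh(S)               # S = U diag(lam) U^T

# Exponential diffusion kernel: K = U diag(e^{beta*lam}) U^T
K_exp = U @ np.diag(np.exp(beta * lam)) @ U.T

# Von Neumann diffusion kernel: K = (I - beta*S)^{-1}
K_vn = U @ np.diag(1.0 / (1.0 - beta * lam)) @ U.T
assert np.allclose(K_vn, np.linalg.inv(np.eye(4) - beta * S))

print(np.linalg.eigvalsh(K_exp).min())   # > 0: exponential kernel is always PSD
print(np.linalg.eigvalsh(K_vn).min())    # > 0 here since beta * max(lam) < 1
```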
Chapter 7: Dimensionality Reduction
PCA: Principal Component Analysis
7.1 Background
$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1 d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2 d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n 1} & x_{n 2} & \cdots & x_{n d} \end{array}\right)$$
The objects are the points $\mathbf{x}_{1},\cdots,\mathbf{x}_n \in \mathbb{R}^d$. For any $\mathbf{x} \in \mathbb{R}^d$, write $\mathbf{x}=(x_1,\cdots,x_d)^T= \sum\limits_{i=1}^{d}x_i \mathbf{e}_i$
where $\mathbf{e}_i=(0,\cdots,1,\cdots,0)^T\in\mathbb{R}^d$ has a $1$ in the $i$-th coordinate.
Now take another orthonormal basis $\{\mathbf{u}_i\}_{i=1}^d$: $\mathbf{x}=\sum\limits_{i=1}^{d}a_i \mathbf{u}_i,\ a_i \in \mathbb{R}$, where $\mathbf{u}_i^T \mathbf{u}_j =\left\{\begin{matrix} 1, & i=j\\ 0, & i\ne j \end{matrix}\right.$
$$\forall r:1\le r\le d,\quad \mathbf{x}=\underbrace{a_1 \mathbf{u}_1+\cdots+a_r \mathbf{u}_r}_{\text{projection}}+ \underbrace{a_{r+1} \mathbf{u}_{r+1}+\cdots+a_d \mathbf{u}_d}_{\text{error}}$$
The first $r$ terms form the projection; the remaining terms are the projection error.
Goal: given $\mathbf{D}$, find the optimal basis $\{\mathbf{u}_i\}_{i=1}^d$ such that the projection of $\mathbf{D}$ onto the subspace spanned by the first $r$ basis vectors is the "best approximation" of $\mathbf{D}$, i.e., the projection error is minimized.
7.2 Principal Component Analysis
7.2.1 Best Line Approximation
(First principal component: $r=1$)
Goal: find $\mathbf{u}_1$, written simply as $\mathbf{u}=(u_1,\cdots,u_d)^T$.
Assumptions: $\|\mathbf{u}\|^2=\mathbf{u}^T\mathbf{u}=1$, and the data are centered, $\hat{\boldsymbol{\mu}}=\frac{1}{n} \sum\limits_{i=1}^n\mathbf{x}_i=\mathbf{0} \in \mathbb{R}^{d}$
For each $\mathbf{x}_i\ (i=1,\cdots,n)$, the projection of $\mathbf{x}_i$ along $\mathbf{u}$ is
$$\mathbf{x}_{i}^{\prime}=\left(\frac{\mathbf{u}^{T} \mathbf{x}_{i}}{\mathbf{u}^{T} \mathbf{u}}\right) \mathbf{u}=\left(\mathbf{u}^{T} \mathbf{x}_{i}\right) \mathbf{u}=a_{i} \mathbf{u},\qquad a_{i}=\mathbf{u}^{T} \mathbf{x}_{i}$$
Since $\hat{\boldsymbol{\mu}}=\mathbf{0}$, the projection of $\hat{\boldsymbol{\mu}}$ onto $\mathbf{u}$ is $0$, and the mean of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ is $\mathbf{0}$: projecting the mean gives the mean of the projections, $Proj(mean(\mathbf{D}))=mean(Proj(\mathbf{D}))$.
Consider the sample variance of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ along the direction $\mathbf{u}$ (the projected mean $\mu_{\mathbf{u}}$ is $0$):
$$\begin{aligned} \sigma_{\mathbf{u}}^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(a_{i}-\mu_{\mathbf{u}}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{u}^{T} \mathbf{x}_{i}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{u}^{T}\left(\mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T}\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \end{aligned}$$
where $\mathbf{\Sigma}$ is the sample covariance matrix (of the centered data).
Objective:
$$\begin{array}{ll} \max\limits_{\mathbf{u}} & \mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \\ \text{s.t.} & \mathbf{u}^T\mathbf{u}-1=0 \end{array}$$
Apply the method of Lagrange multipliers:
$$\max \limits_{\mathbf{u}} J(\mathbf{u})=\mathbf{u}^{T} \Sigma \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)$$
Setting the partial derivative to zero:
$$\begin{aligned} \frac{\partial}{\partial \mathbf{u}} J(\mathbf{u}) &=\mathbf{0} \\ \frac{\partial}{\partial \mathbf{u}}\left(\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)\right) &=\mathbf{0} \\ 2 \mathbf{\Sigma} \mathbf{u}-2 \lambda \mathbf{u} &=\mathbf{0} \\ \mathbf{\Sigma} \mathbf{u} &=\lambda \mathbf{u} \end{aligned}$$
Note that
$$\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}=\mathbf{u}^{T} \lambda \mathbf{u}=\lambda$$
Hence the optimum takes $\lambda$ to be the largest eigenvalue of $\mathbf{\Sigma}$, and $\mathbf{u}$ the corresponding unit eigenvector.
Question: does the $\mathbf{u}$ above, which maximizes $\sigma_{\mathbf{u}}^{2}$, also minimize the projection error?
Define the mean squared error (MSE):
$$\begin{aligned} M S E(\mathbf{u}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right)^{T}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} \mathbf{x}_{i}^{\prime}+\left(\mathbf{x}_{i}^{\prime}\right)^{T} \mathbf{x}_{i}^{\prime}\right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}+\left[(\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right]^{T} \left[ (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right] \right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{x}_{i}^{T} \mathbf{u}+(\mathbf{u}^{T} \mathbf{x}_{i})(\mathbf{x}_{i}^{T} \mathbf{u})\mathbf{u}^{T}\mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{x}_{i}\mathbf{x}_{i}^{T} \mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}\\ &= var(\mathbf{D})-\sigma_{\mathbf{u}}^{2} \end{aligned}$$
This shows that
$$var(\mathbf{D})=\sigma_{\mathbf{u}}^{2}+MSE(\mathbf{u})$$
Since $var(\mathbf{D})$ is fixed, maximizing $\sigma_{\mathbf{u}}^{2}$ is the same as minimizing the MSE.
Geometric meaning of $\mathbf{u}$: the direction of the line in $\mathbb{R}^d$ along which the projected data have maximum variance and, simultaneously, minimum MSE.
$\mathbf{u}$ is called the first principal component.
7.2.2 Best 2-Dimensional Approximation
(Second principal component: $r=2$)
Assume $\mathbf{u}_1$ has been found, i.e., the eigenvector corresponding to the largest eigenvalue of $\mathbf{\Sigma}$.
Goal: find $\mathbf{u}_2$, written simply as $\mathbf{v}$, such that $\mathbf{v}^{T} \mathbf{u}_{1}=0$ and $\mathbf{v}^{T} \mathbf{v} =1$.
Consider the variance of the projections of the $\mathbf{x}_{i}$ along $\mathbf{v}$:
$$\begin{array}{ll} \max\limits_{\mathbf{v}} & \sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} \\ \text{s.t.} & \mathbf{v}^T\mathbf{v}-1=0\\ & \mathbf{v}^{T} \mathbf{u}_{1}=0 \end{array}$$
Define
$$J(\mathbf{v})=\mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v}-\alpha\left(\mathbf{v}^{T} \mathbf{v}-1\right)-\beta\left(\mathbf{v}^{T} \mathbf{u}_{1}-0\right)$$
Taking the partial derivative with respect to $\mathbf{v}$:
$$2 \Sigma \mathbf{v}-2 \alpha \mathbf{v}-\beta \mathbf{u}_{1}=\mathbf{0}$$
Multiplying both sides on the left by $\mathbf{u}_{1}^{T}$:
$$\begin{aligned} 2 \mathbf{u}_{1}^{T}\Sigma \mathbf{v}-2 \alpha \mathbf{u}_{1}^{T}\mathbf{v}-\beta \mathbf{u}_{1}^{T}\mathbf{u}_{1} &=0 \\ 2 \mathbf{u}_{1}^{T}\Sigma \mathbf{v}-\beta &= 0\\ 2 \mathbf{v}^{T}\Sigma \mathbf{u}_{1}-\beta &= 0\\ 2 \mathbf{v}^{T}\lambda_1 \mathbf{u}_{1}-\beta &= 0\\ \beta &= 0 \end{aligned}$$
Substituting back into the original equation:
$$2 \Sigma \mathbf{v}-2 \alpha \mathbf{v}=\mathbf{0}\implies \Sigma \mathbf{v}=\alpha \mathbf{v}$$
So $\mathbf{v}$ is also an eigenvector of $\mathbf{\Sigma}$.
Since $\sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} =\alpha$, $\alpha$ should be taken as the second largest eigenvalue of $\mathbf{\Sigma}$, with $\mathbf{v}$ the corresponding unit eigenvector.
Question 1: do the $\mathbf{v}$ obtained above (i.e., $\mathbf{u}_2$) together with $\mathbf{u}_1$ maximize the total variance of the projection of $\mathbf{D}$ onto $span\{\mathbf{u}_1, \mathbf{u}_2\}$?
Write $\mathbf{x}_i=\underbrace{a_{i1} \mathbf{u}_1+a_{i2}\mathbf{u}_2}_{\text{projection}}+\cdots$
Then the coordinates of the projection of $\mathbf{x}_i$ onto $span\{\mathbf{u}_1, \mathbf{u}_2\}$ are $\mathbf{a}_{i}=(a_{i1},a_{i2})^T=(\mathbf{u}_1^{T}\mathbf{x}_i,\mathbf{u}_2^{T}\mathbf{x}_i)^{T}$
Let $\mathbf{U}_{2}=\left(\begin{array}{cc} \mid & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} \\ \mid & \mid \end{array}\right)$; then $\mathbf{a}_{i}=\mathbf{U}_{2}^{T} \mathbf{x}_{i}$
The total variance of the projections is
$$\begin{aligned} \operatorname{var}(\mathbf{A}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{a}_{i}-\mathbf{0}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right)^{T}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left( \mathbf{u}_{1}\mathbf{u}_{1}^T + \mathbf{u}_{2}\mathbf{u}_{2}^T \right) \mathbf{x}_{i}\\ &=\mathbf{u}_{1}^T\mathbf{\Sigma} \mathbf{u}_{1} + \mathbf{u}_{2}^T\mathbf{\Sigma} \mathbf{u}_{2}\\ &= \lambda_1 +\lambda_2 \end{aligned}$$
Question 2: is the mean squared error minimized?
With $\mathbf{x}_{i}^{\prime}=\mathbf{U}_{2}\mathbf{U}_{2}^{T} \mathbf{x}_{i}$,
$$\begin{aligned} M S E &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2} - \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &= var(\mathbf{D}) - \lambda_1 - \lambda_2 \end{aligned}$$
Conclusions:
- The sum of the first $r$ eigenvalues of $\mathbf{\Sigma}$, $\lambda_1+\cdots+\lambda_r\ (\lambda_1\ge\cdots\ge\lambda_r)$, gives the maximum total projected variance;
- $var(\mathbf{D})-\sum\limits_{i=1}^r \lambda_i$ gives the minimum MSE;
- The eigenvectors $\mathbf{u}_{1},\cdots,\mathbf{u}_{r}$ corresponding to $\lambda_1,\cdots,\lambda_r$ span the rank-$r$ principal subspace.
7.2.3 Generalization
For $\Sigma_{d\times d}$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and centered data:
$\sum\limits_{i=1}^r\lambda_i$: the maximum total projected variance;
$var(\mathbf{D})-\sum\limits_{i=1}^r\lambda_i$: the minimum MSE.
In practice, to choose an appropriate $r$, compare the ratio $\frac{\sum\limits_{i=1}^r\lambda_i}{var(\mathbf{D})}$ against a given threshold $\alpha$.
Algorithm 7.1 PCA:
Input: $\mathbf{D}$, $\alpha$
Output: $A$ (the reduced data)
- $\boldsymbol{\mu} = \frac{1}{n}\sum\limits_{i=1}^n\mathbf{x}_i$;
- $\mathbf{Z}=\mathbf{D}-\mathbf{1}\cdot \boldsymbol{\mu} ^T$;
- $\mathbf{\Sigma}=\frac{1}{n}(\mathbf{Z}^T\mathbf{Z})$;
- $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \longleftarrow$ the eigenvalues of $\mathbf{\Sigma}$ (in decreasing order);
- $\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_d \longleftarrow$ the corresponding eigenvectors of $\mathbf{\Sigma}$ (orthonormal);
- Compute $\frac{\sum\limits_{i=1}^r\lambda_i}{var(\mathbf{D})}$ and choose the smallest $r$ for which this ratio exceeds $\alpha$;
- $\mathbf{U}_r=(\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_r)$;
- $A=\{\mathbf{a}_i\mid\mathbf{a}_i=\mathbf{U}_r^T\mathbf{x}_i,\ i=1,\cdots,n\}$.
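A direct numpy transcription of Algorithm 7.1 (variable names mirror the pseudocode; the data matrix is arbitrary):

```python
import numpy as np

def pca(D: np.ndarray, alpha: float):
    """Return the reduced coordinates A and the basis U_r, following Algorithm 7.1."""
    n = D.shape[0]
    mu = D.mean(axis=0)                         # step 1: mean
    Z = D - mu                                  # step 2: center
    Sigma = (Z.T @ Z) / n                       # step 3: sample covariance matrix
    lam, U = np.linalg.eigh(Sigma)              # eigh returns ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]              # steps 4-5: sort in decreasing order
    ratio = np.cumsum(lam) / lam.sum()          # var(D) = trace(Sigma) = sum of eigenvalues
    r = int(np.searchsorted(ratio, alpha) + 1)  # step 6: smallest r with ratio >= alpha
    U_r = U[:, :r]                              # step 7
    A = Z @ U_r                                 # step 8: project the (centered) points
    return A, U_r

rng = np.random.default_rng(2)
D = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
A, U_r = pca(D, alpha=0.9)
print(A.shape)   # (100, r)
```

This sketch projects the centered rows $\mathbf{z}_i$ rather than the raw $\mathbf{x}_i$, consistent with the centering assumption used throughout Section 7.2.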
7.3 Kernel PCA: Kernel Principal Component Analysis
$\phi:\mathcal{I}\to \mathcal{F}\subseteq \mathbb{R}^d$
$K:\mathcal{I}\times\mathcal{I}\to \mathbb{R}$
$K(\mathbf{x}_i,\mathbf{x}_j)=\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)$
Known: $\mathbf{K}=[K(\mathbf{x}_i,\mathbf{x}_j)]_{n\times n}$ and $\mathbf{\Sigma}_{\phi}=\frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T$
Objects: $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)\in \mathbb{R}^d$. Assume $\frac{1}{n}\sum\limits_{i=1}^{n}\phi(\mathbf{x}_i)=\mathbf{0}$, i.e., $\mathbf{K}$ has been replaced by the centered $\hat{\mathbf{K}}$.
Goal: find $\mathbf{u},\lambda$ such that $\mathbf{\Sigma}_{\phi}\mathbf{u}=\lambda\mathbf{u}$, i.e.,
$$\begin{aligned} \frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)[\phi(\mathbf{x}_i)^T\mathbf{u}] &=\lambda\mathbf{u}\\ \sum\limits_{i=1}^n\left[\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}\right] \phi(\mathbf{x}_i)&=\mathbf{u} \end{aligned}$$
So $\mathbf{u}$ is a linear combination of all the mapped data points.
Let $c_i=\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}$, so that $\mathbf{u}=\sum\limits_{i=1}^nc_i \phi(\mathbf{x}_i)$. Substituting into the original equation:
$$\begin{aligned} \left(\frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T}\right)\left(\sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{j}\right)\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(\phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \end{aligned}$$
Note that here $\mathbf{K}=\hat{\mathbf{K}}$ has already been centered.
For each $k\ (1\le k\le n)$, multiply both sides on the left by $\phi^T(\mathbf{x}_{k})$:
$$\begin{aligned} \sum_{i=1}^{n}\left(\phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(K(\mathbf{x}_k, \mathbf{x}_i) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} K(\mathbf{x}_k, \mathbf{x}_i) \end{aligned}$$
令
K
i
=
(
K
(
x
i
,
x
1
)
,
K
(
x
i
,
x
2
)
,
⋯
,
K
(
x
i
,
x
n
)
)
T
\mathbf{K}_{i}=\left(K\left(\mathbf{x}_{i}, \mathbf{x}_{1}\right), K\left(\mathbf{x}_{i}, \mathbf{x}_{2}\right), \cdots, K\left(\mathbf{x}_{i}, \mathbf{x}_{n}\right)\right)^{T}
Ki=(K(xi,x1),K(xi,x2),⋯,K(xi,xn))T (核矩阵的第
i
i
i 行,
K
=
(
[
K
1
T
⋮
K
n
T
]
)
\mathbf{K}=(\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix})
K=(⎣⎢⎡K1T⋮KnT⎦⎥⎤)),
c
=
(
c
1
,
c
2
,
⋯
,
c
n
)
T
\mathbf{c}=(c_1,c_2,\cdots,c_n)^T
c=(c1,c2,⋯,cn)T,则:
$$\begin{aligned} \sum_{i=1}^{n}K(\mathbf{x}_k, \mathbf{x}_i) \mathbf{K}^T_i\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c},\quad k=1,2,\cdots,n \\ \mathbf{K}^T_k\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c}\\ \mathbf{K}^T_k\mathbf{K}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c} \end{aligned}$$
即 $\mathbf{K}^2\mathbf{c}=n\lambda \mathbf{K}\mathbf{c}$。假设 $\mathbf{K}^{-1}$ 存在,则
$$\begin{aligned} \mathbf{K}^2\mathbf{c}&=n\lambda \mathbf{K}\mathbf{c}\\ \mathbf{K}\mathbf{c}&=n\lambda \mathbf{c}\\ \mathbf{K}\mathbf{c}&= \eta\mathbf{c},\quad\eta=n\lambda \end{aligned}$$
结论:$\frac{\eta_1}{n}\ge\frac{\eta_2}{n}\ge\cdots\ge\frac{\eta_n}{n}$ 给出特征空间中 $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ 沿各主元方向的投影方差;前 $r$ 个方向的累计投影方差为 $\sum\limits_{i=1}^{r}\frac{\eta_i}{n}$,其中 $\eta_1\ge\eta_2\ge\cdots\ge\eta_n$ 是 $\mathbf{K}$ 的特征值。
问:可否计算出 ϕ ( x 1 ) , ϕ ( x 2 ) , ⋯ , ϕ ( x n ) \phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n) ϕ(x1),ϕ(x2),⋯,ϕ(xn) 在主元方向上的投影(即降维之后的数据)?
设 $\mathbf{u}_1,\cdots,\mathbf{u}_d$ 是 $\mathbf{\Sigma}_{\phi}$ 的特征向量,则 $\phi(\mathbf{x}_j)=a_1\mathbf{u}_1+\cdots+a_d\mathbf{u}_d$,其中
$$\begin{aligned} a_k &= \phi(\mathbf{x}_j)^T\mathbf{u}_k,\quad k=1,2,\cdots,d\\ &= \phi(\mathbf{x}_j)^T\sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} K(\mathbf{x}_j,\mathbf{x}_i) \end{aligned}$$
算法7.2:核主元分析( F ⊆ R d \mathcal{F}\subseteq \mathbb{R}^d F⊆Rd)
输入: K K K, α \alpha α
输出: A A A (降维后数据的投影坐标)
-
K ^ : = ( I − 1 n 1 n × n ) K ( I − 1 n 1 n × n ) \hat{\mathbf{K}} :=\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) \mathbf{K}\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) K^:=(I−n11n×n)K(I−n11n×n)
-
η 1 , η 2 , ⋯ η d \eta_1,\eta_2,\cdots\eta_d η1,η2,⋯ηd ⟵ K \longleftarrow \mathbf{K} ⟵K 的特征值,只取前 d d d 个
-
c 1 , c 2 , ⋯ , c d \mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_d c1,c2,⋯,cd ⟵ K \longleftarrow \mathbf{K} ⟵K 的特征向量(单位化,正交)
-
c i ← 1 η i ⋅ c i , i = 1 , ⋯ , d \mathbf{c}_i \leftarrow \frac{1}{\sqrt{\eta_i}}\cdot \mathbf{c}_i,i=1,\cdots,d ci←ηi1⋅ci,i=1,⋯,d
-
选取最小的 r r r 使得: ∑ i = 1 r η i n ∑ i = 1 d η i n ≥ α \frac{\sum\limits_{i=1}^r\frac{\eta_i}{n}}{\sum\limits_{i=1}^d\frac{\eta_i}{n}}\ge \alpha i=1∑dnηii=1∑rnηi≥α
-
C r = ( c 1 , c 2 , ⋯ , c r ) \mathbf{C}_r=(\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_r) Cr=(c1,c2,⋯,cr)
-
A = { a i ∣ a i = C r T K i , i = 1 , ⋯ , n } A=\{\mathbf{a}_i|\mathbf{a}_i=\mathbf{C}_r^T\mathbf{K}_i, i=1,\cdots,n\} A={ai∣ai=CrTKi,i=1,⋯,n}
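以下补充算法7.2 的一个最小实现草图(原笔记无代码;使用 NumPy,函数名 `kernel_pca` 及接口均为笔者示意假设),依次展示中心化、特征分解、缩放与投影各步:

```python
import numpy as np

def kernel_pca(K, alpha=0.95):
    """算法7.2 核主元分析的示意实现。
    K: n×n 核矩阵;alpha: 希望保留的投影方差比例。
    返回降维后的坐标矩阵 A (n×r)。"""
    n = K.shape[0]
    # 步骤1:中心化 K_hat = (I - J/n) K (I - J/n)
    I, J = np.eye(n), np.ones((n, n)) / n
    K_hat = (I - J) @ K @ (I - J)
    # 步骤2:特征分解(K_hat 对称,用 eigh),按特征值降序排列
    eta, C = np.linalg.eigh(K_hat)
    order = np.argsort(eta)[::-1]
    eta, C = eta[order], C[:, order]
    eta = np.clip(eta, 0.0, None)            # 数值误差可能产生微小负特征值
    pos = eta > 1e-12
    eta, C = eta[pos], C[:, pos]
    # 步骤3:c_i <- c_i / sqrt(eta_i),保证对应的 u_i 为单位向量
    C = C / np.sqrt(eta)
    # 步骤4:取最小的 r 使累计方差比例 >= alpha
    ratio = np.cumsum(eta) / eta.sum()
    r = int(np.searchsorted(ratio, alpha)) + 1
    # 步骤5:投影坐标 a_i = C_r^T K_i,即 A = K_hat C_r
    return K_hat @ C[:, :r]

# 用法示意:线性核 K = X X^T 时退化为普通 PCA
X = np.random.randn(100, 5)
A = kernel_pca(X @ X.T, alpha=0.9)
```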
Chapter 14:Hierarchical Clustering 分层聚类
14.1 预备
Def.1 给定数据集 D = { x 1 , x 2 , ⋯ , x n } , ( x i ∈ R d ) \mathbf{D}=\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\},(\mathbf{x}_i\in \mathbb{R}^d) D={x1,x2,⋯,xn},(xi∈Rd), D \mathbf{D} D 的一个聚类是指 D \mathbf{D} D 的划分 C = { C 1 , C 2 , ⋯ , C k } \mathcal{C}=\{C_1,C_2,\cdots,C_k \} C={C1,C2,⋯,Ck} s.t. C i ⊆ D , C i ∩ C j = ∅ , ∪ i = 1 k C i = D C_i\subseteq \mathbf{D},C_i \cap C_j=\emptyset, \cup_{i=1}^k C_i=\mathbf{D} Ci⊆D,Ci∩Cj=∅,∪i=1kCi=D;
称聚类 A = { A 1 , ⋯ , A r } \mathcal{A}=\{A_1,\cdots,A_r\} A={A1,⋯,Ar} 是聚类 B = { B 1 , ⋯ , B s } \mathcal{B}=\{B_1,\cdots,B_s\} B={B1,⋯,Bs} 的嵌套,如果 r > s r>s r>s,且对于 ∀ A i ∈ A \forall A_i \in \mathcal{A} ∀Ai∈A ,存在 B j ∈ B B_j \in \mathcal{B} Bj∈B 使得 A i ⊆ B j A_i \subseteq B_j Ai⊆Bj
D \mathbf{D} D 的分层聚类是指一个嵌套聚类序列 C 1 , ⋯ , C n \mathcal{C}_1,\cdots,\mathcal{C}_n C1,⋯,Cn,其中 C 1 = { { x 1 } , { x 2 } , ⋯ , { x n } } , ⋯ , C n = { { x 1 , x 2 , ⋯ , x n } } \mathcal{C}_1=\{ \{\mathbf{x}_1\},\{\mathbf{x}_2\},\cdots,\{\mathbf{x}_n\}\},\cdots,\mathcal{C}_n=\{\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\} \} C1={{x1},{x2},⋯,{xn}},⋯,Cn={{x1,x2,⋯,xn}},且 C t \mathcal{C}_t Ct 是 C t + 1 \mathcal{C}_{t+1} Ct+1 的嵌套。
Def.2 分层聚类示图的顶点集是指所有在 $\mathcal{C}_1,\cdots,\mathcal{C}_n$ 中出现的类,如果 $C_i \in \mathcal{C}_t$ 且 $C_j \in \mathcal{C}_{t+1}$ 满足 $C_i \subseteq C_j$,则 $C_i$ 与 $C_j$ 之间有一条边。
事实:
- 分层聚类示图是一棵二叉树(一般不一定是二叉;此处作为假设,即每一步只合并两个类),分层聚类与其示图一一对应。
- 设数据点数为 $n$,则所有可能的分层聚类示图数目为 $(2n-3)!!$(双阶乘 $1\times 3 \times 5 \times \cdots$)。例如 $n=3$ 时共有 $3!!=3$ 种。
14.2 团聚分层聚类
算法14.1 :
输入: D , k \mathbf{D}, k D,k
输出: C \mathcal{C} C
- C ← { C i = { x i } ∣ x i ∈ D } \mathcal{C} \leftarrow \{C_i=\{\mathbf{x}_i\}|\mathbf{x}_i \in \mathbf{D} \} C←{Ci={xi}∣xi∈D}
- Δ ← { δ ( x i , x j ) : x i , x j ∈ D } \Delta \leftarrow \{\delta(\mathbf{x}_i,\mathbf{x}_j):\mathbf{x}_i,\mathbf{x}_j \in \mathbf{D} \} Δ←{δ(xi,xj):xi,xj∈D}
- repeat
- 寻找最近的对 C i , C j ∈ C C_i,C_j \in \mathcal{C} Ci,Cj∈C
- C i j ← C i ∪ C j C_{ij}\leftarrow C_i \cup C_j Cij←Ci∪Cj
- $\mathcal{C}\leftarrow (\mathcal{C} \setminus \{C_i,C_j \}) \cup \{C_{ij}\}$
- 根据 C \mathcal{C} C 更新距离矩阵 Δ \Delta Δ
- Until ∣ C ∣ = k |\mathcal{C}|=k ∣C∣=k
问题:如何定义/计算 C i , C j C_i,C_j Ci,Cj 的距离,即 δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) ?
δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) 有以下五种不同方式:
-
简单连接: δ ( C i , C j ) : = min { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \min \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=min{δ(x,y)∣x∈Ci,y∈Cj}
-
完全连接: δ ( C i , C j ) : = max { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \max \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=max{δ(x,y)∣x∈Ci,y∈Cj}
-
组群平均: δ ( C i , C j ) : = ∑ x ∈ C i ∑ y ∈ C j δ ( x , y ) n i ⋅ n j , n i = ∣ C i ∣ , n j = ∣ C j ∣ \delta(C_i,C_j):= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_j}\delta(\mathbf{x},\mathbf{y})}{n_i \cdot n_j}, n_i=|C_i|,n_j=|C_j| δ(Ci,Cj):=ni⋅njx∈Ci∑y∈Cj∑δ(x,y),ni=∣Ci∣,nj=∣Cj∣
-
均值距离: $\delta(C_i,C_j):= ||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2,\ \boldsymbol{\mu}_i=\frac{1}{n_i}\sum\limits_{\mathbf{x} \in C_i}\mathbf{x},\ \boldsymbol{\mu}_j=\frac{1}{n_j}\sum\limits_{\mathbf{y} \in C_j}\mathbf{y}$
-
极小方差:对任意 C i C_i Ci,定义平方误差和 S S E i = ∑ x ∈ C i ∣ ∣ x − μ i ∣ ∣ 2 SSE_i= \sum\limits_{\mathbf{x} \in C_i} ||\mathbf{x}-\boldsymbol{\mu}_i|| ^2 SSEi=x∈Ci∑∣∣x−μi∣∣2
对 C i , C j , S S E i j : = ∑ x ∈ C i ∪ C j ∣ ∣ x − μ i j ∣ ∣ 2 C_i,C_j,SSE_{ij}:=\sum\limits_{\mathbf{x} \in C_i\cup C_j} ||\mathbf{x}-\boldsymbol{\mu}_{ij}|| ^2 Ci,Cj,SSEij:=x∈Ci∪Cj∑∣∣x−μij∣∣2,其中 μ i j : = 1 n i + n j ∑ x ∈ C i ∪ C j x \boldsymbol{\mu}_{ij}:=\frac{1}{n_i+n_j}\sum\limits_{\mathbf{x} \in C_i\cup C_j}\mathbf{x} μij:=ni+nj1x∈Ci∪Cj∑x
δ ( C i , C j ) : = S S E i j − S S E i − S S E j \delta(C_i,C_j):=SSE_{ij}-SSE_i-SSE_j δ(Ci,Cj):=SSEij−SSEi−SSEj
证明: δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μi−μj∣∣2
简记: C i j : = C i ∪ C j , n i j : = n i + n j C_{ij}:=C_i\cup C_j,n_{ij}:=n_i+n_j Cij:=Ci∪Cj,nij:=ni+nj
注意:$C_i \cap C_j=\emptyset$,故 $|C_{ij}|=n_i+n_j$。
$$\begin{aligned} \delta\left(C_{i}, C_{j}\right) &=\sum_{\mathbf{z} \in C_{i j}}\left\|\mathbf{z}-\boldsymbol{\mu}_{i j}\right\|^{2}-\sum_{\mathbf{x} \in C_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}-\sum_{\mathbf{y} \in C_{j}}\left\|\mathbf{y}-\boldsymbol{\mu}_{j}\right\|^{2} \\ &=\sum_{\mathbf{z} \in C_{i j}} \mathbf{z}^{T} \mathbf{z}-n_{i j} \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}-\sum_{\mathbf{x} \in C_{i}} \mathbf{x}^{T} \mathbf{x}+n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-\sum_{\mathbf{y} \in C_{j}} \mathbf{y}^{T} \mathbf{y}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j} \\ &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j} \end{aligned}$$
注意到:
$$\boldsymbol{\mu}_{i j}=\frac{1}{n_{ij}}\sum\limits_{\mathbf{z} \in C_{ij}} \mathbf{z}=\frac{1}{n_i+n_j}\left(\sum\limits_{\mathbf{x} \in C_{i}} \mathbf{x}+\sum\limits_{\mathbf{y} \in C_{j}} \mathbf{y}\right)=\frac{1}{n_i+n_j}(n_i\boldsymbol{\mu}_{i}+n_j\boldsymbol{\mu}_{j})$$
故:
$$\boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}=\frac{1}{\left(n_{i}+n_{j}\right)^{2}}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right)$$
$$\begin{aligned} \delta\left(C_{i}, C_{j}\right) &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-\frac{1}{n_{i}+n_{j}}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) \\ &=\frac{n_{i}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}-n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}}{n_{i}+n_{j}} \\ &=\frac{n_{i} n_{j}\left(\boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+\boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right)}{n_{i}+n_{j}} \\ &=\left(\frac{n_{i} n_{j}}{n_{i}+n_{j}}\right)\left\|\boldsymbol{\mu}_{i}-\boldsymbol{\mu}_{j}\right\|^{2} \end{aligned}$$
问题:如何快速计算算法14.1 第7步:更新矩阵?
☆ Lance–Williams formula
$$\delta\left(C_{i j}, C_{r}\right)=\alpha_{i} \cdot \delta\left(C_{i}, C_{r}\right)+\alpha_{j} \cdot \delta\left(C_{j}, C_{r}\right)+\beta \cdot \delta\left(C_{i}, C_{j}\right)+\gamma \cdot\left|\delta\left(C_{i}, C_{r}\right)-\delta\left(C_{j}, C_{r}\right)\right|$$
Measure | $\alpha_i$ | $\alpha_j$ | $\beta$ | $\gamma$ |
---|---|---|---|---|
简单连接 | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $-\frac{1}{2}$ |
完全连接 | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $\frac{1}{2}$ |
组群平均 | $\frac{n_i}{n_i+n_j}$ | $\frac{n_j}{n_i+n_j}$ | $0$ | $0$ |
均值距离 | $\frac{n_i}{n_i+n_j}$ | $\frac{n_j}{n_i+n_j}$ | $\frac{-n_in_j}{(n_i+n_j)^2}$ | $0$ |
极小方差 | $\frac{n_i+n_r}{n_i+n_j+n_r}$ | $\frac{n_j+n_r}{n_i+n_j+n_r}$ | $\frac{-n_r}{n_i+n_j+n_r}$ | $0$ |
Proof:
-
简单连接
$$\begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \min \{\delta({\mathbf{x}, \mathbf{y}} )\mid\mathbf{x}\in C_{ij}, \mathbf{y} \in C_r\} \\ &= \min \{\delta(C_{i}, C_{r}), \delta(C_{j}, C_{r})\} \end{aligned}$$
再利用恒等式 $\min\{a,b\}=\frac{a+b-|a-b|}{2},\ \max\{a,b\}=\frac{a+b+|a-b|}{2}$ 即得表中系数。
-
完全连接
同上,改用 $\max\{a,b\}=\frac{a+b+|a-b|}{2}$ 即可。
-
组群平均
δ ( C i j , C r ) = ∑ x ∈ C i ∪ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = ∑ x ∈ C i ∑ y ∈ C r δ ( x , y ) + ∑ x ∈ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = n i n r δ ( C i , C r ) + n j n r δ ( C j , C r ) ( n i + n j ) ⋅ n r = n i δ ( C i , C r ) + n j δ ( C j , C r ) ( n i + n j ) \begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \frac{\sum\limits_{\mathbf{x} \in C_i\cup C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})+\sum\limits_{\mathbf{x} \in C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &=\frac{n_in_r\delta(C_i,C_r)+n_jn_r\delta(C_j,C_r)}{(n_i+n_j )\cdot n_r}\\ &=\frac{n_i\delta(C_i,C_r)+n_j\delta(C_j,C_r)}{(n_i+n_j )} \end{aligned} δ(Cij,Cr)=(ni+nj)⋅nrx∈Ci∪Cj∑y∈Cr∑δ(x,y)=(ni+nj)⋅nrx∈Ci∑y∈Cr∑δ(x,y)+x∈Cj∑y∈Cr∑δ(x,y)=(ni+nj)⋅nrninrδ(Ci,Cr)+njnrδ(Cj,Cr)=(ni+nj)niδ(Ci,Cr)+njδ(Cj,Cr) -
均值距离:作业
-
极小方差
基于均值距离的结论再代入 δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μi−μj∣∣2
事实:算法14.1 的复杂度为 O ( n 2 log n ) O(n^2\log n) O(n2logn)
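以下补充算法14.1 的一个朴素实现草图(原笔记无代码;NumPy 实现,取组群平均连接,借助 Lance–Williams 公式更新距离,函数名与接口均为笔者假设):

```python
import numpy as np

def agglomerative(D, k):
    """算法14.1:团聚分层聚类(组群平均连接)的示意实现。
    D: n×d 数据;k: 目标类数。返回每个点的类标号。"""
    n = len(D)
    clusters = {i: [i] for i in range(n)}                 # 每个类包含的点
    delta = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    np.fill_diagonal(delta, np.inf)                       # 对角线置 inf,便于取最小
    active = set(range(n))
    while len(active) > k:
        # 寻找最近的一对 (C_i, C_j)
        i, j = min(((a, b) for a in active for b in active if a < b),
                   key=lambda p: delta[p])
        ni, nj = len(clusters[i]), len(clusters[j])
        # Lance-Williams(组群平均): d(C_ij,C_r) = ni/(ni+nj)·d(i,r) + nj/(ni+nj)·d(j,r)
        for r in active - {i, j}:
            delta[i, r] = delta[r, i] = (ni * delta[i, r] + nj * delta[j, r]) / (ni + nj)
        clusters[i] += clusters[j]                        # C_ij 存到 i 的位置
        active.remove(j)
        delta[j, :] = delta[:, j] = np.inf
    labels = np.empty(n, dtype=int)
    for lab, i in enumerate(sorted(active)):
        labels[clusters[i]] = lab
    return labels
```

换用表中其它系数即可得到简单连接、完全连接等变体;此朴素写法每轮扫描整张距离矩阵,复杂度高于正文所述的 $O(n^2\log n)$。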
Chapter 15:基于密度的聚类
适用数据类型:非凸数据(故又称非凸聚类);K-means 则适用于凸形状的类。
15.1 DBSCAN 算法
- 定义记号:$\forall \mathbf{x}\in \mathbb{R}^d,N_{\epsilon}(\mathbf{x}):=\{\mathbf{y}\in \mathbb{R}^d\mid\delta(\mathbf{x},\mathbf{y})\le\epsilon \}$,其中 $\delta(\mathbf{x},\mathbf{y})=||\mathbf{x}-\mathbf{y}||$ 为欧氏距离(其他距离也可);$\mathbf{D}\subseteq \mathbb{R}^d$。
Def.1 设 m i n p t s ∈ N + minpts \in \mathbb{N}_+ minpts∈N+ 是用户定义的局部密度,如果 ∣ N ϵ ( x ) ∩ D ∣ ≥ m i n p t s |N_{\epsilon}(\mathbf{x})\cap\mathbf{D}|\ge minpts ∣Nϵ(x)∩D∣≥minpts ,则称 x \mathbf{x} x 是 D \mathbf{D} D 核心点;如果 ∣ N ϵ ( x ) ∩ D ∣ < m i n p t s |N_{\epsilon}(\mathbf{x})\cap\mathbf{D}|< minpts ∣Nϵ(x)∩D∣<minpts ,且 x ∈ N ϵ ( z ) \mathbf{x}\in N_{\epsilon}(\mathbf{z}) x∈Nϵ(z) ,其中 z \mathbf{z} z 是 D \mathbf{D} D 的核心点,则称 x \mathbf{x} x 是 D \mathbf{D} D 的边缘点;如果 x \mathbf{x} x 既不是核心点又不是边缘点,则称 x \mathbf{x} x 是 D \mathbf{D} D 的噪点。
Def.2 如果 x ∈ N ϵ ( y ) \mathbf{x}\in N_{\epsilon}(\mathbf{y}) x∈Nϵ(y) 且 y \mathbf{y} y 是核心点,则称 x \mathbf{x} x 到 y \mathbf{y} y 是直接密度可达的。如果存在点列 x 0 , x 1 , ⋯ , x l \mathbf{x}_0,\mathbf{x}_1,\cdots,\mathbf{x}_l x0,x1,⋯,xl,使得 x 0 = x , x l = y \mathbf{x}_0=\mathbf{x},\mathbf{x}_l=\mathbf{y} x0=x,xl=y,且 x i \mathbf{x}_{i} xi 到 x i − 1 \mathbf{x}_{i-1} xi−1 是直接密度可达,则称 x \mathbf{x} x 到 y \mathbf{y} y 是密度可达。
Def.3 如果存在 z ∈ D \mathbf{z}\in \mathbf{D} z∈D,使得 x \mathbf{x} x 和 y \mathbf{y} y 到 z \mathbf{z} z 都是密度可达的,称 x \mathbf{x} x 和 y \mathbf{y} y 是密度连通的。
Def.4 基于密度的聚类是指基数最大的密度连通集(即集合内任意两点都是密度连通)。
算法15.1 : DBSCAN ( O ( n 2 ) O(n^2) O(n2))
输入: D , ϵ , m i n p t s \mathbf{D}, \epsilon, minpts D,ϵ,minpts
输出: C , C o r e , B o r d e r , N o i s e \mathcal{C},Core,Border,Noise C,Core,Border,Noise
-
C o r e ← ∅ Core \leftarrow \emptyset Core←∅
-
对每一个 x i ∈ D \mathbf{x}_i\in \mathbf{D} xi∈D
2.1 计算 N ϵ ( x i ) ( ⊆ D ) N_\epsilon(\mathbf{x}_i)(\subseteq \mathbf{D}) Nϵ(xi)(⊆D)
2.2 i d ( x i ) ← ∅ id(\mathbf{x}_i)\leftarrow \emptyset id(xi)←∅
2.3 如果 $|N_\epsilon(\mathbf{x}_i)|\ge minpts$,则 $Core\leftarrow Core \cup \{ \mathbf{x}_i\}$
-
k ← 0 k\leftarrow 0 k←0
-
对每一个 x i ∈ C o r e , s . t . i d ( x i ) = ∅ \mathbf{x}_i\in Core, s.t.id(\mathbf{x}_i)= \emptyset xi∈Core,s.t.id(xi)=∅,执行
4.1 k ← k + 1 k\leftarrow k+1 k←k+1
4.2 i d ( x i ) ← k id(\mathbf{x}_i)\leftarrow k id(xi)←k
4.3 D e n s i t y C o n n e c t e d ( x i , k ) Density Connected (\mathbf{x}_i,k) DensityConnected(xi,k)
-
C ← { C i } i = 1 k \mathcal{C}\leftarrow \{ C_i\}_{i=1}^k C←{Ci}i=1k,其中 C i ← { x i ∈ D ∣ i d ( x i ) = i } C_i\leftarrow \{\mathbf{x}_i \in \mathbf{D} |id(\mathbf{x}_i)=i\} Ci←{xi∈D∣id(xi)=i}
-
N o i s e ← { x i ∈ D ∣ i d ( x i ) = ∅ } Noise \leftarrow \{\mathbf{x}_i \in \mathbf{D} |id(\mathbf{x}_i)=\emptyset\} Noise←{xi∈D∣id(xi)=∅}
-
B o r d e r ← D ∖ { C o r e ∪ N o i s e } Border\leftarrow \mathbf{D}\setminus \{Core\cup Noise \} Border←D∖{Core∪Noise}
-
return C , C o r e , B o r d e r , N o i s e \mathcal{C},Core,Border,Noise C,Core,Border,Noise
$DensityConnected(\mathbf{x},k)$:
-
对于每一个 $\mathbf{y} \in N_\epsilon(\mathbf{x}) \setminus \{\mathbf{x}\}$
1.1 $id(\mathbf{y})\leftarrow k$
1.2 如果 $\mathbf{y}\in Core$,则 $DensityConnected(\mathbf{y},k)$
Remark:DBSCAN 对 ε \varepsilon ε 敏感: ε \varepsilon ε 过小,稀疏的类可能被认作噪点; ε \varepsilon ε 过大,稠密的类可能无法区分。
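以下补充算法15.1 的一个迭代版实现草图(原笔记无代码;用队列代替递归的 DensityConnected,NumPy 实现,接口为笔者假设):

```python
import numpy as np
from collections import deque

def dbscan(D, eps, minpts):
    """算法15.1 DBSCAN 的示意实现。
    返回 ids(0 表示噪点,类标号从 1 起)、核心点集合、噪点下标。"""
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]   # N_eps(x_i),含自身
    core = {i for i in range(n) if len(neighbors[i]) >= minpts}
    ids = np.zeros(n, dtype=int)
    k = 0
    for i in core:
        if ids[i] != 0:
            continue
        k += 1
        ids[i] = k
        queue = deque([i])
        while queue:                         # 沿核心点扩张一个密度连通集
            x = queue.popleft()
            for y in neighbors[x]:
                if ids[y] == 0:
                    ids[y] = k
                    if y in core:
                        queue.append(y)
    noise = np.flatnonzero(ids == 0)
    return ids, core, noise
```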
15.2 密度估计函数(DEF)
∀ z ∈ R d \forall \mathbf{z}\in \mathbb{R}^d ∀z∈Rd,定义 K ( z ) = 1 ( 2 π ) d / 2 e − z T z 2 K(\mathbf{z})=\frac{1}{(2\pi)^{d/2}}e^{-\frac{\mathbf{z}^T\mathbf{z}}{2}} K(z)=(2π)d/21e−2zTz, ∀ x ∈ R d , f ^ ( x ) : = 1 n h d ∑ i = 1 n K ( x − x i h ) \forall \mathbf{x}\in \mathbb{R}^d,\hat{f}(\mathbf{x}):=\frac{1}{nh^d}\sum\limits_{i=1}^{n}K(\frac{\mathbf{x}-\mathbf{x}_i}{h}) ∀x∈Rd,f^(x):=nhd1i=1∑nK(hx−xi)
其中 h > 0 h>0 h>0 是用户指定的步长, { x 1 , ⋯ , x n } \{\mathbf{x}_1,\cdots,\mathbf{x}_n\} {x1,⋯,xn} 是给定的数据集
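以下给出 $\hat{f}$ 的一个直接实现草图(原笔记无代码;NumPy,函数名为笔者假设),逐项累加高斯核:

```python
import numpy as np

def kde(x, X, h):
    """高斯核密度估计 f_hat(x)。x: 查询点 (d,);X: n×d 数据;h: 带宽。"""
    n, d = X.shape
    z = (x - X) / h                                        # 每行为 (x - x_i)/h
    K = np.exp(-0.5 * np.sum(z * z, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)
```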
15.3 DENCLUE
Def.1 称 x ∗ ∈ R d \mathbf{x}^*\in \mathbb{R}^d x∗∈Rd 是密度吸引子,如果它决定概率密度函数 f f f 的一个局部最大值。(PDF一般未知)
称 x ∗ ∈ R d \mathbf{x}^*\in \mathbb{R}^d x∗∈Rd 是 x ∈ R d \mathbf{x}\in \mathbb{R}^d x∈Rd 的密度吸引子,如果存在 x 0 , x 1 , … , x m \mathbf{x}_0,\mathbf{x}_1,\dots,\mathbf{x}_m x0,x1,…,xm,使得 x 0 = x , ∣ ∣ x m − x ∗ ∣ ∣ ≤ ϵ \mathbf{x}_0=\mathbf{x},||\mathbf{x}_m-\mathbf{x}^*||\le\epsilon x0=x,∣∣xm−x∗∣∣≤ϵ,且 x t + 1 = x t + δ ⋅ ∇ f ^ ( x t ) ( 1 ) \mathbf{x}_{t+1}=\mathbf{x}_{t}+\delta \cdot \nabla \hat{f}(\mathbf{x}_{t})\quad (1) xt+1=xt+δ⋅∇f^(xt)(1)
其中 ϵ , δ > 0 \epsilon,\delta >0 ϵ,δ>0 是用户定义的误差及步长, f ^ \hat{f} f^ 是DEF。
- 更高效的迭代公式:
动机:当 $\mathbf{x}_t$ 靠近 $\mathbf{x}^*$ 时,梯度趋于零($\nabla \hat{f}(\mathbf{x}^*)=0$),迭代公式 (1) 步进缓慢、效率低。
而 ∇ f ^ ( x ) = ∂ ∂ x f ^ ( x ) = 1 n h d ∑ i = 1 n ∂ ∂ x K ( x − x i h ) \nabla \hat{f}(\mathbf{x})=\frac{\partial}{\partial \mathbf{x}} \hat{f}(\mathbf{x})=\frac{1}{n h^{d}} \sum\limits_{i=1}^{n} \frac{\partial}{\partial \mathbf{x}} K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right) ∇f^(x)=∂x∂f^(x)=nhd1i=1∑n∂x∂K(hx−xi)
∂ ∂ x K ( z ) = ( 1 ( 2 π ) d / 2 exp { − z T z 2 } ) ⋅ − z ⋅ ∂ z ∂ x = K ( z ) ⋅ − z ⋅ ∂ z ∂ x \begin{aligned} \frac{\partial}{\partial \mathbf{x}} K(\mathbf{z}) &=\left(\frac{1}{(2 \pi)^{d / 2}} \exp \left\{-\frac{\mathbf{z}^{T} \mathbf{z}}{2}\right\}\right) \cdot-\mathbf{z} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \\ &=K(\mathbf{z}) \cdot-\mathbf{z} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \end{aligned} ∂x∂K(z)=((2π)d/21exp{−2zTz})⋅−z⋅∂x∂z=K(z)⋅−z⋅∂x∂z
将 z = x − x i h \mathbf{z}=\frac{\mathbf{x}-\mathbf{x}_i}{h} z=hx−xi 代入得: ∂ ∂ x K ( x − x i h ) = K ( x − x i h ) ⋅ ( x i − x h ) ⋅ ( 1 h ) \frac{\partial}{\partial \mathbf{x}} K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right)=K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right) \cdot\left(\frac{\mathbf{x}_{i}-\mathbf{x}}{h}\right) \cdot\left(\frac{1}{h}\right) ∂x∂K(hx−xi)=K(hx−xi)⋅(hxi−x)⋅(h1)
故有: ∇ f ^ ( x ) = 1 n h d + 2 ∑ i = 1 n K ( x − x i h ) ⋅ ( x i − x ) \nabla \hat{f}(\mathbf{x})=\frac{1}{n h^{d+2}} \sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right) \cdot\left(\mathbf{x}_{i}-\mathbf{x}\right) ∇f^(x)=nhd+21i=1∑nK(hx−xi)⋅(xi−x)
则: 1 n h d + 2 ∑ i = 1 n K ( x ∗ − x i h ) ⋅ ( x i − x ∗ ) = 0 \frac{1}{n h^{d+2}} \sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}^*-\mathbf{x}_{i}}{h}\right) \cdot\left(\mathbf{x}_{i}-\mathbf{x}^*\right)=0 nhd+21i=1∑nK(hx∗−xi)⋅(xi−x∗)=0
故有: x ∗ = ∑ i = 1 n K ( x ∗ − x i h ) ⋅ x i ∑ i = 1 n K ( x ∗ − x i h ) ( 2 ) \mathbf{x}^*=\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}^*-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}^*-\mathbf{x}_{i}}{h}\right)}\quad (2) x∗=i=1∑nK(hx∗−xi)i=1∑nK(hx∗−xi)⋅xi(2)
由(1): x t + 1 − x t = δ ⋅ ∇ f ^ ( x t ) \mathbf{x}_{t+1}-\mathbf{x}_{t}=\delta \cdot \nabla \hat{f}(\mathbf{x}_{t}) xt+1−xt=δ⋅∇f^(xt),(靠近 x ∗ \mathbf{x}^* x∗ 时)近似有: x t + 1 − x t ≈ 0 \mathbf{x}_{t+1}-\mathbf{x}_{t}\approx0 xt+1−xt≈0
且: x t = ∑ i = 1 n K ( x t − x i h ) ⋅ x i ∑ i = 1 n K ( x t − x i h ) \mathbf{x}_t=\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)} xt=i=1∑nK(hxt−xi)i=1∑nK(hxt−xi)⋅xi
故: x t + 1 = ∑ i = 1 n K ( x t − x i h ) ⋅ x i ∑ i = 1 n K ( x t − x i h ) \mathbf{x}_{t+1}=\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)} xt+1=i=1∑nK(hxt−xi)i=1∑nK(hxt−xi)⋅xi
Def.2 称 C ⊆ D C\subseteq \mathbf{D} C⊆D 是基于密度的类,如果存在密度吸引子 x 1 ∗ , … , x m ∗ \mathbf{x}^*_1,\dots,\mathbf{x}^*_m x1∗,…,xm∗ s . t : s.t: s.t:
- ∀ x ∈ C \forall \mathbf{x}\in C ∀x∈C 都有某个 x i ∗ \mathbf{x}^*_i xi∗ 使得, x i ∗ \mathbf{x}^*_i xi∗ 是 x \mathbf{x} x 的密度吸引子;
- ∀ i , f ^ ( x i ∗ ) ≥ ξ \forall i,\hat{f}(\mathbf{x}^*_i)\ge \xi ∀i,f^(xi∗)≥ξ,其中 ξ \xi ξ 是用户指定的极小密度阈值;
- ∀ x i ∗ , x j ∗ \forall\mathbf{x}^*_i,\mathbf{x}^*_j ∀xi∗,xj∗ 都密度可达,即存在路径从 x i ∗ \mathbf{x}^*_i xi∗ 到 x j ∗ \mathbf{x}^*_j xj∗ 使得路径上所有点 y \mathbf{y} y 都有 f ^ ( y ) ≥ ξ \hat{f}(\mathbf{y})\ge\xi f^(y)≥ξ。
算法15.2 : DENCLUE 算法
输入: D , h , ξ , ϵ \mathbf{D},h,\xi,\epsilon D,h,ξ,ϵ
输出: C \mathcal{C} C (基于密度的聚类)
-
A ← ∅ \mathcal{A}\leftarrow\emptyset A←∅
-
对每一个 x ∈ D \mathbf{x}\in \mathbf{D} x∈D:
2.1 x ∗ ← F I N D A T T R A C T O R ( x , D , h , ξ , ϵ ) \mathbf{x}^* \leftarrow FINDATTRACTOR(\mathbf{x},\mathbf{D},h,\xi,\epsilon) x∗←FINDATTRACTOR(x,D,h,ξ,ϵ)
2.2 R ( x ∗ ) ← ∅ R(\mathbf{x}^*)\leftarrow\emptyset R(x∗)←∅
2.3 if f ^ ( x ∗ ) ≥ ξ \hat{f}(\mathbf{x}^*)\ge \xi f^(x∗)≥ξ then:
2.4 A ← A ∪ { x ∗ } \mathcal{A}\leftarrow \mathcal{A}\cup\{ \mathbf{x}^*\} A←A∪{x∗}
2.5 $R(\mathbf{x}^*)\leftarrow R(\mathbf{x}^*)\cup\{ \mathbf{x} \}$(将 $\mathbf{x}$ 归入其吸引子的集合)
-
C ← { maximal C ⊆ A ∣ ∀ x i ∗ , x j ∗ ∈ C , 满 足 D e f 2 条 件 3 } \mathcal{C}\leftarrow\{\text{maximal}\ C \subseteq \mathcal{A}| \forall\mathbf{x}^*_i,\mathbf{x}^*_j \in C, 满足 Def \ 2 条件3 \} C←{maximal C⊆A∣∀xi∗,xj∗∈C,满足Def 2条件3}
-
∀ C ∈ C : \forall C \in \mathcal{C}: ∀C∈C:
4.1 对每一个 x ∗ ∈ C \mathbf{x}^*\in C x∗∈C,令 C ← C ∪ R ( x ∗ ) C\leftarrow C\cup R(\mathbf{x}^*) C←C∪R(x∗)
-
Return C \mathcal{C} C
F I N D A T T R A C T O R ( x , D , h , ξ , ϵ ) FINDATTRACTOR(\mathbf{x},\mathbf{D},h,\xi,\epsilon) FINDATTRACTOR(x,D,h,ξ,ϵ):
-
t ← 0 t\leftarrow 0 t←0
-
x t = x \mathbf{x}_{t}=\mathbf{x} xt=x
-
Repeat:
x t + 1 ← ∑ i = 1 n K ( x t − x i h ) ⋅ x i ∑ i = 1 n K ( x t − x i h ) \mathbf{x}_{t+1}\leftarrow\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)} xt+1←i=1∑nK(hxt−xi)i=1∑nK(hxt−xi)⋅xi
t ← t + 1 t\leftarrow t+1 t←t+1
-
Until ∣ ∣ x t − x t − 1 ∣ ∣ < ϵ ||\mathbf{x}_{t}-\mathbf{x}_{t-1}||<\epsilon ∣∣xt−xt−1∣∣<ϵ
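FINDATTRACTOR 的不动点迭代即 mean-shift。以下是一个实现草图(原笔记无代码;NumPy,接口为笔者假设),高斯核的常数因子在分子分母中约去:

```python
import numpy as np

def find_attractor(x, X, h, eps=1e-5, max_iter=200):
    """按公式 (2) 的不动点迭代寻找 x 的密度吸引子。"""
    xt = x.astype(float).copy()
    for _ in range(max_iter):
        z = (xt - X) / h
        K = np.exp(-0.5 * np.sum(z * z, axis=1))          # 高斯核(略去常数因子)
        xt1 = (K[:, None] * X).sum(axis=0) / K.sum()      # x_{t+1} = Σ K x_i / Σ K
        if np.linalg.norm(xt1 - xt) < eps:
            return xt1
        xt = xt1
    return xt
```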
Chapter 20: Linear Discriminant Analysis
Set up: D = { ( x i , y i ) } i = 1 n \mathbf{D}=\{(\mathbf{x}_i,y_i) \}_{i=1}^n D={(xi,yi)}i=1n, 其中 y i = 1 , 2 y_i=1,2 yi=1,2(或 ± 1 \pm 1 ±1 等), D 1 = { x i ∣ y i = 1 } \mathbf{D}_1=\{\mathbf{x}_i|y_i=1 \} D1={xi∣yi=1}, D 2 = { x i ∣ y i = 2 } \mathbf{D}_2=\{\mathbf{x}_i|y_i=2 \} D2={xi∣yi=2}
Goal:寻找向量 $\mathbf{w}\in \mathbb{R}^d$(代表投影直线的方向),使得投影后 $\mathbf{D}_1,\mathbf{D}_2$ 的“平均值”相距最大且各自“总方差”最小。
20.1 Normal LDA
设 $\mathbf{w} \in \mathbb{R}^d,\mathbf{w}^T\mathbf{w}=1$,则 $\mathbf{x}_i$ 在 $\mathbf{w}$ 方向上的投影为 $\mathbf{x}_{i}^{\prime}=\left(\frac{\mathbf{w}^{T} \mathbf{x}_{i}}{\mathbf{w}^{T} \mathbf{w}}\right) \mathbf{w}=a_{i} \mathbf{w},\ a_{i}=\mathbf{w}^{T} \mathbf{x}_{i}$
则 $\mathbf{D}_1$ 中数据在 $\mathbf{w}$ 上的投影平均值为($|\mathbf{D}_1|=n_1$):$m_1:=\frac{1}{n_1}\sum\limits_{\mathbf{x}_i\in \mathbf{D}_1}a_i=\boldsymbol{\mu}_1^T\mathbf{w}$
投影平均值等于平均值的投影。
类似地,$\mathbf{D}_2$ 中数据在 $\mathbf{w}$ 上的投影平均值为:$m_2:=\frac{1}{n_2}\sum\limits_{\mathbf{x}_i\in \mathbf{D}_2}a_i=\boldsymbol{\mu}_2^T\mathbf{w}$
目标之一:寻找 $\mathbf{w}$ 使得 $(m_1-m_2)^2$ 最大。
对于 $\mathbf{D}_i$,定义:$s_i^2=\sum\limits_{\mathbf{x}_k\in \mathbf{D}_i}(a_k-m_i)^2$
注意:$s_i^2=n_i\sigma^2_i\ (|\mathbf{D}_i|=n_i)$
Goal:Fisher LDA 目标函数:$\max\limits_{\mathbf{w}}J(\mathbf{w})=\frac{(m_1-m_2)^2}{s_1^2+s_2^2}$
注意:$J(\mathbf{w})=J(w_1,w_2,\cdots,w_d)$
$$\begin{aligned} \left(m_{1}-m_{2}\right)^{2} &=\left(\mathbf{w}^{T}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)\right)^{2} \\ &=\mathbf{w}^{T}\left(\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{T}\right) \mathbf{w} \\ &=\mathbf{w}^{T} \mathbf{B} \mathbf{w} \end{aligned}$$
B \mathbf{B} B 被称为类间扩散矩阵
s 1 2 = ∑ x i ∈ D 1 ( a i − m 1 ) 2 = ∑ x i ∈ D 1 ( w T x i − w T μ 1 ) 2 = ∑ x i ∈ D 1 ( w T ( x i − μ 1 ) ) 2 = w T ( ∑ x i ∈ D 1 ( x i − μ 1 ) ( x i − μ 1 ) T ) w = w T S 1 w \begin{aligned} s_{1}^{2} &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(a_{i}-m_{1}\right)^{2} \\ &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(\mathbf{w}^{T} \mathbf{x}_{i}-\mathbf{w}^{T} \boldsymbol{\mu}_{1}\right)^{2} \\ &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(\mathbf{w}^{T}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{1}\right)\right)^{2} \\ &=\mathbf{w}^{T}\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{1}\right)\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{1}\right)^{T}\right) \mathbf{w} \\ &=\mathbf{w}^{T} \mathbf{S}_{1} \mathbf{w} \end{aligned} s12=xi∈D1∑(ai−m1)2=xi∈D1∑(wTxi−wTμ1)2=xi∈D1∑(wT(xi−μ1))2=wT(xi∈D1∑(xi−μ1)(xi−μ1)T)w=wTS1w
S 1 \mathbf{S}_{1} S1 被称为 D 1 \mathbf{D}_1 D1 的扩散矩阵 S 1 = n 1 Σ 1 \mathbf{S}_{1}=n_1\Sigma_1 S1=n1Σ1
类似地, s 2 2 = w T S 2 w s_{2}^{2}=\mathbf{w}^{T} \mathbf{S}_{2} \mathbf{w} s22=wTS2w
令 $\mathbf{S}=\mathbf{S}_{1}+\mathbf{S}_{2}$,则 $\max\limits_{\mathbf{w}}J(\mathbf{w})=\frac{(m_1-m_2)^2}{s_1^2+s_2^2}=\frac{\mathbf{w}^{T} \mathbf{B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S} \mathbf{w}}$
注意:$\frac{d}{d\mathbf{w}}J(\mathbf{w})=\frac{2\mathbf{B}\mathbf{w}(\mathbf{w}^T\mathbf{S}\mathbf{w})-2\mathbf{S}\mathbf{w}(\mathbf{w}^T\mathbf{B}\mathbf{w})}{(\mathbf{w}^T\mathbf{S}\mathbf{w})^2}=\mathbf{0}$
即有:
$$\begin{aligned} \mathbf{B}\mathbf{w}(\mathbf{w}^T\mathbf{S}\mathbf{w})&=\mathbf{S}\mathbf{w}(\mathbf{w}^T\mathbf{B}\mathbf{w})\\ \mathbf{B}\mathbf{w}&=\mathbf{S}\mathbf{w}\cdot\frac{\mathbf{w}^{T} \mathbf{B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S} \mathbf{w}}\\ \mathbf{B}\mathbf{w}&=J(\mathbf{w})\cdot\mathbf{S} \mathbf{w}\quad (*) \end{aligned}$$
若 $\mathbf{S}^{-1}$ 存在,则 $\mathbf{S}^{-1}\mathbf{B}\mathbf{w}=J(\mathbf{w})\cdot\mathbf{w}$。故要最大化 $J(\mathbf{w})$,只需取 $\mathbf{S}^{-1}\mathbf{B}$ 的最大特征值,$\mathbf{w}$ 为其对应的特征向量。
☆ 不求特征向量求出 w \mathbf{w} w 的方法
将 $\mathbf{B}=(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}$ 代入 $(*)$ 得
$$\begin{aligned} (\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\mathbf{w} &=J(\mathbf{w})\cdot\mathbf{S} \mathbf{w}\\ \mathbf{S}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})\left[\frac{(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\mathbf{w}}{J(\mathbf{w})}\right]&=\mathbf{w} \end{aligned}$$
方括号内是标量,故 $\mathbf{w}$ 与 $\mathbf{S}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})$ 方向相同:只需计算 $\mathbf{S}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})$,再单位化。
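以下按上式给出 20.1 节结论的实现草图(原笔记无代码;NumPy,接口为笔者假设),解线性方程组而非显式求逆:

```python
import numpy as np

def fisher_lda(D1, D2):
    """Fisher LDA:w 与 S^{-1}(mu1 - mu2) 同向,再单位化。"""
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
    S1 = (D1 - mu1).T @ (D1 - mu1)          # 扩散矩阵 S_1 = n_1 * Sigma_1
    S2 = (D2 - mu2).T @ (D2 - mu2)
    w = np.linalg.solve(S1 + S2, mu1 - mu2) # 假设 S = S_1 + S_2 非奇异
    return w / np.linalg.norm(w)
```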
20.2 Kernel LDA:
事实1:如果 ( S ϕ − 1 B ϕ ) w = λ w \left(\mathbf{S}_{\phi}^{-1} \mathbf{B}_{\phi}\right) \mathbf{w}=\lambda \mathbf{w} (Sϕ−1Bϕ)w=λw,那么 w = ∑ j = 1 n a j ϕ ( x j ) \mathbf{w}=\sum\limits_{j=1}^na_j\phi(\mathbf{x}_j) w=j=1∑najϕ(xj),证明见讲稿最后两页。
令 a = ( a 1 , ⋯ , a n ) T \mathbf{a}=(a_1,\cdots,a_n)^T a=(a1,⋯,an)T 是“事实1”中的向量。
下面将 $\max\limits_{\mathbf{w}}J(\mathbf{w})=\frac{(m_1-m_2)^2}{s_1^2+s_2^2}=\frac{\mathbf{w}^{T} \mathbf{B}_{\phi} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S}_{\phi} \mathbf{w}}$ 的问题转化为 $\max G(\mathbf{a})$,使得仅用核矩阵 $\mathbf{K}$ 即可求解。
注意到:
$$\begin{aligned} m_{i}=\mathbf{w}^{T} \boldsymbol{\mu}_{i}^{\phi} &=\left(\sum_{j=1}^{n} a_{j} \phi\left(\mathbf{x}_{j}\right)\right)^{T}\left(\frac{1}{n_{i}} \sum_{\mathbf{x}_{k} \in \mathbf{D}_{i}} \phi\left(\mathbf{x}_{k}\right)\right) \\ &=\frac{1}{n_{i}} \sum_{j=1}^{n} \sum_{\mathbf{x}_{k} \in \mathbf{D}_{i}} a_{j} \phi\left(\mathbf{x}_{j}\right)^{T} \phi\left(\mathbf{x}_{k}\right) \\ &=\frac{1}{n_{i}} \sum_{j=1}^{n} \sum_{\mathbf{x}_{k} \in \mathbf{D}_{i}} a_{j} K\left(\mathbf{x}_{j}, \mathbf{x}_{k}\right) \\ &=\mathbf{a}^{T} \mathbf{m}_{i} \end{aligned}$$
其中,
$$\mathbf{m}_{i}=\frac{1}{n_{i}}\left(\begin{array}{c} \sum\limits_{\mathbf{x}_{k} \in \mathbf{D}_{i}} K\left(\mathbf{x}_{1}, \mathbf{x}_{k}\right) \\ \sum\limits_{\mathbf{x}_{k} \in \mathbf{D}_{i}} K\left(\mathbf{x}_{2}, \mathbf{x}_{k}\right) \\ \vdots \\ \sum\limits_{\mathbf{x}_{k} \in \mathbf{D}_{i}} K\left(\mathbf{x}_{n}, \mathbf{x}_{k}\right) \end{array}\right)_{n\times 1}$$
故
$$\begin{aligned} \left(m_{1}-m_{2}\right)^{2} &=\left(\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}-\mathbf{w}^{T} \boldsymbol{\mu}_{2}^{\phi}\right)^{2} \\ &=\left(\mathbf{a}^{T} \mathbf{m}_{1}-\mathbf{a}^{T} \mathbf{m}_{2}\right)^{2} \\ &=\mathbf{a}^{T}\left(\mathbf{m}_{1}-\mathbf{m}_{2}\right)\left(\mathbf{m}_{1}-\mathbf{m}_{2}\right)^{T} \mathbf{a} \\ &=\mathbf{a}^{T} \mathbf{M a} \end{aligned}$$
($\mathbf{M}$ 被称为核类间扩散矩阵)
$$\begin{aligned} s_{1}^{2} &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)-\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2} \\ &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)\right\|^{2}-2 \sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}} \mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right) \cdot \mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}+\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2} \\ &=\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\sum_{j=1}^{n} a_{j} \phi\left(\mathbf{x}_{j}\right)^{T} \phi\left(\mathbf{x}_{i}\right)\right\|^{2}\right)-2 \cdot n_{1} \cdot\left\|\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2}+n_{1} \cdot\left\|\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2}\\ &=\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}} \mathbf{a}^{T} \mathbf{K}_{i} \mathbf{K}_{i}^{T} \mathbf{a}\right)-n_{1} \cdot \mathbf{a}^{T} \mathbf{m}_{1} \mathbf{m}_{1}^{T} \mathbf{a}\\ &=\mathbf{a}^{T}\left(\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}} \mathbf{K}_{i} \mathbf{K}_{i}^{T}\right)-n_{1} \mathbf{m}_{1} \mathbf{m}_{1}^{T}\right) \mathbf{a} \\ &=\mathbf{a}^{T} \mathbf{N}_{1} \mathbf{a} \end{aligned}$$
类似地,令 $\mathbf{N}_2=\left(\sum\limits_{\mathbf{x}_{i} \in \mathbf{D}_{2}} \mathbf{K}_{i} \mathbf{K}_{i}^{T}-n_{2} \mathbf{m}_{2} \mathbf{m}_{2}^{T}\right)$
则 s 1 2 + s 2 2 = a T ( N 1 + N 2 ) a = a T N a s_1^2+s_2^2=\mathbf{a}^{T} (\mathbf{N}_{1}+\mathbf{N}_{2}) \mathbf{a}=\mathbf{a}^{T}\mathbf{N} \mathbf{a} s12+s22=aT(N1+N2)a=aTNa
故: J ( w ) = a T M a a T N a : = G ( a ) J(\mathbf{w})=\frac{\mathbf{a}^{T}\mathbf{M} \mathbf{a}}{\mathbf{a}^{T}\mathbf{N} \mathbf{a}}:=G(\mathbf{a}) J(w)=aTNaaTMa:=G(a)
类似 20.1, M a = λ N a \mathbf{M} \mathbf{a}=\lambda\mathbf{N} \mathbf{a} Ma=λNa
-
若 N − 1 \mathbf{N} ^{-1} N−1 存在, N − 1 M a = λ a \mathbf{N}^{-1} \mathbf{M} \mathbf{a}=\lambda \mathbf{a} N−1Ma=λa, λ \lambda λ 取 N − 1 M \mathbf{N}^{-1} \mathbf{M} N−1M 的最大特征值, a \mathbf{a} a 是相应的特征向量。
-
若 N − 1 \mathbf{N} ^{-1} N−1 不存在,MATLAB 求广义逆
最后考查 w T w = 1 \mathbf{w}^T\mathbf{w}=1 wTw=1,即
$$\begin{aligned} \left(\sum\limits_{j=1}^na_j\phi(\mathbf{x}_j)\right)^T\left(\sum\limits_{i=1}^na_i\phi(\mathbf{x}_i)\right)&=1\\ \sum\limits_{j=1}^n\sum\limits_{i=1}^na_ja_i\phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)&=1\\ \sum\limits_{j=1}^n\sum\limits_{i=1}^na_ja_iK(\mathbf{x}_i,\mathbf{x}_j)&=1\\ \mathbf{a}^T\mathbf{K}\mathbf{a}&=1 \end{aligned}$$
求出 $\mathbf{N}^{-1} \mathbf{M}$ 的特征向量 $\mathbf{a}$ 后,$\mathbf{a}\leftarrow \frac{\mathbf{a}}{\sqrt{\mathbf{a}^T\mathbf{K}\mathbf{a}}}$ 以保证 $\mathbf{w}^T\mathbf{w}=1$。
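以下按 20.2 节的推导给出核 LDA 的实现草图(原笔记无代码;NumPy,标号约定 $y_i\in\{1,2\}$,接口为笔者假设),$\mathbf{N}$ 奇异时用伪逆(对应讲稿中“MATLAB 求广义逆”):

```python
import numpy as np

def kernel_lda(K, y):
    """由核矩阵 K 与标号 y 求系数 a,使 w = Σ a_j φ(x_j) 且 w^T w = 1。"""
    idx1, idx2 = np.flatnonzero(y == 1), np.flatnonzero(y == 2)
    n1, n2 = len(idx1), len(idx2)
    m1 = K[:, idx1].sum(axis=1) / n1                 # 向量 m_1
    m2 = K[:, idx2].sum(axis=1) / n2                 # 向量 m_2
    M = np.outer(m1 - m2, m1 - m2)                   # 核类间扩散矩阵
    K1, K2 = K[:, idx1], K[:, idx2]
    N = (K1 @ K1.T - n1 * np.outer(m1, m1)) + (K2 @ K2.T - n2 * np.outer(m2, m2))
    eigval, eigvec = np.linalg.eig(np.linalg.pinv(N) @ M)
    a = np.real(eigvec[:, np.argmax(np.real(eigval))])
    return a / np.sqrt(a @ K @ a)                    # 归一化保证 a^T K a = 1,即 w^T w = 1
```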
Chapter 21: Support Vector Machines (SVM)
21.1 支撑向量与余量
Set up: D = { ( x i , y i ) } i = 1 n , x i ∈ R d , y i ∈ { − 1 , 1 } \mathbf{D}=\{(\mathbf{x}_i,y_i) \}_{i=1}^n,\mathbf{x}_i \in \mathbb{R}^d,y_i \in \{-1,1 \} D={(xi,yi)}i=1n,xi∈Rd,yi∈{−1,1},仅两类数据。
-
超平面 (hyperplanes, d − 1 d-1 d−1 维): h ( x ) : = w T x + b = w 1 x 1 + ⋯ + w d x d + b h(\mathbf{x}):=\mathbf{w}^T\mathbf{x}+b=w_1x_1+ \cdots +w_dx_d+b h(x):=wTx+b=w1x1+⋯+wdxd+b
其中, w \mathbf{w} w 是法向量, − b w i -{b\over w_i} −wib 是 x i x_i xi 轴上的截距。
-
D \mathbf{D} D 称为是线性可分的,如果存在 h ( x ) h(\mathbf{x}) h(x) 使得对所有 y i = 1 y_i=1 yi=1 的点 x i \mathbf{x}_i xi 有 h ( x i ) > 0 h(\mathbf{x}_i)>0 h(xi)>0 ,且对所有 y i = − 1 y_i=-1 yi=−1 的点 x i \mathbf{x}_i xi 有 h ( x i ) < 0 h(\mathbf{x}_i)<0 h(xi)<0 ,并将此 h ( x ) h(\mathbf{x}) h(x) 称为分离超平面。
Remark:对于线性可分的 D \mathbf{D} D ,分离超平面有无穷多个。
-
点到超平面的距离:
$$\mathbf{x}=\mathbf{x}_p+\mathbf{x}_r=\mathbf{x}_p+r\cdot\frac{\mathbf{w}}{\|\mathbf{w}\|}$$
$$\begin{aligned} h(\mathbf{x}) &=h\left(\mathbf{x}_{p}+r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right) \\ &=\mathbf{w}^{T}\left(\mathbf{x}_{p}+r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right)+b \\ &=\underbrace{\mathbf{w}^{T}\mathbf{x}_{p}+b}_{h\left(\mathbf{x}_{p}\right)=0}+r \frac{\mathbf{w}^{T} \mathbf{w}}{\|\mathbf{w}\|} \\ &=r\|\mathbf{w}\| \end{aligned}$$
$\therefore r=\frac{h(\mathbf{x})}{\|\mathbf{w}\|},\ |r|=\frac{|h(\mathbf{x})|}{\|\mathbf{w}\|}$
故 ∀ x i ∈ D \forall \mathbf{x}_i \in \mathbf{D} ∀xi∈D 到 h ( x ) h(\mathbf{x}) h(x) 的距离是 y i h ( x i ) ∥ w ∥ y_i\frac{h(\mathbf{x}_i)}{\|\mathbf{w}\|} yi∥w∥h(xi)
-
给定线性可分的 D \mathbf{D} D ,及分离超平面 h ( x ) h(\mathbf{x}) h(x) ,定义余量:
δ ∗ = min x i { y i ( w T x i + b ) ∥ w ∥ } \delta^*=\min\limits_{\mathbf{x}_i}\{\frac{y_i(\mathbf{w}^T\mathbf{x}_i+b)}{\|\mathbf{w}\|} \} δ∗=ximin{∥w∥yi(wTxi+b)}
即 D \mathbf{D} D 中点到 h ( x ) h(\mathbf{x}) h(x) 距离的最小值,使得该 δ ∗ \delta^* δ∗ 取到的数据点 x i \mathbf{x}_i xi 被称为支撑向量(可能不唯一)。 -
标准超平面:对 ∀ h ( x ) = w T x + b \forall h(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b ∀h(x)=wTx+b,以及任意 s ∈ R ∖ { 0 } s\in \mathbb{R}\setminus \{0\} s∈R∖{0}, s ( w T x + b ) = 0 s(\mathbf{w}^T\mathbf{x}+b)=0 s(wTx+b)=0 与 h ( x ) = 0 h(\mathbf{x})=0 h(x)=0 是同一超平面。
设 x ∗ \mathbf{x}^* x∗ 是支撑向量,若 s y ∗ ( w T x ∗ + b ) = 1 ( 1 ) sy^*(\mathbf{w}^T\mathbf{x}^*+b)=1\ (1) sy∗(wTx∗+b)=1 (1) ,则称 s h ( x ) = 0 sh(\mathbf{x})=0 sh(x)=0 是标准超平面。
由 ( 1 ) (1) (1) 可得: s = 1 y ∗ ( w T x ∗ + b ) = 1 y ∗ h ( x ∗ ) s=\frac{1}{y^*(\mathbf{w}^T\mathbf{x}^*+b)}=\frac{1}{y^*h(\mathbf{x}^*)} s=y∗(wTx∗+b)1=y∗h(x∗)1
此时,对于 s h ( x ) = 0 sh(\mathbf{x})=0 sh(x)=0 ,余量 δ ∗ = y ∗ h ( x ∗ ) ∥ w ∥ = 1 ∥ w ∥ \delta^*=\frac{y^*h(\mathbf{x}^*)}{\|\mathbf{w}\|}=\frac{1}{\|\mathbf{w}\|} δ∗=∥w∥y∗h(x∗)=∥w∥1
事实:如果 w T x + b = 0 \mathbf{w}^T\mathbf{x}+b=0 wTx+b=0 是标准超平面,对 ∀ x i \forall \mathbf{x}_i ∀xi ,一定有 y i ( w T x i + b ) ≥ 1 y_i(\mathbf{w}^T\mathbf{x}_i+b)\ge1 yi(wTxi+b)≥1
21.2 SVM: 线性可分情形
目标:寻找标准分离超平面使得其余量最大,即 $h^*=\arg\max\limits_{\mathbf{w},b}\left\{\frac{1}{\|\mathbf{w}\|} \right\}$
转为优化问题:
$$\begin{aligned} &\min\limits_{\mathbf{w},b}\ \left\{\frac{\|\mathbf{w}\|^2}{2} \right\}\\ &\text{s.t.} \ y_i(\mathbf{w}^T\mathbf{x}_i+b)\ge1,\ \forall(\mathbf{x}_i,y_i)\in \mathbf{D} \end{aligned}$$
引入 Lagrange 乘子 $\alpha_i\ge0$ 与 KKT 条件:$\alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i+b)-1)=0$
定义:$L(\mathbf{w})=\frac{1}{2}\|\mathbf{w}\|^2-\sum\limits_{i=1}^{n}\alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i+b)-1)\ (2)$
∂ ∂ w L = w − ∑ i = 1 n α i y i x i = 0 ( 3 ) ∂ ∂ b L = ∑ i = 1 n α i y i = 0 ( 4 ) \begin{array}{l} \frac{\partial}{\partial \mathbf{w}} L=\mathbf{w}-\sum\limits_{i=1}^{n} \alpha_{i} y_{i} \mathbf{x}_{i}=\mathbf{0}\ (3) \\ \frac{\partial}{\partial b} L=\sum\limits_{i=1}^{n} \alpha_{i} y_{i}=0\ (4) \end{array} ∂w∂L=w−i=1∑nαiyixi=0 (3)∂b∂L=i=1∑nαiyi=0 (4)
将 (3)(4) 代入 (2) 得:
$$\begin{aligned} L_{d u a l} &=\frac{1}{2} \mathbf{w}^{T} \mathbf{w}-\mathbf{w}^{T}(\underbrace{\sum_{i=1}^{n} \alpha_{i} y_{i} \mathbf{x}_{i}}_{\mathbf{w}})-b\underbrace{ \sum_{i=1}^{n} \alpha_{i} y_{i}}_{0}+\sum_{i=1}^{n} \alpha_{i}\\ &=-\frac{1}{2} \mathbf{w}^{T} \mathbf{w}+\sum_{i=1}^{n} \alpha_{i}\\ &=\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j} \end{aligned}$$
故对偶问题为:
$$\begin{aligned} &\max\limits_{\boldsymbol{\alpha}} L_{dual}=\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j}\\ &\text{s.t.}\ \sum\limits_{i=1}^{n} \alpha_{i} y_{i}=0,\ \alpha_i \ge0,\ \forall i \end{aligned}$$
利用二次规划解出对偶问题,得 $\alpha_1,\cdots,\alpha_n$。
代入 (3) 可得: w = ∑ i = 1 n α i y i x i \mathbf{w}=\sum\limits_{i=1}^{n} \alpha_{i} y_{i} \mathbf{x}_{i} w=i=1∑nαiyixi
使得 α i > 0 \alpha_i>0 αi>0 的数据 x i \mathbf{x}_i xi 给出支撑向量。
对于每一个支撑向量:$y_i(\mathbf{w}^T\mathbf{x}_i+b)=1\Rightarrow b_i=\frac{1}{y_i}-\mathbf{w}^T\mathbf{x}_i=y_i-\mathbf{w}^T\mathbf{x}_i$
取 b = a v g α i > 0 { b i } b=\mathop{avg}_{\alpha_i>0}\{b_i\} b=avgαi>0{bi}
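以下给出线性可分情形对偶问题的求解草图(原笔记无代码;此处用 SciPy 的 SLSQP 求解器代替专用二次规划器,仅作示意,接口为笔者假设):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """求解对偶问题: max Σα_i - (1/2)ΣΣ α_iα_j y_iy_j x_i^T x_j,
    s.t. Σα_iy_i = 0, α_i >= 0。假设数据线性可分。"""
    n = len(y)
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                                      # Q_ij = y_i y_j x_i^T x_j
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()          # 取负号转为最小化
    jac = lambda a: Q @ a - np.ones(n)
    cons = [{"type": "eq", "fun": lambda a: a @ y}]
    res = minimize(obj, np.zeros(n), jac=jac, method="SLSQP",
                   bounds=[(0.0, None)] * n, constraints=cons)
    alpha = res.x
    w = (alpha * y) @ X                                # w = Σ α_i y_i x_i
    sv = alpha > 1e-6                                  # α_i > 0 给出支撑向量
    b = np.mean(y[sv] - X[sv] @ w)                     # b = avg{ y_i - w^T x_i }
    return w, b, alpha
```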