Data Mining and Analysis — Course Notes

  • Reference text: Data Mining and Analysis, Mohammed J. Zaki and Wagner Meira Jr.


Chapter 1: Preliminaries

1.1 The Data Matrix

Def.1. A data matrix is an $(n\times d)$ matrix
$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n1} & x_{n2} & \cdots & x_{nd} \end{array}\right)$$
Rows are entities (points); columns are attributes.

Ex. The Iris data matrix
$$\left(\begin{array}{c|ccccc} & \text{sepal length} & \text{sepal width} & \text{petal length} & \text{petal width} & \text{class} \\ & X_{1} & X_{2} & X_{3} & X_{4} & X_{5} \\ \hline \mathbf{x}_{1} & 5.9 & 3.0 & 4.2 & 1.5 & \text{Iris-versicolor} \end{array}\right)$$

1.2 Attributes

Def.2.

  • A numeric attribute is an attribute that takes real (or integer) values.
  • A numeric attribute whose range is a finite or countably infinite set is called a discrete numeric attribute; if it takes only two values, it is called a binary attribute.
  • A numeric attribute whose range is not discrete is called a continuous numeric attribute.

Def.3. A categorical attribute is an attribute whose values are symbols.

1.3 Algebraic and Geometric View

Assume all attributes of $\mathbf{D}$ are numeric. Each row (point) is
$$\mathbf{x}_{i}=\left(x_{i1}, x_{i2}, \ldots, x_{id}\right)^{T} \in \mathbb{R}^{d},\quad i=1,\cdots,n$$

and each column (attribute) is
$$X_{j}=\left(x_{1j}, x_{2j}, \ldots, x_{nj}\right)^{T} \in \mathbb{R}^{n},\quad j=1,\cdots,d$$
By default, all vectors are column vectors.

1.3.1 Distance and Angle

Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{d}$.

  • Dot product: $\mathbf{a}^{T}\mathbf{b}=\sum_{i=1}^{d} a_ib_i$
  • Length (Euclidean norm): $\|\mathbf{a}\|=\sqrt{\mathbf{a}^{T}\mathbf{a}}=\sqrt{\sum_{i=1}^{d} a_i^2}$; normalization: $\frac{\mathbf{a}}{\|\mathbf{a}\|}$
  • Distance: $\delta(\mathbf{a},\mathbf{b})=\|\mathbf{a}-\mathbf{b}\|=\sqrt{\sum_{i=1}^{d}(a_i-b_i)^2}$
  • Angle: $\cos\theta=\left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^{T}\left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right)$, i.e., the dot product of the normalized vectors
  • Orthogonality: $\mathbf{a}$ and $\mathbf{b}$ are orthogonal if $\mathbf{a}^{T}\mathbf{b}=0$
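A minimal numpy sketch of these operations (the vectors $\mathbf{a},\mathbf{b}$ are made-up values for illustration):

```python
import numpy as np

# Hypothetical vectors a, b in R^3 illustrating the definitions above.
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 0.0])

dot = a @ b                           # dot product a^T b
norm_a = np.sqrt(a @ a)               # Euclidean norm ||a||
dist = np.linalg.norm(a - b)          # distance delta(a, b)
cos_theta = (a / np.linalg.norm(a)) @ (b / np.linalg.norm(b))  # cosine of the angle

print(dot, norm_a, dist, cos_theta)
```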

1.3.2 Mean and Total Variance

Def.3.

  • Mean: $mean(\mathbf{D})=\hat{\boldsymbol{\mu}}=\frac{1}{n}\sum_{i=1}^n \mathbf{x}_i \in \mathbb{R}^{d}$

  • Total variance: $var(\mathbf{D})=\frac{1}{n}\sum_{i=1}^{n}\delta\left(\mathbf{x}_{i},\hat{\boldsymbol{\mu}}\right)^{2}$

    Verify yourself: $var(\mathbf{D})=\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_{i}-\hat{\boldsymbol{\mu}}\|^2=\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_{i}\|^2-\|\hat{\boldsymbol{\mu}}\|^2$

  • Centered data matrix: $center(\mathbf{D})=\begin{pmatrix} \mathbf{x}_{1}^T-\hat{\boldsymbol{\mu}}^T\\ \vdots \\ \mathbf{x}_{n}^T-\hat{\boldsymbol{\mu}}^T \end{pmatrix}$

    Clearly the mean of $center(\mathbf{D})$ is $\mathbf{0}\in\mathbb{R}^{d}$.
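The mean, the total-variance identity, and centering can be checked numerically. A sketch with a made-up $4\times 2$ data matrix:

```python
import numpy as np

# Toy data matrix D (n=4 points in R^2); values are made up for illustration.
D = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0],
              [3.0, 2.0]])
n = D.shape[0]

mu = D.mean(axis=0)                                      # mean(D) in R^d
var_D = np.sum(np.linalg.norm(D - mu, axis=1)**2) / n    # total variance
# Identity to verify: var(D) = average of ||x_i||^2 minus ||mu||^2
var_alt = np.sum(np.linalg.norm(D, axis=1)**2) / n - np.linalg.norm(mu)**2

Z = D - mu                                               # centered data matrix
print(np.allclose(var_D, var_alt), np.allclose(Z.mean(axis=0), 0.0))
```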

1.3.3 Orthogonal Projection

Def.4. Let $\mathbf{a},\mathbf{b}\in\mathbb{R}^{d}$. The orthogonal decomposition of $\mathbf{b}$ along $\mathbf{a}$ writes $\mathbf{b}=\mathbf{p}+\mathbf{r}$, where $\mathbf{p}$ is the orthogonal projection of $\mathbf{b}$ onto $\mathbf{a}$, and $\mathbf{r}$ is the residual component of $\mathbf{b}$ perpendicular to $\mathbf{a}$.


Assume $\mathbf{a}\ne\mathbf{0}$, $\mathbf{b}\ne\mathbf{0}$.

Write $\mathbf{p}=c\cdot\mathbf{a}$ ($c\ne 0$, $c\in\mathbb{R}$) and $\mathbf{r}=\mathbf{b}-\mathbf{p}=\mathbf{b}-c\mathbf{a}$. Then
$$0=\mathbf{p}^T\mathbf{r}=(c\cdot\mathbf{a})^T(\mathbf{b}-c\mathbf{a})=c\cdot(\mathbf{a}^T\mathbf{b}-c\cdot\mathbf{a}^T\mathbf{a})$$

$$c=\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}},\qquad \mathbf{p}=\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\cdot\mathbf{a}$$
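The derivation above translates directly into code. A sketch (example vectors are made up):

```python
import numpy as np

# Orthogonal projection of b onto a.
a = np.array([2.0, 0.0])
b = np.array([1.0, 1.0])

c = (a @ b) / (a @ a)   # scalar coefficient c = a^T b / a^T a
p = c * a               # projection p of b onto a
r = b - p               # residual, perpendicular to a

print(p, r, p @ r)      # p and r are orthogonal: p^T r = 0
```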

1.3.4 Linear Independence and Dimension

All as in linear algebra; read on your own.

1.4 Probabilistic View

Each numeric attribute $X$ is treated as a random variable, i.e., a function $X:\mathcal{O}\rightarrow\mathbb{R}$,

where $\mathcal{O}$, the domain of $X$, is the set of all possible experimental outcomes, i.e., the sample space, and $\mathbb{R}$, the set of all real numbers, is the codomain of $X$.

Notes:

  • A random variable is a function.
  • If $\mathcal{O}$ is itself numeric (i.e., $\mathcal{O}\subseteq\mathbb{R}$), then $X$ is the identity function: $X(v)=v$.
  • If the range of $X$ is a finite or countably infinite set, $X$ is called a discrete random variable; otherwise, it is a continuous random variable.

Def.5. X X X 是离散的,那么 X X X 的概率质量函数(probability mass function, PMF)为:
∀ x ∈ R , f ( x ) = P ( X = x ) \forall x \in \mathbb{R},f(x)=P(X=x) xR,f(x)=P(X=x)
注: f ( x ) ≥ 0 , ∑ x f ( x ) = 1 f(x)\ge0,\sum\limits_xf(x)=1 f(x)0,xf(x)=1 f ( x ) = 0 f(x)=0 f(x)=0,如果 x ∉ x\notin x/ ( x x x 的值域)。

Def.6. X X X 是连续的,那么 X X X 的概率密度函数(probability density function, PDF)为:
P ( X ∈ [ a , b ] ) = ∫ a b f ( x ) d x P(X\in [a,b])=\int_{a}^{b} f(x)dx P(X[a,b])=abf(x)dx
f ( x ) ≥ 0 , ∫ − ∞ + ∞ f ( x ) = 1 f(x)\ge0,\int_{-\infty}^{+\infty}f(x)=1 f(x)0,+f(x)=1

Def.7. 对任意随机变量 X X X ,定义累积分布函数(cumulative distributution function, CDF)
F : R → [ 0 , 1 ] , ∀ x ∈ R , F ( x ) = P ( X ≤ x ) F:\mathbb{R}\to[0,1],\forall x\in \mathbb{R},F(x)=P(X\le x) F:R[0,1],xR,F(x)=P(Xx)
X X X 是离散的, F ( x ) = ∑ u ≤ x f ( u ) F(x)=\sum\limits_{u\le x}f(u) F(x)=uxf(u)

X X X 是连续的, F ( x ) = ∫ − ∞ x f ( u ) d u F(x)=\int_{-\infty}^xf(u)du F(x)=xf(u)du

1.4.1 Bivariate Random Variables

$\mathbf{X}=\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, $\mathbf{X}:\mathcal{O}\to\mathbb{R}^2$, where $X_1$ and $X_2$ are each random variables.

Many of these concepts were skipped in lecture; they are filled in here.

Def.8. X 1 X_1 X1 X 2 X_2 X2 都是离散,那么 X \mathbf{X} X 的联合概率质量函数被定义为:
f ( x ) = f ( x 1 , x 2 ) = P ( X 1 = x 1 , X 2 = x 2 ) = P ( X = x ) f(\mathbf{x})=f(x_1,x_2)=P(X_1=x_1,X_2=x_2)=P(\mathbf{X}=\mathbf{x}) f(x)=f(x1,x2)=P(X1=x1,X2=x2)=P(X=x)
注: f ( x ) ≥ 0 , ∑ x 1 ∑ x 2 f ( x 1 , x 2 ) = 1 f(x)\ge0,\sum\limits_{x_1}\sum\limits_{x_2}f(x_1,x_2)=1 f(x)0,x1x2f(x1,x2)=1

Def.9. X 1 X_1 X1 X 2 X_2 X2 都是连续,那么 X \mathbf{X} X 的联合概率密度函数被定义为:
P ( X ∈ W ) = ∬ x ∈ W f ( x ) d x = ∬ ( x 1 , x 2 ) ∈ T W f ( x 1 , x 2 ) d x 1 d x 2 P(\mathbf{X} \in W)=\iint\limits_{\mathbf{x} \in W} f(\mathbf{x}) d \mathbf{x}=\iint\limits_{\left(x_{1}, x_{2}\right)^T_{\in} W} f\left(x_{1}, x_{2}\right) d x_{1} d x_{2} P(XW)=xWf(x)dx=(x1,x2)TWf(x1,x2)dx1dx2
其中, W ⊂ R 2 W \subset \mathbb{R}^2 WR2 f ( x ) ≥ 0 , ∬ x ∈ R 2 f ( x ) d x = 1 f(\mathbf{x})\ge0,\iint\limits_{\mathbf{x}\in\mathbb{R}^2}f(\mathbf{x})d\mathbf{x}=1 f(x)0,xR2f(x)dx=1

Def.10. X \mathbf{X} X 的联合累积分布函数 F F F
F ( x 1 , x 2 ) = P ( X 1 ≤ x 1  and  X 2 ≤ x 2 ) = P ( X ≤ x ) F(x_1,x_2)=P(X_1\le x_1 \text{ and } X_2\le x_2)=P(\mathbf{X}\le\mathbf{x}) F(x1,x2)=P(X1x1 and X2x2)=P(Xx)
Def.11. X 1 X_1 X1 X 2 X_2 X2 是独立的,如果 ∀ W 1 ⊂ R \forall W_1\subset \mathbb{R} W1R ∀ W 2 ⊂ R \forall W_2\subset \mathbb{R} W2R
P ( X 1 ∈ W 1  and  X 2 ∈ W 2 ) = P ( X 1 ∈ W 1 ) ⋅ ( X 2 ∈ W 2 ) P(X_1\in W_1 \text{ and } X_2\in W_2)=P(X_1\in W_1)\cdot(X_2\in W_2) P(X1W1 and X2W2)=P(X1W1)(X2W2)
Prop. 如果 X 1 X_1 X1 X 2 X_2 X2 是独立的,那么
F ( x 1 , x 2 ) = F 1 ( x 1 ) ⋅ F 2 ( x 2 ) f ( x 1 , x 2 ) = f 1 ( x 1 ) ⋅ f 2 ( x 2 ) F(x_1,x_2)=F_1(x_1)\cdot F_2(x_2)\\ f(x_1,x_2)=f_1(x_1)\cdot f_2(x_2) F(x1,x2)=F1(x1)F2(x2)f(x1,x2)=f1(x1)f2(x2)
其中 F i F_i Fi X i X_i Xi 的累积分布函数, f i f_i fi x i x_i xi 的 PMF 或 PDF。

1.4.2 Multivariate Random Variables

The definitions of Section 1.4.1 generalize in parallel.

1.4.3 Random Samples and Statistics

Def.12. Given a random variable $X$, a random sample of size $n$ from $X$ is a collection of $n$ independent and identically distributed random variables $S_1,S_2,\cdots,S_n$, each with the same PMF or PDF as $X$.

Def.13. A statistic $\hat{\theta}$ is defined as a function of a random sample, $\hat{\theta}:(S_1,S_2,\cdots,S_n)\to\mathbb{R}$.

Note that $\hat{\theta}$ is itself a random variable.

Chapter 2: Numeric Attributes

We take the algebraic, geometric, and statistical views.

2.1 Univariate Analysis

We focus on a single attribute: $\mathbf{D}=\left(\begin{array}{c} X \\ \hline x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{array}\right)$, $x_i\in\mathbb{R}$.

Statistical view: $X$ is treated as a random variable, each $x_i$ as an identity random variable, and $x_1,\cdots,x_n$ as a random sample of size $n$ drawn from $X$.

Def.1. The empirical cumulative distribution function

Def.2. The inverse cumulative distribution function

Def.3. The empirical probability mass function of a random variable $X$ is
$$\hat{f}(x)=\frac{1}{n}\sum_{i=1}^{n} I(x_{i}=x),\ \forall x\in\mathbb{R},\qquad I(x_{i}=x)=\begin{cases} 1, & x_i=x\\ 0, & x_i\ne x \end{cases}$$
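The empirical PMF (and, analogously, the empirical CDF of Def.1) is just an average of indicators. A sketch with a made-up sample:

```python
import numpy as np

# Empirical PMF/CDF as averages of indicator functions (sample is made up).
x_sample = np.array([2, 3, 3, 5, 3, 2])

def empirical_pmf(x, sample):
    return np.mean(sample == x)   # (1/n) * sum of indicators I(x_i = x)

def empirical_cdf(x, sample):
    return np.mean(sample <= x)   # (1/n) * sum of indicators I(x_i <= x)

print(empirical_pmf(3, x_sample), empirical_cdf(3, x_sample))
```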

2.1.1 Measures of Central Tendency

Def.4. The expected value of a discrete random variable $X$ is $\mu:=E(X)=\sum_{x} xf(x)$, where $f(x)$ is the PMF of $X$.

The expected value of a continuous random variable $X$ is $\mu:=E(X)=\int_{-\infty}^{+\infty} xf(x)\,dx$, where $f(x)$ is the PDF of $X$.

Linearity: $E(aX+bY)=aE(X)+bE(Y)$.

Def.5. The sample mean of $X$ is $\hat{\mu}=\frac{1}{n}\sum_{i=1}^{n}x_i$; note that $\hat{\mu}$ is an estimator of $\mu$.

Def.6. An estimator (statistic) $\hat{\theta}$ is called an unbiased estimator of a parameter $\theta$ if $E(\hat{\theta})=\theta$.

Prove yourself: the sample mean $\hat{\mu}$ is an unbiased estimator of the expected value $\mu$, using $E(x_i)=\mu$ for all $x_i$.

Def.7. An estimator is robust if it is not affected by extreme values in the sample. (The sample mean is not robust.)

Def.8. The median of a random variable $X$

Def.9. The sample median of a random variable $X$

Def.10. The mode of a random variable $X$; the sample mode of a random variable $X$

2.1.2 Measures of Dispersion

Def.11. The range and sample range of a random variable $X$

Def.12. The interquartile range of a random variable $X$; the sample interquartile range

Def.13. The variance of a random variable $X$ is
$$\sigma^{2}=\operatorname{var}(X)=E\left[(X-\mu)^{2}\right]=\begin{cases} \sum_{x}(x-\mu)^{2} f(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty}(x-\mu)^{2} f(x)\,dx & \text{if } X \text{ is continuous} \end{cases}$$

The standard deviation $\sigma$ is the positive square root of $\sigma^2$.

Note: the variance is the second central moment (moment about the mean); the $r$-th central moment is $E[(X-\mu)^r]$.

Properties:

  1. $\sigma^2=E(X^2)-\mu^2=E(X^2)-[E(X)]^2$
  2. $var(X_1+X_2)=var(X_1)+var(X_2)$ when $X_1,X_2$ are independent

Def.14. The sample variance is $\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{2}$; note the denominator is $n$, not $n-1$.

Geometric meaning of the sample variance: consider the centered data matrix
$$C:=\begin{pmatrix} x_{1}-\hat{\mu} \\ x_{2}-\hat{\mu} \\ \vdots \\ x_{n}-\hat{\mu} \end{pmatrix},\qquad n\cdot\hat{\sigma}^2=\sum_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)^{2}=\|C\|^2$$
Question: what are the expected value and variance of the sample mean of $X$?
$$E(\hat{\mu})=E\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)=\frac{1}{n}\sum_{i=1}^{n}E(x_i)=\frac{1}{n}\sum_{i=1}^{n}\mu=\mu$$
For the variance there are two approaches: expand directly, or use the fact that $x_1,\cdots,x_n$ are i.i.d.:
$$var\left(\sum_{i=1}^{n}x_i\right)=\sum_{i=1}^{n}var(x_i)=n\cdot\sigma^2 \Longrightarrow var(\hat{\mu})=\frac{\sigma^2}{n}$$
Note: the sample variance is a biased estimator, since $E(\hat{\sigma}^2)=\left(\frac{n-1}{n}\right)\sigma^2\xrightarrow{n\to+\infty}\sigma^2$.
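The bias $E(\hat{\sigma}^2)=\frac{n-1}{n}\sigma^2$ can be seen empirically. A simulation sketch (distribution, sample size, and trial count are arbitrary choices):

```python
import numpy as np

# Simulation illustrating E[sigma_hat^2] = ((n-1)/n) * sigma^2 for the
# n-denominator sample variance.
rng = np.random.default_rng(0)
n, trials, sigma2 = 5, 200_000, 4.0

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
sigma2_hat = samples.var(axis=1, ddof=0)   # ddof=0: divide by n (biased)

print(sigma2_hat.mean())   # close to (n-1)/n * sigma2 = 3.2, not 4.0
```

Passing `ddof=1` instead divides by $n-1$ and removes the bias.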

2.2 Bivariate Analysis

2.3 Multivariate Analysis

$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n1} & x_{n2} & \cdots & x_{nd} \end{array}\right)$$

This can be viewed as the random vector $\mathbf{X}=(X_1,\cdots,X_d)^T$.

Def.15. The mean vector of a random vector $\mathbf{X}$ is $E[\mathbf{X}]=\begin{pmatrix} E[X_{1}] \\ E[X_{2}] \\ \vdots \\ E[X_{d}] \end{pmatrix}$

The sample mean is $\hat{\boldsymbol{\mu}}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_{i}\ (=mean(\mathbf{D}))\in\mathbb{R}^{d}$.

Def.16. The covariance of $X_1,X_2$ is defined as $\sigma_{12}=E[(X_1-E(X_1))(X_2-E(X_2))]=E(X_1X_2)-E(X_1)E(X_2)$.

Remark:

  1. $\sigma_{12}=\sigma_{21}$
  2. If $X_1$ and $X_2$ are independent, then $\sigma_{12}=0$.

Def.17. The covariance matrix of a random vector $\mathbf{X}=(X_1,\cdots,X_d)^T$ is defined as
$$\boldsymbol{\Sigma}=E\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^{T}\right]=\left(\begin{array}{cccc} \sigma_{1}^{2} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{2}^{2} & \cdots & \sigma_{2d} \\ \cdots & \cdots & \cdots & \cdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{d}^{2} \end{array}\right)_{d\times d}$$
It is a symmetric matrix; the generalized variance of $\mathbf{X}$ is defined as $det(\boldsymbol{\Sigma})$.

  1. $\boldsymbol{\Sigma}$ is real symmetric and positive semidefinite, i.e., all its eigenvalues are nonnegative: $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_d\ge 0$.
  2. $var(\mathbf{D})=tr(\boldsymbol{\Sigma})=\sigma_1^2+\cdots+\sigma_d^2$

Def.18. The sample covariance matrix of $\mathbf{X}=(X_1,\cdots,X_d)^T$ is defined as
$$\hat{\boldsymbol{\Sigma}}=\frac{1}{n}\left(\mathbf{Z}^{T}\mathbf{Z}\right)=\frac{1}{n}\left(\begin{array}{cccc} Z_{1}^{T} Z_{1} & Z_{1}^{T} Z_{2} & \cdots & Z_{1}^{T} Z_{d} \\ Z_{2}^{T} Z_{1} & Z_{2}^{T} Z_{2} & \cdots & Z_{2}^{T} Z_{d} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{d}^{T} Z_{1} & Z_{d}^{T} Z_{2} & \cdots & Z_{d}^{T} Z_{d} \end{array}\right)_{d\times d}$$
where
$$\mathbf{Z}=\mathbf{D}-\mathbf{1}\cdot\hat{\boldsymbol{\mu}}^{T}=\left(\begin{array}{c} \mathbf{x}_{1}^{T}-\hat{\boldsymbol{\mu}}^{T} \\ \mathbf{x}_{2}^{T}-\hat{\boldsymbol{\mu}}^{T} \\ \vdots \\ \mathbf{x}_{n}^{T}-\hat{\boldsymbol{\mu}}^{T} \end{array}\right)=\left(\begin{array}{ccc} - & \mathbf{z}_{1}^{T} & - \\ - & \mathbf{z}_{2}^{T} & - \\ & \vdots & \\ - & \mathbf{z}_{n}^{T} & - \end{array}\right)=\left(\begin{array}{cccc} \mid & \mid & & \mid \\ Z_{1} & Z_{2} & \cdots & Z_{d} \\ \mid & \mid & & \mid \end{array}\right)$$

The sample total variance is $tr(\hat{\boldsymbol{\Sigma}})$, and the generalized sample variance is $det(\hat{\boldsymbol{\Sigma}})\ge0$.

Equivalently, $\hat{\boldsymbol{\Sigma}}=\frac{1}{n}\sum_{i=1}^n\mathbf{z}_{i}\mathbf{z}_{i}^T$.
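A numeric sketch of Def.18, cross-checked against numpy's own covariance (data values are made up; note `ddof=0` matches the $1/n$ convention used here):

```python
import numpy as np

# Sample covariance matrix via the centered matrix Z.
D = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0],
              [3.0, 2.0]])
n = D.shape[0]

Z = D - D.mean(axis=0)          # centered data matrix
Sigma_hat = (Z.T @ Z) / n       # (1/n) Z^T Z

# Cross-checks: numpy's covariance with ddof=0, and trace = total variance.
assert np.allclose(Sigma_hat, np.cov(D, rowvar=False, ddof=0))
total_var = np.sum(np.linalg.norm(Z, axis=1)**2) / n
print(np.isclose(np.trace(Sigma_hat), total_var))
```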

Chapter 5: Kernel Methods

Example 5.1 omitted; $\phi$ (the kernel map): $\Sigma^*$ (input space) $\to\mathbb{R}^4$ (feature space).

Def.1. Given a kernel map $\phi:\mathcal{I}\to\mathcal{F}$, the kernel function of $\phi$ is the function $K:\mathcal{I}\times\mathcal{I}\to\mathbb{R}$ such that $\forall(\mathbf{x}_i,\mathbf{x}_j)\in\mathcal{I}\times\mathcal{I}$, $K(\mathbf{x}_i,\mathbf{x}_j)=\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)$.

Example 5.2. Let $\phi:\mathbb{R}^2\to\mathbb{R}^3$ with $\forall\mathbf{a}=(a_1,a_2)^T$, $\phi(\mathbf{a})=(a_1^2,a_2^2,\sqrt{2}a_1a_2)^T$.

Note that $K(\mathbf{a},\mathbf{b})=\phi(\mathbf{a})^T\phi(\mathbf{b})=a_1^2b_1^2+a_2^2b_2^2+2a_1a_2b_1b_2=(\mathbf{a}^T\mathbf{b})^2$, with $K:\mathbb{R}^2\times\mathbb{R}^2\to\mathbb{R}$.

Remark

  1. Kernels let us analyze complex data.
  2. Kernels capture nonlinear features (see, e.g., discussions on Zhihu of what kernel functions are for).

Goal: analyze the structure of the feature space $\mathcal{F}$ through $K$ alone, without knowing $\phi$.

5.1 The Kernel Matrix

Given $\mathbf{D}=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\}\subset\mathcal{I}$, its kernel matrix is defined as $\mathbf{K}=[K(\mathbf{x}_{i},\mathbf{x}_{j})]_{n\times n}$.

Prop. The kernel matrix $\mathbf{K}$ is symmetric and positive semidefinite.

Proof. $K(\mathbf{x}_{i},\mathbf{x}_{j})=\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)=\phi^T(\mathbf{x}_j)\phi(\mathbf{x}_i)=K(\mathbf{x}_{j},\mathbf{x}_{i})$, so $\mathbf{K}$ is symmetric.

For any $\mathbf{a}\in\mathbb{R}^n$,
$$\begin{aligned} \mathbf{a}^{T}\mathbf{K}\mathbf{a} &=\sum_{i=1}^{n}\sum_{j=1}^{n} a_{i} a_{j} K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right) \\ &=\sum_{i=1}^{n}\sum_{j=1}^{n} a_{i} a_{j}\phi\left(\mathbf{x}_{i}\right)^{T}\phi\left(\mathbf{x}_{j}\right) \\ &=\left(\sum_{i=1}^{n} a_{i}\phi\left(\mathbf{x}_{i}\right)\right)^{T}\left(\sum_{j=1}^{n} a_{j}\phi\left(\mathbf{x}_{j}\right)\right) \\ &=\left\|\sum_{i=1}^{n} a_{i}\phi\left(\mathbf{x}_{i}\right)\right\|^2 \geq 0 \end{aligned}$$

5.1.1 Reconstructing the Kernel Map

The "empirical kernel map".

Given: $\mathbf{D}=\{\mathbf{x}_{i}\}_{i=1}^{n}\subset\mathcal{I}$ and the kernel matrix $\mathbf{K}$.

Goal: find $\phi:\mathcal{I}\to\mathcal{F}\subset\mathbb{R}^n$.

First attempt: $\forall\mathbf{x}\in\mathcal{I}$, $\phi(\mathbf{x})=\left(K\left(\mathbf{x}_{1},\mathbf{x}\right), K\left(\mathbf{x}_{2},\mathbf{x}\right),\ldots, K\left(\mathbf{x}_{n},\mathbf{x}\right)\right)^{T}\in\mathbb{R}^{n}$

Check: does $\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)\overset{?}{=}K(\mathbf{x}_{i},\mathbf{x}_{j})$ hold?

LHS $=\phi\left(\mathbf{x}_{i}\right)^{T}\phi\left(\mathbf{x}_{j}\right)=\sum_{k=1}^{n} K\left(\mathbf{x}_{k},\mathbf{x}_{i}\right) K\left(\mathbf{x}_{k},\mathbf{x}_{j}\right)=\mathbf{K}_{i}^{T}\mathbf{K}_{j}$, where $\mathbf{K}_{i}$ denotes the $i$-th row (equivalently, column) of $\mathbf{K}$; requiring this to equal $K(\mathbf{x}_i,\mathbf{x}_j)$ is too strong.

Improvement: find a matrix $\mathbf{A}$ such that $\mathbf{K}_{i}^{T}\mathbf{A}\mathbf{K}_{j}=K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)$, i.e., $\mathbf{K}^{T}\mathbf{A}\mathbf{K}=\mathbf{K}$.

It suffices to take $\mathbf{A}=\mathbf{K}^{-1}$ (assuming $\mathbf{K}$ is invertible).

If $\mathbf{K}$ is positive definite, $\mathbf{K}^{-1}$ is also positive definite, so there exists a real matrix $\mathbf{B}$ satisfying $\mathbf{K}^{-1}=\mathbf{B}^{T}\mathbf{B}$.

Hence the empirical kernel map can be defined as
$$\phi(\mathbf{x})=\mathbf{B}\cdot\left(K\left(\mathbf{x}_{1},\mathbf{x}\right), K\left(\mathbf{x}_{2},\mathbf{x}\right),\ldots, K\left(\mathbf{x}_{n},\mathbf{x}\right)\right)^{T}$$
Check: $\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)=(\mathbf{B}\mathbf{K}_i)^T(\mathbf{B}\mathbf{K}_j)=\mathbf{K}_i^T\mathbf{K}^{-1}\mathbf{K}_j=(\mathbf{K}^T\mathbf{K}^{-1}\mathbf{K})_{i,j}=K(\mathbf{x}_{i},\mathbf{x}_{j})$
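A numeric sketch of the empirical kernel map, taking $\mathbf{B}=\boldsymbol{\Lambda}^{-1/2}\mathbf{U}^T$ from the eigendecomposition $\mathbf{K}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ so that $\mathbf{B}^T\mathbf{B}=\mathbf{K}^{-1}$ (the data, the linear kernel, and the small ridge term are illustrative choices to keep $\mathbf{K}$ positive definite):

```python
import numpy as np

# Empirical kernel map phi(x) = B * (K(x_1,x), ..., K(x_n,x))^T with B^T B = K^{-1}.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
K = X @ X.T + np.eye(3) * 0.1       # linear kernel plus a ridge: positive definite

lam, U = np.linalg.eigh(K)          # K = U diag(lam) U^T
B = np.diag(lam ** -0.5) @ U.T      # then B^T B = U diag(1/lam) U^T = K^{-1}
Phi = B @ K                         # column i is phi(x_i) = B K_i

print(np.allclose(Phi.T @ Phi, K))  # phi(x_i)^T phi(x_j) recovers K(x_i, x_j)
```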

5.1.2 The Data-Specific Mercer Kernel Map

For a symmetric positive semidefinite matrix $\mathbf{K}_{n\times n}$ there is a decomposition
$$\mathbf{K}=\mathbf{U}\left(\begin{array}{cccc} \lambda_{1} & 0 & \cdots & 0 \\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{n} \end{array}\right)\mathbf{U}^{T}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}$$
where the $\lambda_{i}$ are the eigenvalues, $\mathbf{U}=\left(\begin{array}{cccc} \mid & \mid & & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} & \cdots & \mathbf{u}_{n} \\ \mid & \mid & & \mid \end{array}\right)$ is an orthogonal matrix, and the $\mathbf{u}_{i}=\left(u_{i1}, u_{i2},\ldots, u_{in}\right)^{T}\in\mathbb{R}^{n}$ are the eigenvectors. That is,
$$\mathbf{K}=\lambda_{1}\mathbf{u}_{1}\mathbf{u}_{1}^{T}+\lambda_{2}\mathbf{u}_{2}\mathbf{u}_{2}^{T}+\cdots+\lambda_{n}\mathbf{u}_{n}\mathbf{u}_{n}^{T}$$
$$K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)=\lambda_{1} u_{1i} u_{1j}+\lambda_{2} u_{2i} u_{2j}+\cdots+\lambda_{n} u_{ni} u_{nj}=\sum_{k=1}^{n}\lambda_{k} u_{ki} u_{kj}$$
Define the Mercer map:
$$\forall\mathbf{x}_i\in\mathbf{D},\quad\phi\left(\mathbf{x}_{i}\right)=\left(\sqrt{\lambda_{1}} u_{1i},\sqrt{\lambda_{2}} u_{2i},\ldots,\sqrt{\lambda_{n}} u_{ni}\right)^{T}$$
Check:
$$\begin{aligned}\phi\left(\mathbf{x}_{i}\right)^{T}\phi\left(\mathbf{x}_{j}\right)&=\left(\sqrt{\lambda_{1}} u_{1i},\ldots,\sqrt{\lambda_{n}} u_{ni}\right)\left(\sqrt{\lambda_{1}} u_{1j},\ldots,\sqrt{\lambda_{n}} u_{nj}\right)^{T}\\ &=\lambda_{1} u_{1i} u_{1j}+\cdots+\lambda_{n} u_{ni} u_{nj}=K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)\end{aligned}$$
Note: the Mercer map is defined only for the points $\mathbf{x}_i\in\mathbf{D}$.
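A numeric sketch of the Mercer map, built from a Gaussian kernel matrix over made-up 1-D points (the kernel choice and bandwidth are assumptions for illustration):

```python
import numpy as np

# Data-specific Mercer map: phi(x_i) = (sqrt(lam_1) u_{1i}, ..., sqrt(lam_n) u_{ni})^T.
X = np.array([[0.0], [1.0], [3.0]])
sq = (X - X.T) ** 2
K = np.exp(-sq / 2.0)               # Gaussian kernel matrix, positive semidefinite

lam, U = np.linalg.eigh(K)          # K = U diag(lam) U^T
lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off
Phi = np.diag(np.sqrt(lam)) @ U.T   # column i is phi(x_i)

print(np.allclose(Phi.T @ Phi, K))  # inner products reproduce K exactly on D
```

As the note says, this map is only defined on the $n$ points of $\mathbf{D}$; there is no formula for a new $\mathbf{x}$.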

5.2 Vector Kernels

These are kernels $\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$.

A typical vector kernel: the polynomial kernel

$$\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^d,\quad K_{q}(\mathbf{x},\mathbf{y})=\phi(\mathbf{x})^{T}\phi(\mathbf{y})=\left(\mathbf{x}^{T}\mathbf{y}+c\right)^{q},\quad c\ge 0$$

If $c=0$ the kernel is homogeneous; otherwise it is inhomogeneous.

Problem: construct a kernel map $\phi:\mathbb{R}^d\to\mathcal{F}$ such that $K_{q}(\mathbf{x},\mathbf{y})=\phi(\mathbf{x})^{T}\phi(\mathbf{y})$.

Note: for $q=1$, $c=0$, we have $\phi(\mathbf{x})=\mathbf{x}$.

Example: $q=2$, $d=2$.
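A quick numeric check of this $q=2$, $d=2$ homogeneous case ($c=0$), using the map $\phi(\mathbf{a})=(a_1^2,a_2^2,\sqrt{2}a_1a_2)^T$ from Example 5.2 (the test vectors are made up):

```python
import numpy as np

# For q=2, d=2, c=0: phi(x)^T phi(y) = (x^T y)^2.
def phi(a):
    return np.array([a[0]**2, a[1]**2, np.sqrt(2) * a[0] * a[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y), (x @ y) ** 2)   # both equal 1.0 here
```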

The Gaussian kernel: read on your own.

5.3 Basic Kernel Operations in Feature Space

Let $\phi:\mathcal{I}\to\mathcal{F}$, $K:\mathcal{I}\times\mathcal{I}\to\mathbb{R}$.

  • Vector length: $\|\phi(\mathbf{x})\|^{2}=\phi(\mathbf{x})^{T}\phi(\mathbf{x})=K(\mathbf{x},\mathbf{x})$

  • Distance:
    $$\begin{aligned}\left\|\phi\left(\mathbf{x}_{i}\right)-\phi\left(\mathbf{x}_{j}\right)\right\|^{2}&=\left\|\phi\left(\mathbf{x}_{i}\right)\right\|^{2}+\left\|\phi\left(\mathbf{x}_{j}\right)\right\|^{2}-2\phi\left(\mathbf{x}_{i}\right)^{T}\phi\left(\mathbf{x}_{j}\right)\\ &=K\left(\mathbf{x}_{i},\mathbf{x}_{i}\right)+K\left(\mathbf{x}_{j},\mathbf{x}_{j}\right)-2 K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)\end{aligned}$$

Conversely,
$$2K\left(\mathbf{x},\mathbf{y}\right)=\left\|\phi\left(\mathbf{x}\right)\right\|^{2}+\left\|\phi\left(\mathbf{y}\right)\right\|^{2}-\left\|\phi\left(\mathbf{x}\right)-\phi\left(\mathbf{y}\right)\right\|^{2}$$

so $K$ measures the similarity between $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$.

  • Mean: $\boldsymbol{\mu}_{\phi}=\frac{1}{n}\sum_{i=1}^{n}\phi\left(\mathbf{x}_{i}\right)$
    $$\begin{aligned}\left\|\boldsymbol{\mu}_{\phi}\right\|^{2}&=\boldsymbol{\mu}_{\phi}^{T}\boldsymbol{\mu}_{\phi}=\left(\frac{1}{n}\sum_{i=1}^{n}\phi\left(\mathbf{x}_{i}\right)\right)^{T}\left(\frac{1}{n}\sum_{j=1}^{n}\phi\left(\mathbf{x}_{j}\right)\right)\\ &=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\phi\left(\mathbf{x}_{i}\right)^{T}\phi\left(\mathbf{x}_{j}\right)=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)\end{aligned}$$

  • Total variance: $\sigma_{\phi}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left\|\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}\right\|^{2}$. For each $\mathbf{x}_{i}$,
    $$\begin{aligned}\left\|\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}\right\|^{2}&=\left\|\phi\left(\mathbf{x}_{i}\right)\right\|^{2}-2\phi\left(\mathbf{x}_{i}\right)^{T}\boldsymbol{\mu}_{\phi}+\left\|\boldsymbol{\mu}_{\phi}\right\|^{2}\\ &=K\left(\mathbf{x}_{i},\mathbf{x}_{i}\right)-\frac{2}{n}\sum_{j=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\frac{1}{n^{2}}\sum_{s=1}^{n}\sum_{t=1}^{n} K\left(\mathbf{x}_{s},\mathbf{x}_{t}\right)\end{aligned}$$

    Hence
    $$\begin{aligned}\sigma_{\phi}^{2}&=\frac{1}{n}\sum_{i=1}^{n}\left(K\left(\mathbf{x}_{i},\mathbf{x}_{i}\right)-\frac{2}{n}\sum_{j=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\frac{1}{n^{2}}\sum_{s=1}^{n}\sum_{t=1}^{n} K\left(\mathbf{x}_{s},\mathbf{x}_{t}\right)\right)\\ &=\frac{1}{n}\sum_{i=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{i}\right)-\frac{2}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\frac{1}{n^{2}}\sum_{s=1}^{n}\sum_{t=1}^{n} K\left(\mathbf{x}_{s},\mathbf{x}_{t}\right)\\ &=\frac{1}{n}\sum_{i=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{i}\right)-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)\end{aligned}$$

    Here $\frac{1}{n}\sum_{i=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{i}\right)$ is the average of the diagonal of $\mathbf{K}$, and $\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)$ is the average of all entries of $\mathbf{K}$.
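All four quantities above can be read off $\mathbf{K}$ without ever forming $\phi$. A sketch using a homogeneous quadratic kernel on made-up points (the data and kernel choice are assumptions for illustration):

```python
import numpy as np

# Feature-space quantities computed from K alone (the kernel trick).
X = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
K = (X @ X.T) ** 2                  # homogeneous quadratic kernel, one example
n = K.shape[0]

norm_sq_0 = K[0, 0]                               # ||phi(x_0)||^2
dist_sq_01 = K[0, 0] + K[1, 1] - 2 * K[0, 1]      # ||phi(x_0) - phi(x_1)||^2
mean_norm_sq = K.sum() / n**2                     # ||mu_phi||^2
total_var = np.trace(K) / n - K.sum() / n**2      # sigma_phi^2

print(norm_sq_0, dist_sq_01, mean_norm_sq, total_var)
```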

  • Centered kernel matrix: let $\hat{\phi}\left(\mathbf{x}_{i}\right)=\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}$.

    The centered kernel function is $\hat{K}\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)=\hat{\phi}\left(\mathbf{x}_{i}\right)^{T}\hat{\phi}\left(\mathbf{x}_{j}\right)$:
    $$\begin{aligned}\hat{K}\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)&=\left(\phi\left(\mathbf{x}_{i}\right)-\boldsymbol{\mu}_{\phi}\right)^{T}\left(\phi\left(\mathbf{x}_{j}\right)-\boldsymbol{\mu}_{\phi}\right)\\ &=\phi\left(\mathbf{x}_{i}\right)^{T}\phi\left(\mathbf{x}_{j}\right)-\phi\left(\mathbf{x}_{i}\right)^{T}\boldsymbol{\mu}_{\phi}-\phi\left(\mathbf{x}_{j}\right)^{T}\boldsymbol{\mu}_{\phi}+\boldsymbol{\mu}_{\phi}^{T}\boldsymbol{\mu}_{\phi}\\ &=K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\frac{1}{n}\sum_{k=1}^{n}\phi\left(\mathbf{x}_{i}\right)^{T}\phi\left(\mathbf{x}_{k}\right)-\frac{1}{n}\sum_{k=1}^{n}\phi\left(\mathbf{x}_{j}\right)^{T}\phi\left(\mathbf{x}_{k}\right)+\left\|\boldsymbol{\mu}_{\phi}\right\|^{2}\\ &=K\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)-\frac{1}{n}\sum_{k=1}^{n} K\left(\mathbf{x}_{i},\mathbf{x}_{k}\right)-\frac{1}{n}\sum_{k=1}^{n} K\left(\mathbf{x}_{j},\mathbf{x}_{k}\right)+\frac{1}{n^{2}}\sum_{s=1}^{n}\sum_{t=1}^{n} K\left(\mathbf{x}_{s},\mathbf{x}_{t}\right)\end{aligned}$$

    In matrix form,
    $$\begin{aligned}\hat{\mathbf{K}}&=\mathbf{K}-\frac{1}{n}\mathbf{1}_{n\times n}\mathbf{K}-\frac{1}{n}\mathbf{K}\mathbf{1}_{n\times n}+\frac{1}{n^{2}}\mathbf{1}_{n\times n}\mathbf{K}\mathbf{1}_{n\times n}\\ &=\left(\mathbf{I}-\frac{1}{n}\mathbf{1}_{n\times n}\right)\mathbf{K}\left(\mathbf{I}-\frac{1}{n}\mathbf{1}_{n\times n}\right)\end{aligned}$$
    Note that $\mathbf{1}_{n\times n}$ is the all-ones matrix: left-multiplying by it replaces each entry with the sum of its column, right-multiplying replaces each entry with the sum of its row, and multiplying on both sides replaces each entry with the sum of all entries.
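The matrix-form centering can be verified directly: with a linear kernel, $\phi$ is the identity, so $\hat{\mathbf{K}}$ must equal the Gram matrix of the explicitly centered data. A sketch with made-up points:

```python
import numpy as np

# Centering the kernel matrix: K_hat = (I - J/n) K (I - J/n), J the all-ones matrix.
X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
n = X.shape[0]
K = X @ X.T                          # linear kernel: phi(x) = x

C = np.eye(n) - np.ones((n, n)) / n  # centering matrix I - J/n
K_hat = C @ K @ C

Z = X - X.mean(axis=0)               # explicitly centered features
print(np.allclose(K_hat, Z @ Z.T))   # same inner products
```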

  • 归一化核矩阵

    $$K_{n}\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)=\frac{\phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right)}{\left\|\phi\left(\mathbf{x}_{i}\right)\right\| \cdot\left\|\phi\left(\mathbf{x}_{j}\right)\right\|}=\frac{K\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)}{\sqrt{K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right) \cdot K\left(\mathbf{x}_{j}, \mathbf{x}_{j}\right)}}$$

    Let $\mathbf{W}=\operatorname{diag}(\mathbf{K})=\left(\begin{array}{cccc} K\left(\mathbf{x}_{1}, \mathbf{x}_{1}\right) & 0 & \cdots & 0 \\ 0 & K\left(\mathbf{x}_{2}, \mathbf{x}_{2}\right) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & K\left(\mathbf{x}_{n}, \mathbf{x}_{n}\right) \end{array}\right)$; then $\mathbf{W}^{-1 / 2}\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)=\frac{1}{\sqrt{K\left(\mathbf{x}_{i}, \mathbf{x}_{i}\right)}}$ and

    $$\mathbf{K}_{n}=\mathbf{W}^{-1 / 2} \cdot \mathbf{K} \cdot \mathbf{W}^{-1 / 2}$$

    This follows from how left- and right-multiplication by a diagonal matrix scales the rows and columns, respectively.
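A sketch of this normalization in code (NumPy; it assumes $\mathbf{K}$ comes from an actual feature map, so its diagonal entries are positive):

```python
import numpy as np

def normalize_kernel(K):
    """K_n = W^{-1/2} K W^{-1/2} with W = diag(K): cosine similarity in feature space."""
    w = np.sqrt(np.diag(K))
    return K / np.outer(w, w)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))
K = X @ X.T
Kn = normalize_kernel(K)
assert np.allclose(np.diag(Kn), 1.0)        # every point has unit self-similarity
assert np.all(np.abs(Kn) <= 1 + 1e-12)      # Cauchy-Schwarz bound
```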

5.4 Kernels for Complex Objects

5.4.1 Spectrum Kernel for Strings

Consider a finite alphabet $\Sigma$ and define the $l$-spectrum feature map:
$$\phi_l: \Sigma^{*} \rightarrow \mathbb{R}^{|\Sigma|^l},\quad \forall \mathbf{x} \in \Sigma^{*},\ \phi_l(\mathbf{x})=\left ( \cdots,\#(\alpha),\cdots \right )^T$$
where $\#(\alpha)$ is the number of occurrences of the length-$l$ substring $\alpha$ in $\mathbf{x}$, with one coordinate per $\alpha \in \Sigma^l$.

The $l$-spectrum kernel is $K_l:\Sigma^{*} \times \Sigma^{*} \to \mathbb{R}$, $K_l (\mathbf{x},\mathbf{y})=\phi_l(\mathbf{x})^{T} \phi_l(\mathbf{y})$.

The (full) spectrum kernel accumulates the $l$-spectrum kernels over $l=0$ to $l=\infty$.
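The $l$-spectrum kernel can be computed without materializing the $|\Sigma|^l$-dimensional feature vector, via a sparse dot product over the substrings that actually occur (a sketch; the function names are my own):

```python
from collections import Counter

def spectrum_counts(s, l):
    """phi_l(s) as a sparse vector: counts of every length-l substring of s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

def spectrum_kernel(x, y, l):
    """K_l(x, y) = phi_l(x)^T phi_l(y), computed sparsely."""
    cx, cy = spectrum_counts(x, l), spectrum_counts(y, l)
    return sum(cnt * cy[sub] for sub, cnt in cx.items())

# phi_2("abab") = {ab: 2, ba: 1}, phi_2("abba") = {ab: 1, bb: 1, ba: 1}
assert spectrum_kernel("abab", "abba", 2) == 2 * 1 + 1 * 1   # = 3
assert spectrum_kernel("abab", "abab", 2) == 2 * 2 + 1 * 1   # = ||phi_2("abab")||^2 = 5
```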

5.4.2 Diffusion Kernels on Graph Vertices

  • Graph: a graph $G=(V,E)$ is a pair of sets, where $V=\{v_1,\cdots,v_n\}$ is the vertex set and $E=\{(v_i,v_j)\}$ is the edge set. Here we consider only undirected simple graphs (no self-loops).

  • Adjacency matrix: $A(G):=[A_{ij}]_{n\times n}$, where $A_{ij}=\left\{\begin{matrix} 1, & (v_i,v_j)\in E\\ 0, & (v_i,v_j) \notin E \end{matrix}\right.$

  • Degree matrix: $\Delta (G):=\operatorname{diag}(d_1,\cdots,d_n)$, where $d_i$ is the degree of vertex $v_i$, i.e. the number of edges incident to $v_i$.

  • Laplacian matrix: $L(G):=\Delta(G)-A(G)$

    Negated Laplacian matrix: $-L(G)=A(G)-\Delta(G)$

    All of these matrices are real and symmetric.

A common symmetric similarity matrix $\mathbf{S}$ for a graph is $A(G)$, $L(G)$, or $-L(G)$.

Question: how can we define a kernel on the graph vertices? ($\mathbf{S}$ is not necessarily positive semidefinite.)

- Power kernel

Take $\mathbf{S}^t$ as the kernel matrix, where $\mathbf{S}$ is symmetric and $t$ is a positive integer.

Consider $\mathbf{S}^2$: $\mathbf{S}^2(x_i,x_j)=\sum\limits_{k=1}^{n}S_{ik}S_{kj}$

This formula gives the meaning of $\mathbf{S}^2$ (and of $\mathbf{S}^l$): it aggregates the paths of length $2$ (resp. $l$) between two vertices, and thus describes vertex similarity.

Consider the eigenvalues of $\mathbf{S}^l$: let $\lambda_1,\cdots,\lambda_n \in \mathbb{R}$ be the eigenvalues of $\mathbf{S}$. Then
$$\mathbf{S}=\mathbf{U}\left(\begin{array}{cccc} \lambda_{1} & 0 & \cdots & 0 \\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{n} \end{array}\right)\mathbf{U}^{T}=\sum_{i=1}^{n} \lambda_{i}\mathbf{u}_{i} \mathbf{u}_{i}^{T}$$
where $\mathbf{U}=\left(\begin{array}{cccc} \mid & \mid & & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} & \cdots & \mathbf{u}_{n} \\ \mid & \mid & & \mid \end{array}\right)$ is the orthogonal matrix whose columns are the corresponding eigenvectors, and
$$\mathbf{S}^l=\mathbf{U}\left(\begin{array}{cccc} \lambda_{1}^l & 0 & \cdots & 0 \\ 0 & \lambda_{2}^l & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{n}^l \end{array}\right)\mathbf{U}^{T}$$
so $\lambda_1^l,\cdots,\lambda_n^l$ are the eigenvalues of $\mathbf{S}^l$.

Hence if $l$ is even, all $\lambda_i^l \ge 0$ and $\mathbf{S}^l$ is positive semidefinite.
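A small check on the path graph $v_1 - v_2 - v_3 - v_4$ with $\mathbf{S}=A(G)$ (a NumPy sketch; the graph is my own toy example): the adjacency matrix itself has a negative eigenvalue, while $\mathbf{S}^2$ both counts length-2 walks and is positive semidefinite.

```python
import numpy as np

# S = adjacency matrix of the path graph v1 - v2 - v3 - v4
S = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])

S2 = S @ S
assert S2[0, 2] == 1                           # exactly one length-2 walk v1 -> v2 -> v3
assert S2[1, 1] == 2                           # (S^2)_ii = degree of v_i

assert np.linalg.eigvalsh(S).min() < 0         # S is not PSD ...
assert np.linalg.eigvalsh(S2).min() >= -1e-12  # ... but an even power always is
```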

- Exponential diffusion kernel

Take $\mathbf{K}:=e^{\beta \mathbf{S}}$ as the kernel matrix, where $\beta >0$ is a damping coefficient. (Taylor expansion: $e^{\beta x}=\sum\limits_{l=0}^{\infty}\frac{1}{l!}\beta^{l}x^l$)
$$\begin{aligned} e^{\beta \mathbf{S}} &=\sum_{l=0}^{\infty} \frac{1}{l !} \beta^{l} \mathbf{S}^{l} \\ &=\mathbf{I}+\beta \mathbf{S}+\frac{1}{2 !} \beta^{2} \mathbf{S}^{2}+\frac{1}{3 !} \beta^{3} \mathbf{S}^{3}+\cdots\\ &=\left(\sum_{i=1}^{n} \mathbf{u}_{i} \mathbf{u}_{i}^{T}\right)+\left(\sum_{i=1}^{n} \mathbf{u}_{i} \beta \lambda_{i} \mathbf{u}_{i}^{T}\right)+\left(\sum_{i=1}^{n} \mathbf{u}_{i} \frac{1}{2 !} \beta^{2} \lambda_{i}^{2} \mathbf{u}_{i}^{T}\right)+\cdots \\ &=\sum_{i=1}^{n} \mathbf{u}_{i}\left(1+\beta \lambda_{i}+\frac{1}{2 !} \beta^{2} \lambda_{i}^{2}+\cdots\right) \mathbf{u}_{i}^{T} \\ &=\sum_{i=1}^{n} \mathbf{u}_{i} e ^{\beta \lambda_{i}} \mathbf{u}_{i}^{T} \\ &=\mathbf{U}\left(\begin{array}{cccc} e ^{\beta \lambda_{1}} & 0 & \cdots & 0 \\ 0 & e ^{\beta \lambda_{2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & \cdots & e ^{\beta \lambda_{n}} \end{array}\right) \mathbf{U}^{T} \end{aligned}$$
The eigenvalues $e ^{\beta \lambda_{1}},\cdots,e ^{\beta \lambda_{n}}$ of $\mathbf{K}$ are all positive, so $\mathbf{K}$ is positive semidefinite.
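The derivation suggests computing $e^{\beta\mathbf{S}}$ directly from the eigendecomposition; the sketch below does that and cross-checks it against a truncated Taylor series (NumPy; the graph is my own toy example):

```python
import numpy as np

def exp_diffusion_kernel(S, beta):
    """K = U diag(e^{beta*lam}) U^T for symmetric S = U diag(lam) U^T."""
    lam, U = np.linalg.eigh(S)
    return (U * np.exp(beta * lam)) @ U.T

S = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = exp_diffusion_kernel(S, beta=0.5)

# Cross-check against partial sums of the Taylor series of e^{beta*S}
approx = np.eye(3)
term = np.eye(3)
for l in range(1, 30):
    term = term @ (0.5 * S) / l              # term = (beta*S)^l / l!
    approx = approx + term
assert np.allclose(K, approx)
assert np.linalg.eigvalsh(K).min() > 0       # eigenvalues e^{beta*lam_i} are all positive
```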

- Von Neumann diffusion kernel

Take $\mathbf{K}=\sum\limits_{l=0}^{\infty} \beta^l \mathbf{S}^l$ as the kernel matrix. Note that
$$\begin{aligned} \mathbf{K} &=\mathbf{I}+\beta \mathbf{S}+\beta^{2} \mathbf{S}^{2}+\beta^{3} \mathbf{S}^{3}+\cdots \\ &=\mathbf{I}+\beta \mathbf{S}\left(\mathbf{I}+\beta \mathbf{S}+\beta^{2} \mathbf{S}^{2}+\cdots\right) \\ &=\mathbf{I}+\beta \mathbf{S} \mathbf{K} \end{aligned}$$

$$\begin{aligned} \mathbf{K}-\beta \mathbf{S} \mathbf{K} &=\mathbf{I} \\ (\mathbf{I}-\beta \mathbf{S}) \mathbf{K} &=\mathbf{I} \\ \mathbf{K} &=(\mathbf{I}-\beta \mathbf{S})^{-1} \end{aligned}$$
provided $\mathbf{I}-\beta \mathbf{S}$ is invertible. Then
$$\begin{aligned} \mathbf{K} &=\left(\mathbf{U} \mathbf{U}^{T}-\mathbf{U}(\beta \mathbf{\Lambda}) \mathbf{U}^{T}\right)^{-1} \\ &=\left(\mathbf{U}(\mathbf{I}-\beta \mathbf{\Lambda}) \mathbf{U}^{T}\right)^{-1} \\ &=\mathbf{U}(\mathbf{I}-\beta \mathbf{\Lambda})^{-1} \mathbf{U}^{T}\\ &=\mathbf{U}\left(\begin{array}{cccc} \frac{1}{1-\beta\lambda_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{1-\beta\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ 0 & 0 & \cdots & \frac{1}{1-\beta\lambda_n} \end{array}\right) \mathbf{U}^{T} \end{aligned}$$
For $\mathbf{K}$ to be positive semidefinite we therefore need, for every $i$:
$$\begin{aligned} \left(1-\beta \lambda_{i}\right)^{-1} & \geq 0 \\ 1-\beta \lambda_{i} & \geq 0 \\ \beta \lambda_{i} & \leq 1 \end{aligned}$$
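A sketch of the von Neumann kernel with the damping factor chosen so that $\beta\,\rho(\mathbf{S})<1$, which guarantees both invertibility and convergence of the geometric series (NumPy; the example graph is my own):

```python
import numpy as np

def von_neumann_kernel(S, beta):
    """K = (I - beta*S)^{-1} = sum_l beta^l S^l, valid when beta * rho(S) < 1."""
    return np.linalg.inv(np.eye(S.shape[0]) - beta * S)

S = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
beta = 0.5 / np.abs(np.linalg.eigvalsh(S)).max()   # 0.5 / spectral radius
K = von_neumann_kernel(S, beta)

# Check against partial sums of I + beta*S + beta^2*S^2 + ...
approx = np.zeros_like(S)
term = np.eye(3)
for _ in range(200):
    approx = approx + term
    term = term @ (beta * S)
assert np.allclose(K, approx)
assert np.linalg.eigvalsh(K).min() > 0   # here beta*lam_i <= 1/2 < 1 for all i
```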

Chapter 7: Dimensionality Reduction

PCA: Principal Component Analysis

7.1 Background

$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1 d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2 d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n 1} & x_{n 2} & \cdots & x_{n d} \end{array}\right)$$

Objects: $\mathbf{x}_{1},\cdots,\mathbf{x}_n \in \mathbb{R}^d$ (the rows of $\mathbf{D}$, taken as column vectors). For any $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{x}=(x_1,\cdots,x_d)^T= \sum\limits_{i=1}^{d}x_i \mathbf{e}_i$

where $\mathbf{e}_i=(0,\cdots,1,\cdots,0)^T\in\mathbb{R}^d$ is the $i$-th standard basis vector (1 in coordinate $i$).

Suppose $\{\mathbf{u}_i\}_{i=1}^d$ is another orthonormal basis: $\mathbf{x}=\sum\limits_{i=1}^{d}a_i \mathbf{u}_i,\ a_i \in \mathbb{R}$, with $\mathbf{u}_i^T \mathbf{u}_j =\left\{\begin{matrix} 1, & i=j\\ 0, & i\ne j \end{matrix}\right.$

$$\forall r:1\le r\le d,\quad \mathbf{x}=\underbrace{a_1 \mathbf{u}_1+\cdots+a_r \mathbf{u}_r}_{\text{projection}}+ \underbrace{a_{r+1} \mathbf{u}_{r+1}+\cdots+a_d \mathbf{u}_d}_{\text{error}}$$

The first $r$ terms give the projection; the remaining terms are the projection error.

Goal: for a given $\mathbf{D}$, find the optimal basis $\{\mathbf{u}_i\}_{i=1}^d$ such that the projection of $\mathbf{D}$ onto the subspace spanned by its first $r$ vectors is the "best approximation" of $\mathbf{D}$, i.e. the projection error is minimal.

7.2 Principal Component Analysis

7.2.1 Best Line Approximation

(first principal component, $r=1$)

Goal: find $\mathbf{u}_1$, written simply as $\mathbf{u}=(u_1,\cdots,u_d)^T$.

Assumptions: $\|\mathbf{u}\|^2=\mathbf{u}^T\mathbf{u}=1$, and the data are centered: $\hat{\boldsymbol{\mu}}=\frac{1}{n} \sum\limits_{i=1}^n\mathbf{x}_i=\mathbf{0}\in \mathbb{R}^{d}$

For each $\mathbf{x}_i\ (i=1,\cdots,n)$, the projection of $\mathbf{x}_i$ onto the direction $\mathbf{u}$ is:
$$\mathbf{x}_{i}^{\prime}=\left(\frac{\mathbf{u}^{T} \mathbf{x}_{i}}{\mathbf{u}^{T} \mathbf{u}}\right) \mathbf{u}=\left(\mathbf{u}^{T} \mathbf{x}_{i}\right) \mathbf{u}=a_{i} \mathbf{u},\quad a_{i}=\mathbf{u}^{T} \mathbf{x}_{i}$$
$\hat{\boldsymbol{\mu}}=\mathbf{0}\Rightarrow$ the projection of $\hat{\boldsymbol{\mu}}$ onto $\mathbf{u}$ is $0$, and the mean of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ is $\mathbf{0}$:

$$Proj(mean(D))=mean(Proj(D))$$

Consider the sample variance of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ along $\mathbf{u}$ (using $\mu_{\mathbf{u}}=0$):
$$\begin{aligned} \sigma_{\mathbf{u}}^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(a_{i}-\mu_{\mathbf{u}}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{u}^{T} \mathbf{x}_{i}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{u}^{T}\left(\mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T}\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \end{aligned}$$
where $\mathbf{\Sigma}$ is the sample covariance matrix.

Objective:
$$\begin{array}{ll} \max\limits_{\mathbf{u}} & \mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \\ \text{s.t.} & \mathbf{u}^T\mathbf{u}-1=0 \end{array}$$
Applying the method of Lagrange multipliers:
$$\max \limits_{\mathbf{u}} J(\mathbf{u})=\mathbf{u}^{T} \Sigma \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)$$
Taking the partial derivative:
$$\begin{aligned} \frac{\partial}{\partial \mathbf{u}} J(\mathbf{u}) &=\mathbf{0} \\ \frac{\partial}{\partial \mathbf{u}}\left(\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)\right) &=\mathbf{0} \\ 2 \mathbf{\Sigma} \mathbf{u}-2 \lambda \mathbf{u} &=\mathbf{0} \\ \mathbf{\Sigma} \mathbf{u} &=\lambda \mathbf{u} \end{aligned}$$
Note that $\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}=\mathbf{u}^{T} \lambda \mathbf{u}=\lambda$.

Hence the optimum is attained by taking $\lambda$ to be the largest eigenvalue of $\mathbf{\Sigma}$ and $\mathbf{u}$ a corresponding unit eigenvector.

Question: does the $\mathbf{u}$ that maximizes $\sigma_{\mathbf{u}}^{2}$ also minimize the projection error?

Define the mean squared error (MSE):
$$\begin{aligned} M S E(\mathbf{u}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right)^{T}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} \mathbf{x}_{i}^{\prime}+\left(\mathbf{x}_{i}^{\prime}\right)^{T} \mathbf{x}_{i}^{\prime}\right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}+\left[(\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right]^{T} \left[ (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right] \right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{x}_{i}^{T} \mathbf{u}+(\mathbf{u}^{T} \mathbf{x}_{i})(\mathbf{x}_{i}^{T} \mathbf{u})\mathbf{u}^{T}\mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{x}_{i}\mathbf{x}_{i}^{T} \mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}\\ &= var(D)-\sigma_{\mathbf{u}}^{2} \end{aligned}$$
This shows that $var(D)=\sigma_{\mathbf{u}}^{2}+MSE(\mathbf{u})$. Since $var(D)$ is fixed, maximizing the projected variance and minimizing the MSE are the same problem.

Geometric meaning of $\mathbf{u}$: the direction of the line in $\mathbb{R}^d$ along which the projected data have maximum variance and, equivalently, minimum MSE.

$\mathbf{u}$ is called the first principal component.
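The identity $var(D)=\sigma_{\mathbf{u}}^{2}+MSE(\mathbf{u})$ and the eigenvalue characterization are easy to verify numerically (a NumPy sketch on synthetic centered data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)                      # center so that mu = 0

Sigma = X.T @ X / len(X)                    # sample covariance matrix
lam, U = np.linalg.eigh(Sigma)              # eigenvalues in ascending order
u = U[:, -1]                                # first principal component
var_u = u @ Sigma @ u
assert np.isclose(var_u, lam[-1])           # projected variance = largest eigenvalue

proj = np.outer(X @ u, u)                   # x_i' = (u^T x_i) u, row by row
mse = np.mean(np.sum((X - proj) ** 2, axis=1))
total_var = np.mean(np.sum(X ** 2, axis=1))
assert np.isclose(total_var, var_u + mse)   # var(D) = sigma_u^2 + MSE(u)
```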

7.2.2 Best 2-Dimensional Approximation

(second principal component, $r=2$)

Suppose $\mathbf{u}_1$ has been found, i.e. the eigenvector corresponding to the largest eigenvalue of $\mathbf{\Sigma}$.

Goal: find $\mathbf{u}_2$, written simply as $\mathbf{v}$, such that $\mathbf{v}^{T} \mathbf{u}_{1}=0,\ \mathbf{v}^{T} \mathbf{v} =1$.

Consider the variance of the projections of the $\mathbf{x}_{i}$ along $\mathbf{v}$:
$$\begin{array}{ll} \max\limits_{\mathbf{v}} & \sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} \\ \text{s.t.} & \mathbf{v}^T\mathbf{v}-1=0\\ & \mathbf{v}^{T} \mathbf{u}_{1}=0 \end{array}$$
Define: $J(\mathbf{v})=\mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v}-\alpha\left(\mathbf{v}^{T} \mathbf{v}-1\right)-\beta\left(\mathbf{v}^{T} \mathbf{u}_{1}-0\right)$

Taking the partial derivative with respect to $\mathbf{v}$:
$$2 \Sigma \mathbf{v}-2 \alpha \mathbf{v}-\beta \mathbf{u}_{1}=\mathbf{0}$$
Left-multiplying both sides by $\mathbf{u}_{1}^{T}$:
$$\begin{aligned} 2 \mathbf{u}_{1}^{T}\Sigma \mathbf{v}-2 \alpha \mathbf{u}_{1}^{T}\mathbf{v}-\beta \mathbf{u}_{1}^{T}\mathbf{u}_{1} &=0 \\ 2 \mathbf{u}_{1}^{T}\Sigma \mathbf{v}-\beta &= 0\\ 2 \mathbf{v}^{T}\Sigma \mathbf{u}_{1}-\beta &= 0\\ 2 \mathbf{v}^{T}\lambda_1 \mathbf{u}_{1}-\beta &= 0\\ \beta &= 0 \end{aligned}$$
Substituting back into the original equation:
$$2 \Sigma \mathbf{v}-2 \alpha \mathbf{v}=\mathbf{0}\\ \Sigma \mathbf{v}=\alpha \mathbf{v}$$
So $\mathbf{v}$ is also an eigenvector of $\mathbf{\Sigma}$.

Since $\sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} =\alpha$, $\alpha$ should be taken as the second-largest eigenvalue of $\mathbf{\Sigma}$, with $\mathbf{v}$ a corresponding unit eigenvector.

Question 1: do $\mathbf{v}$ (i.e. $\mathbf{u}_2$) and $\mathbf{u}_1$ together maximize the total variance of the projection of $D$ onto $span\{\mathbf{u}_1, \mathbf{u}_2 \}$?

$$\mathbf{x}_i=\underbrace{a_{i1} \mathbf{u}_1+a_{i2}\mathbf{u}_2}_{\text{projection}}+\cdots$$

The coordinates of the projection of $\mathbf{x}_i$ onto $span\{\mathbf{u}_1, \mathbf{u}_2 \}$ are $\mathbf{a}_{i}=(a_{i1},a_{i2})^T=(\mathbf{u}_1^{T}\mathbf{x}_i,\mathbf{u}_2^{T}\mathbf{x}_i)^{T}$.

Let $\mathbf{U}_{2}=\left(\begin{array}{cc} \mid & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} \\ \mid & \mid \end{array}\right)$; then $\mathbf{a}_{i}=\mathbf{U}_{2}^{T} \mathbf{x}_{i}$.

The total projected variance is:
$$\begin{aligned} \operatorname{var}(\mathbf{A}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{a}_{i}-\mathbf{0}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right)^{T}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left( \mathbf{u}_{1}\mathbf{u}_{1}^T + \mathbf{u}_{2}\mathbf{u}_{2}^T \right) \mathbf{x}_{i}\\ &=\mathbf{u}_{1}^T\mathbf{\Sigma} \mathbf{u}_{1} + \mathbf{u}_{2}^T\mathbf{\Sigma} \mathbf{u}_{2}\\ &= \lambda_1 +\lambda_2 \end{aligned}$$
Question 2: is the mean squared error minimal?

With $\mathbf{x}_{i}^{\prime}=\mathbf{U}_{2}\mathbf{U}_{2}^{T} \mathbf{x}_{i}$:
$$\begin{aligned} M S E &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2} - \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &= var(D) - \lambda_1 - \lambda_2 \end{aligned}$$
Conclusions:

  1. The sum of the top $r$ eigenvalues of $\mathbf{\Sigma}$, $\lambda_1+\cdots+\lambda_r\ (\lambda_1\ge\cdots\ge\lambda_r)$, gives the maximum total projected variance;
  2. $var(D)-\sum\limits_{i=1}^r \lambda_i$ gives the minimum MSE;
  3. the eigenvectors $\mathbf{u}_{1},\cdots,\mathbf{u}_{r}$ corresponding to $\lambda_1,\cdots,\lambda_r$ span the best $r$-dimensional subspace; they are the first $r$ principal components.

7.2.3 Generalization

Let $\Sigma_{d\times d}$ have eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, with the data centered. Then:

$\sum\limits_{i=1}^r\lambda_i$: the maximum total projected variance;

$var(D)-\sum\limits_{i=1}^r\lambda_i$: the minimum MSE.

In practice, to choose a suitable $r$, compare the ratio $\frac{\sum\limits_{i=1}^r\lambda_i}{var(D)}$ with a given threshold $\alpha$.

Algorithm 7.1 (PCA):

Input: $\mathbf{D}$, $\alpha$

Output: $A$ (the reduced-dimension data)

  1. $\boldsymbol{\mu} = \frac{1}{n}\sum\limits_{i=1}^n\mathbf{x}_i$;
  2. $\mathbf{Z}=\mathbf{D}-\mathbf{1}\cdot \boldsymbol{\mu} ^T$;
  3. $\mathbf{\Sigma}=\frac{1}{n}(\mathbf{Z}^T\mathbf{Z})$;
  4. $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \longleftarrow$ the eigenvalues of $\mathbf{\Sigma}$ (in descending order);
  5. $\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_d \longleftarrow$ the eigenvectors of $\mathbf{\Sigma}$ (orthonormal);
  6. compute $\frac{\sum\limits_{i=1}^r\lambda_i}{var(D)}$ and choose the smallest $r$ for which this ratio exceeds $\alpha$;
  7. $\mathbf{U}_r=(\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_r)$;
  8. $A=\{\mathbf{a}_i\mid\mathbf{a}_i=\mathbf{U}_r^T\mathbf{x}_i,\ i=1,\cdots,n\}$
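A direct NumPy transcription of Algorithm 7.1 (a sketch: following the centering in steps 1-2, I project the centered points in step 8, and `alpha` is the variance-fraction threshold; the test data are my own):

```python
import numpy as np

def pca(D, alpha):
    """Algorithm 7.1: rows of the returned A are the reduced coordinates a_i."""
    n = len(D)
    mu = D.mean(axis=0)                         # 1. mean
    Z = D - mu                                  # 2. centered data
    Sigma = Z.T @ Z / n                         # 3. covariance matrix
    lam, U = np.linalg.eigh(Sigma)              # 4./5. eigenvalues/eigenvectors
    lam, U = lam[::-1], U[:, ::-1]              #      reorder to descending
    ratio = np.cumsum(lam) / lam.sum()          # 6. var(D) = sum of all eigenvalues
    r = int(np.searchsorted(ratio, alpha)) + 1  #    smallest r with ratio >= alpha
    U_r = U[:, :r]                              # 7. basis of the r-dim subspace
    return Z @ U_r                              # 8. a_i = U_r^T (x_i - mu)

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 5)) @ np.diag([5., 2., 0.5, 0.1, 0.1])
A = pca(D, alpha=0.95)
assert A.shape == (100, 2)   # two directions capture >= 95% of the variance here
```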

7.3 Kernel PCA (Kernel Principal Component Analysis)

$\phi:\mathcal{I}\to \mathcal{F}\subseteq \mathbb{R}^d$

$K:\mathcal{I}\times\mathcal{I}\to \mathbb{R}$

$K(\mathbf{x}_i,\mathbf{x}_j)=\phi^T(\mathbf{x}_i)\phi(\mathbf{x}_j)$

Given: $\mathbf{K}=[K(\mathbf{x}_i,\mathbf{x}_j)]_{n\times n}$. Let $\mathbf{\Sigma}_{\phi}=\frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T$.

Objects: $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)\in \mathbb{R}^d$. Assume $\frac{1}{n}\sum\limits_{i=1}^{n}\phi(\mathbf{x}_i)=\mathbf{0}$, i.e. $\mathbf{K} \to \hat{\mathbf{K}}$ has already been centered.

Goal: find $\mathbf{u},\lambda$ s.t. $\mathbf{\Sigma}_{\phi}\mathbf{u}=\lambda\mathbf{u}$:
$$\begin{aligned} \frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)[\phi(\mathbf{x}_i)^T\mathbf{u}] &=\lambda\mathbf{u}\\ \sum\limits_{i=1}^n\left[\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}\right] \phi(\mathbf{x}_i)&=\mathbf{u}\\ \end{aligned}$$
That is, $\mathbf{u}$ is a linear combination of all the mapped data points.

Let $c_i=\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}$, so that $\mathbf{u}=\sum\limits_{i=1}^nc_i \phi(\mathbf{x}_i)$. Substituting back:
$$\begin{aligned} \left(\frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T}\right)\left(\sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{j}\right)\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(\phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \end{aligned}$$
Note that here $\mathbf{K}=\hat{\mathbf{K}}$ has already been centered.

For every $k\ (1\le k\le n)$, left-multiply both sides by $\phi(\mathbf{x}_{k})^T$:
$$\begin{aligned} \sum_{i=1}^{n}\left(\phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(K(\mathbf{x}_k, \mathbf{x}_i) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} K(\mathbf{x}_k, \mathbf{x}_i) \\ \end{aligned}$$
Write $\mathbf{K}_{i}=\left(K\left(\mathbf{x}_{i}, \mathbf{x}_{1}\right), K\left(\mathbf{x}_{i}, \mathbf{x}_{2}\right), \cdots, K\left(\mathbf{x}_{i}, \mathbf{x}_{n}\right)\right)^{T}$ (the $i$-th row of the kernel matrix, taken as a column vector, so $\mathbf{K}=\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}$) and $\mathbf{c}=(c_1,c_2,\cdots,c_n)^T$. Then:
$$\begin{aligned} \sum_{i=1}^{n}K(\mathbf{x}_k, \mathbf{x}_i) \mathbf{K}^T_i\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c},\quad k=1,2,\cdots,n \\ \mathbf{K}^T_k\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c}\\ \mathbf{K}^T_k\mathbf{K}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c} \end{aligned}$$
That is, $\mathbf{K}^2\mathbf{c}=n\lambda \mathbf{K}\mathbf{c}$.

Assuming $\mathbf{K}^{-1}$ exists:
$$\begin{aligned} \mathbf{K}^2\mathbf{c}&=n\lambda \mathbf{K}\mathbf{c}\\ \mathbf{K}\mathbf{c}&=n\lambda \mathbf{c}\\ \mathbf{K}\mathbf{c}&= \eta\mathbf{c},\quad \eta=n\lambda \end{aligned}$$
Conclusion: $\frac{\eta_1}{n}\ge\frac{\eta_2}{n}\ge\cdots\ge\frac{\eta_n}{n}$ give the projected variances of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ along the principal directions in feature space; the first $r$ directions capture total variance $\sum\limits_{i=1}^{r}\frac{\eta_i}{n}$, where $\eta_1\ge\eta_2\ge\cdots\ge\eta_n$ are the eigenvalues of $\mathbf{K}$.

Question: can we compute the projections of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ onto the principal directions (i.e. the reduced-dimension data)?

Let $\mathbf{u}_1,\cdots,\mathbf{u}_d$ be the eigenvectors of $\mathbf{\Sigma}_{\phi}$. Then $\phi(\mathbf{x}_j)=a_1\mathbf{u}_1+\cdots+a_d\mathbf{u}_d$, where
$$\begin{aligned} a_k &= \phi(\mathbf{x}_j)^T\mathbf{u}_k,\quad k=1,2,\cdots,d\\ &= \phi(\mathbf{x}_j)^T\sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} K(\mathbf{x}_j,\mathbf{x}_i) \end{aligned}$$

Algorithm 7.2 (Kernel PCA, $\mathcal{F}\subseteq \mathbb{R}^d$):

Input: $\mathbf{K}$, $\alpha$

Output: $A$ (the projected coordinates of the reduced-dimension data)

  1. $\hat{\mathbf{K}} :=\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) \mathbf{K}\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right)$

  2. $\eta_1,\eta_2,\cdots,\eta_d \longleftarrow$ the eigenvalues of $\hat{\mathbf{K}}$, keeping only the top $d$

  3. $\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_d \longleftarrow$ the eigenvectors of $\hat{\mathbf{K}}$ (orthonormal)

  4. $\mathbf{c}_i \leftarrow \frac{1}{\sqrt{\eta_i}}\cdot \mathbf{c}_i,\ i=1,\cdots,d$

  5. choose the smallest $r$ such that $\frac{\sum\limits_{i=1}^r\frac{\eta_i}{n}}{\sum\limits_{i=1}^d\frac{\eta_i}{n}}\ge \alpha$

  6. $\mathbf{C}_r=(\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_r)$

  7. $A=\{\mathbf{a}_i\mid\mathbf{a}_i=\mathbf{C}_r^T\mathbf{K}_i,\ i=1,\cdots,n\}$
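A NumPy sketch of Algorithm 7.2 (names and test data are my own). As a sanity check it uses a linear kernel, for which kernel PCA should reproduce ordinary PCA coordinates up to sign:

```python
import numpy as np

def kernel_pca(K, alpha):
    """Algorithm 7.2: row i of A holds the coordinates a_i = C_r^T K_i."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    K_hat = C @ K @ C                           # 1. center the kernel matrix
    eta, vecs = np.linalg.eigh(K_hat)           # 2./3. eigenpairs (ascending)
    eta, vecs = eta[::-1], vecs[:, ::-1]        #      reorder to descending
    keep = eta > 1e-10                          #      drop null directions
    eta, vecs = eta[keep], vecs[:, keep]
    vecs = vecs / np.sqrt(eta)                  # 4. scale c_i so that ||u_i|| = 1
    ratio = np.cumsum(eta) / eta.sum()          # 5. fraction of total variance
    r = int(np.searchsorted(ratio, alpha)) + 1
    C_r = vecs[:, :r]                           # 6.
    return K_hat @ C_r                          # 7. row i is C_r^T (row i of K_hat)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ np.diag([4., 1., 0.2, 0.1])
K = X @ X.T                                     # linear kernel
A = kernel_pca(K, alpha=0.8)

# Compare with ordinary PCA scores along the first principal component
Z = X - X.mean(axis=0)
lam, U = np.linalg.eigh(Z.T @ Z / len(X))
assert A.shape[1] == 1
assert np.allclose(np.abs(A[:, 0]), np.abs(Z @ U[:, -1]))
```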

Chapter 14: Hierarchical Clustering

14.1 Preliminaries

Def.1 Given a dataset $\mathbf{D}=\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\}\ (\mathbf{x}_i\in \mathbb{R}^d)$, a clustering of $\mathbf{D}$ is a partition $\mathcal{C}=\{C_1,C_2,\cdots,C_k \}$ s.t. $C_i\subseteq \mathbf{D}$, $C_i \cap C_j=\emptyset\ (i\ne j)$, $\cup_{i=1}^k C_i=\mathbf{D}$.

A clustering $\mathcal{A}=\{A_1,\cdots,A_r\}$ is said to be nested in a clustering $\mathcal{B}=\{B_1,\cdots,B_s\}$ if $r>s$ and for every $A_i \in \mathcal{A}$ there exists some $B_j \in \mathcal{B}$ such that $A_i \subseteq B_j$.

A hierarchical clustering of $\mathbf{D}$ is a sequence of nested clusterings $\mathcal{C}_1,\cdots,\mathcal{C}_n$, where $\mathcal{C}_1=\{ \{\mathbf{x}_1\},\{\mathbf{x}_2\},\cdots,\{\mathbf{x}_n\}\}$, $\mathcal{C}_n=\{\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\} \}$, and each $\mathcal{C}_t$ is nested in $\mathcal{C}_{t+1}$.

Def.2 The vertex set of the dendrogram of a hierarchical clustering consists of all clusters appearing in $\mathcal{C}_1,\cdots,\mathcal{C}_n$; there is an edge between $C_i$ and $C_j$ whenever $C_i \in \mathcal{C}_t$ and $C_j \in \mathcal{C}_{t+1}$ with $C_i \subseteq C_j$.

Facts:

  1. The dendrogram is a binary tree (not necessarily in general; we assume each step merges exactly two clusters), and a hierarchical clustering corresponds one-to-one to its dendrogram.
  2. With $n$ data points, the number of possible dendrograms is $(2n-3)!!$ (double factorial, $1\times 3 \times 5 \times \cdots$).
14.2 Agglomerative Hierarchical Clustering

Algorithm 14.1:

Input: $\mathbf{D}, k$

Output: $\mathcal{C}$

  1. $\mathcal{C} \leftarrow \{C_i=\{\mathbf{x}_i\}\mid\mathbf{x}_i \in \mathbf{D} \}$
  2. $\Delta \leftarrow \{\delta(\mathbf{x}_i,\mathbf{x}_j):\mathbf{x}_i,\mathbf{x}_j \in \mathbf{D} \}$
  3. repeat
  4. ​ find the closest pair $C_i,C_j \in \mathcal{C}$
  5. ​ $C_{ij}\leftarrow C_i \cup C_j$
  6. ​ $\mathcal{C}\leftarrow (\mathcal{C} \setminus \{C_i,C_j \}) \cup \{C_{ij}\}$
  7. ​ update the distance matrix $\Delta$ according to $\mathcal{C}$
  8. until $|\mathcal{C}|=k$

Question: how do we define/compute the distance between clusters, i.e. $\delta(C_i,C_j)$?

There are five common choices for $\delta(C_i,C_j)$:

  1. Single linkage: $\delta(C_i,C_j):= \min \{\delta(\mathbf{x},\mathbf{y}) \mid \mathbf{x} \in C_i, \mathbf{y} \in C_j\}$

  2. Complete linkage: $\delta(C_i,C_j):= \max \{\delta(\mathbf{x},\mathbf{y}) \mid \mathbf{x} \in C_i, \mathbf{y} \in C_j\}$

  3. Group average: $\delta(C_i,C_j):= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_j}\delta(\mathbf{x},\mathbf{y})}{n_i \cdot n_j},\ n_i=|C_i|,\ n_j=|C_j|$

  4. Mean distance: $\delta(C_i,C_j):= ||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2$, where $\boldsymbol{\mu}_i=\frac{1}{n_i}\sum\limits_{\mathbf{x} \in C_i}\mathbf{x}$, $\boldsymbol{\mu}_j=\frac{1}{n_j}\sum\limits_{\mathbf{y} \in C_j}\mathbf{y}$

  5. Minimum variance (Ward's method): for any cluster $C_i$, define the sum of squared errors $SSE_i= \sum\limits_{\mathbf{x} \in C_i} ||\mathbf{x}-\boldsymbol{\mu}_i|| ^2$

    For $C_i,C_j$, let $SSE_{ij}:=\sum\limits_{\mathbf{x} \in C_i\cup C_j} ||\mathbf{x}-\boldsymbol{\mu}_{ij}|| ^2$, where $\boldsymbol{\mu}_{ij}:=\frac{1}{n_i+n_j}\sum\limits_{\mathbf{x} \in C_i\cup C_j}\mathbf{x}$

    $\delta(C_i,C_j):=SSE_{ij}-SSE_i-SSE_j$
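A compact sketch of Algorithm 14.1 with single or complete linkage (for brevity it recomputes cluster distances from scratch each round rather than maintaining $\Delta$ incrementally; the data and names are my own):

```python
import numpy as np
from itertools import combinations

def agglomerative(D, k, linkage="single"):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(D))]              # C_i = {x_i}
    point_dist = lambda a, b: np.linalg.norm(D[a] - D[b])
    agg = min if linkage == "single" else max            # single vs complete linkage
    def cluster_dist(pair):
        i, j = pair
        return agg(point_dist(a, b) for a in clusters[i] for b in clusters[j])
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2), key=cluster_dist)
        clusters[i] += clusters.pop(j)                   # C_ij = C_i union C_j
    return clusters

D = np.array([[0.0], [0.1], [5.0], [5.2], [9.0]])
result = agglomerative(D, k=3)
assert sorted(map(sorted, result)) == [[0, 1], [2, 3], [4]]
```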

证明: δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninjμiμj2

简记: C i j : = C i ∪ C j , n i j : = n i + n j C_{ij}:=C_i\cup C_j,n_{ij}:=n_i+n_j Cij:=CiCj,nij:=ni+nj

注意: C i ∩ C j = ∅ C_i \cap C_j=\emptyset CiCj=,故 ∣ C i j ∣ = n i + n j |C_{ij}|=n_i+n_j Cij=ni+nj
δ ( C i , C j ) = ∑ z ∈ C i j ∥ z − μ i j ∥ 2 − ∑ x ∈ C i ∥ x − μ i ∥ 2 − ∑ y ∈ C j ∥ y − μ j ∥ 2 = ∑ z ∈ C i j z T z − n i j μ i j T μ i j − ∑ x ∈ C i x T x + n i μ i T μ i − ∑ y ∈ C j y T y + n j μ j T μ j = n i μ i T μ i + n j μ j T μ j − ( n i + n j ) μ i j T μ i j \begin{aligned} \delta\left(C_{i}, C_{j}\right) &=\sum_{\mathbf{z} \in C_{i j}}\left\|\mathbf{z}-\boldsymbol{\mu}_{i j}\right\|^{2}-\sum_{\mathbf{x} \in C_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}-\sum_{\mathbf{y} \in C_{j}}\left\|\mathbf{y}-\boldsymbol{\mu}_{j}\right\|^{2} \\ &=\sum_{\mathbf{z} \in C_{i j}} \mathbf{z}^{T} \mathbf{z}-n_{i j} \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}-\sum_{\mathbf{x} \in C_{i}} \mathbf{x}^{T} \mathbf{x}+n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-\sum_{\mathbf{y} \in C_{j}} \mathbf{y}^{T} \mathbf{y}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j} \\ &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j} \end{aligned} δ(Ci,Cj)=zCijzμij2xCixμi2yCjyμj2=zCijzTznijμijTμijxCixTx+niμiTμiyCjyTy+njμjTμj=niμiTμi+njμjTμj(ni+nj)μijTμij
注意到: μ i j = 1 n i j ∑ z ∈ C i j z = 1 n i + n j ( ∑ x ∈ C i x + ∑ y ∈ C j y ) = 1 n i + n j ( n i μ i + n j μ j ) \boldsymbol{\mu}_{i j}=\frac{1}{n_{ij}}\sum\limits_{\mathbf{z} \in C_{ij}} \mathbf{z}=\frac{1}{n_i+n_j}(\sum\limits_{\mathbf{x} \in C_{i}} \mathbf{x}+\sum\limits_{\mathbf{y} \in C_{j}} \mathbf{y})=\frac{1}{n_i+n_j}(n_i\boldsymbol{\mu}_{i}+n_j\boldsymbol{\mu}_{j}) μij=nij1zCijz=ni+nj1(xCix+yCjy)=ni+nj1(niμi+njμj)

故: μ i j T μ i j = 1 ( n i + n j ) 2 ( n i 2 μ i T μ i + 2 n i n j μ i T μ j + n j 2 μ j T μ j ) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}=\frac{1}{\left(n_{i}+n_{j}\right)^{2}}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) μijTμij=(ni+nj)21(ni2μiTμi+2ninjμiTμj+nj2μjTμj)
δ ( C i , C j ) = n i μ i T μ i + n j μ j T μ j − 1 ( n i + n j ) ( n i 2 μ i T μ i + 2 n i n j μ i T μ j + n j 2 μ j T μ j ) = n i ( n i + n j ) μ i T μ i + n j ( n i + n j ) μ j T μ j − n i 2 μ i T μ i − 2 n i n j μ i T μ j − n j 2 μ j T μ j n i + n j = n i n j ( μ i T μ i − 2 μ i T μ j + μ j T μ j ) n i + n j = ( n i n j n i + n j ) ∥ μ i − μ j ∥ 2 \begin{aligned} \delta\left(C_{i}, C_{j}\right) &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \mu_{j}^{T} \boldsymbol{\mu}_{j}-\frac{1}{\left(n_{i}+n_{j}\right)}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) \\ &=\frac{n_{i}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}-n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}}{n_{i}+n_{j}} \\ &=\frac{n_{i} n_{j}\left(\boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+\boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right)}{n_{i}+n_{j}} \\ &=\left(\frac{n_{i} n_{j}}{n_{i}+n_{j}}\right)\left\|\boldsymbol{\mu}_{i}-\boldsymbol{\mu}_{j}\right\|^{2} \end{aligned} δ(Ci,Cj)=niμiTμi+njμjTμj(ni+nj)1(ni2μiTμi+2ninjμiTμj+nj2μjTμj)=ni+njni(ni+nj)μiTμi+nj(ni+nj)μjTμjni2μiTμi2ninjμiTμjnj2μjTμj=ni+njninj(μiTμi2μiTμj+μjTμj)=(ni+njninj)μiμj2
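上面的恒等式可以用几行 NumPy 做数值验证(示意代码,随机玩具数据为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(0)
Ci = rng.normal(0.0, 1.0, size=(5, 3))   # 类 C_i:5 个 3 维点
Cj = rng.normal(3.0, 1.0, size=(8, 3))   # 类 C_j:8 个 3 维点

def sse(C):
    """类内平方误差和 Σ‖x − μ‖²"""
    return ((C - C.mean(axis=0)) ** 2).sum()

ni, nj = len(Ci), len(Cj)
mui, muj = Ci.mean(axis=0), Cj.mean(axis=0)

delta_def = sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)       # 按定义:SSE_ij − SSE_i − SSE_j
delta_closed = ni * nj / (ni + nj) * ((mui - muj) ** 2).sum()  # 封闭形式
assert np.isclose(delta_def, delta_closed)
```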
问题:如何快速计算算法14.1 第7步:更新矩阵?

Lance–Williams formula
δ ( C i j , C r ) = α i ⋅ δ ( C i , C r ) + α j ⋅ δ ( C j , C r ) + β ⋅ δ ( C i , C j ) + γ ⋅ ∣ δ ( C i , C r ) − δ ( C j , C r ) ∣ \begin{array}{r} \delta\left(C_{i j}, C_{r}\right)=\alpha_{i} \cdot \delta\left(C_{i}, C_{r}\right)+\alpha_{j} \cdot \delta\left(C_{j}, C_{r}\right)+ \\ \beta \cdot \delta\left(C_{i}, C_{j}\right)+\gamma \cdot\left|\delta\left(C_{i}, C_{r}\right)-\delta\left(C_{j}, C_{r}\right)\right| \end{array} δ(Cij,Cr)=αiδ(Ci,Cr)+αjδ(Cj,Cr)+βδ(Ci,Cj)+γδ(Ci,Cr)δ(Cj,Cr)

| Measure | $\alpha_i$ | $\alpha_j$ | $\beta$ | $\gamma$ |
| --- | --- | --- | --- | --- |
| 简单连接 | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $-\frac{1}{2}$ |
| 完全连接 | $\frac{1}{2}$ | $\frac{1}{2}$ | $0$ | $\frac{1}{2}$ |
| 组群平均 | $\frac{n_i}{n_i+n_j}$ | $\frac{n_j}{n_i+n_j}$ | $0$ | $0$ |
| 均值距离 | $\frac{n_i}{n_i+n_j}$ | $\frac{n_j}{n_i+n_j}$ | $-\frac{n_in_j}{(n_i+n_j)^2}$ | $0$ |
| 极小方差 | $\frac{n_i+n_r}{n_i+n_j+n_r}$ | $\frac{n_j+n_r}{n_i+n_j+n_r}$ | $-\frac{n_r}{n_i+n_j+n_r}$ | $0$ |

Proof:

  1. 简单连接
    $\begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \min \{\delta(\mathbf{x}, \mathbf{y})\mid\mathbf{x}\in C_{ij}, \mathbf{y} \in C_r\} \\ &= \min \{\delta(C_{i}, C_{r}), \delta(C_{j}, C_{r})\} \end{aligned}$
    又 $\min\{a,b\}=\frac{a+b-|a-b|}{2}$,$\max\{a,b\}=\frac{a+b+|a-b|}{2}$,取 $\min$ 即得 $\alpha_i=\alpha_j=\frac{1}{2},\beta=0,\gamma=-\frac{1}{2}$。

  2. 完全连接

    $\begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \max \{\delta(\mathbf{x}, \mathbf{y})\mid\mathbf{x}\in C_{ij}, \mathbf{y} \in C_r\} \\ &= \max \{\delta(C_{i}, C_{r}), \delta(C_{j}, C_{r})\} \end{aligned}$
    由 $\max\{a,b\}=\frac{a+b+|a-b|}{2}$ 即得 $\gamma=\frac{1}{2}$,其余系数与简单连接相同。

  3. 组群平均
    δ ( C i j , C r ) = ∑ x ∈ C i ∪ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = ∑ x ∈ C i ∑ y ∈ C r δ ( x , y ) + ∑ x ∈ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = n i n r δ ( C i , C r ) + n j n r δ ( C j , C r ) ( n i + n j ) ⋅ n r = n i δ ( C i , C r ) + n j δ ( C j , C r ) ( n i + n j ) \begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \frac{\sum\limits_{\mathbf{x} \in C_i\cup C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})+\sum\limits_{\mathbf{x} \in C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &=\frac{n_in_r\delta(C_i,C_r)+n_jn_r\delta(C_j,C_r)}{(n_i+n_j )\cdot n_r}\\ &=\frac{n_i\delta(C_i,C_r)+n_j\delta(C_j,C_r)}{(n_i+n_j )} \end{aligned} δ(Cij,Cr)=(ni+nj)nrxCiCjyCrδ(x,y)=(ni+nj)nrxCiyCrδ(x,y)+xCjyCrδ(x,y)=(ni+nj)nrninrδ(Ci,Cr)+njnrδ(Cj,Cr)=(ni+nj)niδ(Ci,Cr)+njδ(Cj,Cr)

  4. 均值距离:作业

  5. 极小方差

    基于均值距离的结论再代入 δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninjμiμj2
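以简单连接为例,Lance–Williams 更新公式也可以直接数值验证(示意代码,玩具数据为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(1)
Ci, Cj, Cr = (rng.normal(loc, 1.0, size=(4, 2)) for loc in (0.0, 2.0, 5.0))

def single_link(A, B):
    """簇间最小点对距离 min δ(x, y)"""
    return min(np.linalg.norm(a - b) for a in A for b in B)

d_ir, d_jr = single_link(Ci, Cr), single_link(Cj, Cr)
# 表中系数:α_i = α_j = 1/2, β = 0, γ = −1/2
lw = 0.5 * d_ir + 0.5 * d_jr - 0.5 * abs(d_ir - d_jr)
direct = single_link(np.vstack([Ci, Cj]), Cr)   # 合并后按定义重算
assert np.isclose(lw, direct)
```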

事实:算法14.1 的复杂度为 O ( n 2 log ⁡ n ) O(n^2\log n) O(n2logn)

Chapter 15:基于密度的聚类

适用数据类型:非凸形状的类(故又称非凸聚类);相比之下,K-means 适用于凸形状的类。

15.1 DBSCAN 算法

  • 定义记号:$\forall \mathbf{x}\in \mathbb{R}^d$,$N_{\epsilon}(\mathbf{x}):=\{\mathbf{y}\in \mathbb{R}^d\mid\delta(\mathbf{x},\mathbf{y})\le\epsilon \}$,其中 $\delta(\mathbf{x},\mathbf{y})=||\mathbf{x}-\mathbf{y}||$ 为欧氏距离(也可换用其他距离);数据集 $\mathbf{D}\subseteq \mathbb{R}^d$。

Def.1 m i n p t s ∈ N + minpts \in \mathbb{N}_+ minptsN+ 是用户定义的局部密度,如果 ∣ N ϵ ( x ) ∩ D ∣ ≥ m i n p t s |N_{\epsilon}(\mathbf{x})\cap\mathbf{D}|\ge minpts Nϵ(x)Dminpts ,则称 x \mathbf{x} x D \mathbf{D} D 核心点;如果 ∣ N ϵ ( x ) ∩ D ∣ < m i n p t s |N_{\epsilon}(\mathbf{x})\cap\mathbf{D}|< minpts Nϵ(x)D<minpts ,且 x ∈ N ϵ ( z ) \mathbf{x}\in N_{\epsilon}(\mathbf{z}) xNϵ(z) ,其中 z \mathbf{z} z D \mathbf{D} D 的核心点,则称 x \mathbf{x} x D \mathbf{D} D 的边缘点;如果 x \mathbf{x} x 既不是核心点又不是边缘点,则称 x \mathbf{x} x D \mathbf{D} D 的噪点。

Def.2 如果 x ∈ N ϵ ( y ) \mathbf{x}\in N_{\epsilon}(\mathbf{y}) xNϵ(y) y \mathbf{y} y 是核心点,则称 x \mathbf{x} x y \mathbf{y} y 是直接密度可达的。如果存在点列 x 0 , x 1 , ⋯   , x l \mathbf{x}_0,\mathbf{x}_1,\cdots,\mathbf{x}_l x0,x1,,xl,使得 x 0 = x , x l = y \mathbf{x}_0=\mathbf{x},\mathbf{x}_l=\mathbf{y} x0=x,xl=y,且 x i \mathbf{x}_{i} xi x i − 1 \mathbf{x}_{i-1} xi1 是直接密度可达,则称 x \mathbf{x} x y \mathbf{y} y 是密度可达。

Def.3 如果存在 z ∈ D \mathbf{z}\in \mathbf{D} zD,使得 x \mathbf{x} x y \mathbf{y} y z \mathbf{z} z 都是密度可达的,称 x \mathbf{x} x y \mathbf{y} y 是密度连通的。

Def.4 基于密度的聚类是指基数最大的密度连通集(即集合内任意两点都是密度连通)。

算法15.1 : DBSCAN ( O ( n 2 ) O(n^2) O(n2))

输入: D , ϵ , m i n p t s \mathbf{D}, \epsilon, minpts D,ϵ,minpts

输出: C , C o r e , B o r d e r , N o i s e \mathcal{C},Core,Border,Noise C,Core,Border,Noise

  1. C o r e ← ∅ Core \leftarrow \emptyset Core

  2. 对每一个 x i ∈ D \mathbf{x}_i\in \mathbf{D} xiD

    2.1 计算 N ϵ ( x i ) ( ⊆ D ) N_\epsilon(\mathbf{x}_i)(\subseteq \mathbf{D}) Nϵ(xi)(D)

    2.2 i d ( x i ) ← ∅ id(\mathbf{x}_i)\leftarrow \emptyset id(xi)

    2.3 如果 $|N_\epsilon(\mathbf{x}_i)|\ge minpts$,则 $Core\leftarrow Core \cup \{ \mathbf{x}_i\}$

  3. k ← 0 k\leftarrow 0 k0

  4. 对每一个 x i ∈ C o r e , s . t . i d ( x i ) = ∅ \mathbf{x}_i\in Core, s.t.id(\mathbf{x}_i)= \emptyset xiCore,s.t.id(xi)=,执行

    4.1 k ← k + 1 k\leftarrow k+1 kk+1

    4.2 i d ( x i ) ← k id(\mathbf{x}_i)\leftarrow k id(xi)k

    4.3 D e n s i t y C o n n e c t e d ( x i , k ) Density Connected (\mathbf{x}_i,k) DensityConnected(xi,k)

  5. C ← { C i } i = 1 k \mathcal{C}\leftarrow \{ C_i\}_{i=1}^k C{Ci}i=1k,其中 C i ← { x i ∈ D ∣ i d ( x i ) = i } C_i\leftarrow \{\mathbf{x}_i \in \mathbf{D} |id(\mathbf{x}_i)=i\} Ci{xiDid(xi)=i}

  6. N o i s e ← { x i ∈ D ∣ i d ( x i ) = ∅ } Noise \leftarrow \{\mathbf{x}_i \in \mathbf{D} |id(\mathbf{x}_i)=\emptyset\} Noise{xiDid(xi)=}

  7. B o r d e r ← D ∖ { C o r e ∪ N o i s e } Border\leftarrow \mathbf{D}\setminus \{Core\cup Noise \} BorderD{CoreNoise}

  8. return C , C o r e , B o r d e r , N o i s e \mathcal{C},Core,Border,Noise C,Core,Border,Noise

D e n s i t y C o n n e c t e d ( x i , k ) Density Connected (\mathbf{x}_i,k) DensityConnected(xi,k)

  1. 对于每一个 $\mathbf{y} \in N_\epsilon(\mathbf{x}_i) \setminus \{\mathbf{x}_i\}$:

    1.1 i d ( y ) ← k id(\mathbf{y})\leftarrow k id(y)k

    1.2 如果 y ∈ C o r e \mathbf{y}\in Core yCore,则 D e n s i t y C o n n e c t e d ( y , k ) Density Connected (\mathbf{y},k) DensityConnected(y,k)
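算法 15.1 与 DensityConnected 可以写成如下简化的 Python 实现(示意代码;玩具数据、显式栈以及"已标号的核心点不再重复扩展"的终止细节均为笔者补充):

```python
import numpy as np

def dbscan(D, eps, minpts):
    """算法 15.1 的迭代式实现:返回 (簇标号 ids, Core, Border, Noise)。"""
    n = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)  # n×n 距离矩阵,O(n²)
    neigh = [np.flatnonzero(dist[i] <= eps) for i in range(n)]     # N_ε(x_i)(含自身)
    core = {i for i in range(n) if len(neigh[i]) >= minpts}
    ids = [None] * n
    k = 0
    for i in sorted(core):
        if ids[i] is not None:            # 该核心点已并入某个簇
            continue
        k += 1
        ids[i] = k
        stack = [i]                       # DensityConnected 的显式栈版本,避免深递归
        while stack:
            x = stack.pop()
            for y in neigh[x]:
                if ids[y] is None:        # 未标号的点并入当前簇
                    ids[y] = k
                    if y in core:         # 核心点继续向外扩展
                        stack.append(y)
    noise = [i for i in range(n) if ids[i] is None]
    border = [i for i in range(n) if ids[i] is not None and i not in core]
    return ids, core, border, noise

# 玩具数据:两团点 + 一个离群点
D = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [8, 8], [8, 9], [9, 8], [20, 20]], dtype=float)
ids, core, border, noise = dbscan(D, eps=1.5, minpts=3)
# 得到两个簇,[20, 20] 被判为噪点
```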

Remark:DBSCAN 对 ε \varepsilon ε 敏感: ε \varepsilon ε 过小,稀疏的类可能被认作噪点; ε \varepsilon ε 过大,稠密的类可能无法区分。

15.2 密度估计函数(DEF)

∀ z ∈ R d \forall \mathbf{z}\in \mathbb{R}^d zRd,定义 K ( z ) = 1 ( 2 π ) d / 2 e − z T z 2 K(\mathbf{z})=\frac{1}{(2\pi)^{d/2}}e^{-\frac{\mathbf{z}^T\mathbf{z}}{2}} K(z)=(2π)d/21e2zTz ∀ x ∈ R d , f ^ ( x ) : = 1 n h d ∑ i = 1 n K ( x − x i h ) \forall \mathbf{x}\in \mathbb{R}^d,\hat{f}(\mathbf{x}):=\frac{1}{nh^d}\sum\limits_{i=1}^{n}K(\frac{\mathbf{x}-\mathbf{x}_i}{h}) xRd,f^(x):=nhd1i=1nK(hxxi)

其中 h > 0 h>0 h>0 是用户指定的步长, { x 1 , ⋯   , x n } \{\mathbf{x}_1,\cdots,\mathbf{x}_n\} {x1,,xn} 是给定的数据集
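按定义,$\hat{f}$ 可以直接向量化实现(示意代码;样本与带宽均为笔者假设):

```python
import numpy as np

def f_hat(x, X, h):
    """高斯核密度估计:f̂(x) = 1/(n h^d) · Σ_i K((x − x_i)/h)"""
    n, d = X.shape
    Z = (x - X) / h                                           # 第 i 行为 (x − x_i)/h
    K = np.exp(-0.5 * np.sum(Z * Z, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)

# 一维标准正态样本上,f̂(0) 应接近真实密度 φ(0) ≈ 0.399
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 1))
est = f_hat(np.zeros(1), X, h=0.3)
```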

15.3 DENCLUE

Def.1 $\mathbf{x}^*\in \mathbb{R}^d$ 是密度吸引子,如果它对应概率密度函数 $f$ 的一个局部极大值点。(实际中 PDF 一般未知,故用 $\hat{f}$ 代替)

x ∗ ∈ R d \mathbf{x}^*\in \mathbb{R}^d xRd x ∈ R d \mathbf{x}\in \mathbb{R}^d xRd 的密度吸引子,如果存在 x 0 , x 1 , … , x m \mathbf{x}_0,\mathbf{x}_1,\dots,\mathbf{x}_m x0,x1,,xm,使得 x 0 = x , ∣ ∣ x m − x ∗ ∣ ∣ ≤ ϵ \mathbf{x}_0=\mathbf{x},||\mathbf{x}_m-\mathbf{x}^*||\le\epsilon x0=x,xmxϵ,且 x t + 1 = x t + δ ⋅ ∇ f ^ ( x t ) ( 1 ) \mathbf{x}_{t+1}=\mathbf{x}_{t}+\delta \cdot \nabla \hat{f}(\mathbf{x}_{t})\quad (1) xt+1=xt+δf^(xt)(1)

其中 ϵ , δ > 0 \epsilon,\delta >0 ϵ,δ>0 是用户定义的误差及步长, f ^ \hat{f} f^ 是DEF。

  • 更高效的迭代公式:

动机:当 $\mathbf{x}$ 靠近 $\mathbf{x}^*$ 时,由于 $\nabla \hat{f}(\mathbf{x}^*)=\mathbf{0}$,梯度趋于零,迭代公式 (1) 的步长随之变小,收敛缓慢。

∇ f ^ ( x ) = ∂ ∂ x f ^ ( x ) = 1 n h d ∑ i = 1 n ∂ ∂ x K ( x − x i h ) \nabla \hat{f}(\mathbf{x})=\frac{\partial}{\partial \mathbf{x}} \hat{f}(\mathbf{x})=\frac{1}{n h^{d}} \sum\limits_{i=1}^{n} \frac{\partial}{\partial \mathbf{x}} K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right) f^(x)=xf^(x)=nhd1i=1nxK(hxxi)

∂ ∂ x K ( z ) = ( 1 ( 2 π ) d / 2 exp ⁡ { − z T z 2 } ) ⋅ − z ⋅ ∂ z ∂ x = K ( z ) ⋅ − z ⋅ ∂ z ∂ x \begin{aligned} \frac{\partial}{\partial \mathbf{x}} K(\mathbf{z}) &=\left(\frac{1}{(2 \pi)^{d / 2}} \exp \left\{-\frac{\mathbf{z}^{T} \mathbf{z}}{2}\right\}\right) \cdot-\mathbf{z} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \\ &=K(\mathbf{z}) \cdot-\mathbf{z} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \end{aligned} xK(z)=((2π)d/21exp{2zTz})zxz=K(z)zxz

z = x − x i h \mathbf{z}=\frac{\mathbf{x}-\mathbf{x}_i}{h} z=hxxi 代入得: ∂ ∂ x K ( x − x i h ) = K ( x − x i h ) ⋅ ( x i − x h ) ⋅ ( 1 h ) \frac{\partial}{\partial \mathbf{x}} K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right)=K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right) \cdot\left(\frac{\mathbf{x}_{i}-\mathbf{x}}{h}\right) \cdot\left(\frac{1}{h}\right) xK(hxxi)=K(hxxi)(hxix)(h1)

故有: ∇ f ^ ( x ) = 1 n h d + 2 ∑ i = 1 n K ( x − x i h ) ⋅ ( x i − x ) \nabla \hat{f}(\mathbf{x})=\frac{1}{n h^{d+2}} \sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}-\mathbf{x}_{i}}{h}\right) \cdot\left(\mathbf{x}_{i}-\mathbf{x}\right) f^(x)=nhd+21i=1nK(hxxi)(xix)

则: 1 n h d + 2 ∑ i = 1 n K ( x ∗ − x i h ) ⋅ ( x i − x ∗ ) = 0 \frac{1}{n h^{d+2}} \sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}^*-\mathbf{x}_{i}}{h}\right) \cdot\left(\mathbf{x}_{i}-\mathbf{x}^*\right)=0 nhd+21i=1nK(hxxi)(xix)=0

故有: x ∗ = ∑ i = 1 n K ( x ∗ − x i h ) ⋅ x i ∑ i = 1 n K ( x ∗ − x i h ) ( 2 ) \mathbf{x}^*=\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}^*-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}^*-\mathbf{x}_{i}}{h}\right)}\quad (2) x=i=1nK(hxxi)i=1nK(hxxi)xi(2)

由 (1):$\mathbf{x}_{t+1}-\mathbf{x}_{t}=\delta \cdot \nabla \hat{f}(\mathbf{x}_{t})$。当 $\mathbf{x}_t$ 靠近 $\mathbf{x}^*$ 时 $\nabla \hat{f}(\mathbf{x}_{t})\approx\mathbf{0}$,代入上面的梯度表达式得:
$\mathbf{x}_t\approx\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)}$

故取不动点迭代:$\mathbf{x}_{t+1}=\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)}$,即以 (2) 式右端作为更新公式。

Def.2 C ⊆ D C\subseteq \mathbf{D} CD 是基于密度的类,如果存在密度吸引子 x 1 ∗ , … , x m ∗ \mathbf{x}^*_1,\dots,\mathbf{x}^*_m x1,,xm s . t : s.t: s.t:

  1. ∀ x ∈ C \forall \mathbf{x}\in C xC 都有某个 x i ∗ \mathbf{x}^*_i xi 使得, x i ∗ \mathbf{x}^*_i xi x \mathbf{x} x 的密度吸引子;
  2. ∀ i , f ^ ( x i ∗ ) ≥ ξ \forall i,\hat{f}(\mathbf{x}^*_i)\ge \xi i,f^(xi)ξ,其中 ξ \xi ξ 是用户指定的极小密度阈值;
  3. ∀ x i ∗ , x j ∗ \forall\mathbf{x}^*_i,\mathbf{x}^*_j xi,xj 都密度可达,即存在路径从 x i ∗ \mathbf{x}^*_i xi x j ∗ \mathbf{x}^*_j xj 使得路径上所有点 y \mathbf{y} y 都有 f ^ ( y ) ≥ ξ \hat{f}(\mathbf{y})\ge\xi f^(y)ξ

算法15.2 : DENCLUE 算法

输入: D , h , ξ , ϵ \mathbf{D},h,\xi,\epsilon D,h,ξ,ϵ

输出: C \mathcal{C} C (基于密度的聚类)

  1. A ← ∅ \mathcal{A}\leftarrow\emptyset A

  2. 对每一个 x ∈ D \mathbf{x}\in \mathbf{D} xD:

    2.1 x ∗ ← F I N D A T T R A C T O R ( x , D , h , ξ , ϵ ) \mathbf{x}^* \leftarrow FINDATTRACTOR(\mathbf{x},\mathbf{D},h,\xi,\epsilon) xFINDATTRACTOR(x,D,h,ξ,ϵ)

    2.2 R ( x ∗ ) ← ∅ R(\mathbf{x}^*)\leftarrow\emptyset R(x)

    2.3 if f ^ ( x ∗ ) ≥ ξ \hat{f}(\mathbf{x}^*)\ge \xi f^(x)ξ then:

    2.4 A ← A ∪ { x ∗ } \mathcal{A}\leftarrow \mathcal{A}\cup\{ \mathbf{x}^*\} AA{x}

    2.5 $R(\mathbf{x}^*)\leftarrow R(\mathbf{x}^*)\cup\{ \mathbf{x} \}$(把被吸引的点 $\mathbf{x}$ 记入 $\mathbf{x}^*$ 的归属集合)

  3. C ← { maximal  C ⊆ A ∣ ∀ x i ∗ , x j ∗ ∈ C , 满 足 D e f   2 条 件 3 } \mathcal{C}\leftarrow\{\text{maximal}\ C \subseteq \mathcal{A}| \forall\mathbf{x}^*_i,\mathbf{x}^*_j \in C, 满足 Def \ 2 条件3 \} C{maximal CAxi,xjC,Def 23}

  4. ∀ C ∈ C : \forall C \in \mathcal{C}: CC:

    4.1 对每一个 x ∗ ∈ C \mathbf{x}^*\in C xC,令 C ← C ∪ R ( x ∗ ) C\leftarrow C\cup R(\mathbf{x}^*) CCR(x)

  5. Return C \mathcal{C} C

F I N D A T T R A C T O R ( x , D , h , ξ , ϵ ) FINDATTRACTOR(\mathbf{x},\mathbf{D},h,\xi,\epsilon) FINDATTRACTOR(x,D,h,ξ,ϵ):

  1. t ← 0 t\leftarrow 0 t0

  2. x t = x \mathbf{x}_{t}=\mathbf{x} xt=x

  3. Repeat:

    x t + 1 ← ∑ i = 1 n K ( x t − x i h ) ⋅ x i ∑ i = 1 n K ( x t − x i h ) \mathbf{x}_{t+1}\leftarrow\frac{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)\cdot \mathbf{x}_{i}}{\sum\limits_{i=1}^{n} K\left(\frac{\mathbf{x}_t-\mathbf{x}_{i}}{h}\right)} xt+1i=1nK(hxtxi)i=1nK(hxtxi)xi

    t ← t + 1 t\leftarrow t+1 tt+1

  4. Until ∣ ∣ x t − x t − 1 ∣ ∣ < ϵ ||\mathbf{x}_{t}-\mathbf{x}_{t-1}||<\epsilon xtxt1<ϵ
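公式 (2) 的不动点迭代即均值漂移(mean shift)爬山,FINDATTRACTOR 可按此实现(示意代码,数据与参数均为笔者假设):

```python
import numpy as np

def gauss_kernel(Z):
    """多维高斯核 K(z),按行作用"""
    d = Z.shape[-1]
    return np.exp(-0.5 * np.sum(Z * Z, axis=-1)) / (2 * np.pi) ** (d / 2)

def find_attractor(x, D, h, eps=1e-5, max_iter=1000):
    """从 x 出发做公式 (2) 的不动点迭代,直到 ||x_t − x_{t−1}|| < ε"""
    xt = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        w = gauss_kernel((xt - D) / h)                  # 权重 K((x_t − x_i)/h)
        xt_next = (w[:, None] * D).sum(axis=0) / w.sum()
        if np.linalg.norm(xt_next - xt) < eps:
            break
        xt = xt_next
    return xt_next

# 两团数据:从任一点出发都应爬向所在团的密度峰附近
rng = np.random.default_rng(0)
D = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
attractor = find_attractor([0.5, 0.5], D, h=1.0)
```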

Chapter 20: Linear Discriminant Analysis

Set up D = { ( x i , y i ) } i = 1 n \mathbf{D}=\{(\mathbf{x}_i,y_i) \}_{i=1}^n D={(xi,yi)}i=1n, 其中 y i = 1 , 2 y_i=1,2 yi=1,2(或 ± 1 \pm 1 ±1 等), D 1 = { x i ∣ y i = 1 } \mathbf{D}_1=\{\mathbf{x}_i|y_i=1 \} D1={xiyi=1} D 2 = { x i ∣ y i = 2 } \mathbf{D}_2=\{\mathbf{x}_i|y_i=2 \} D2={xiyi=2}

Goal:寻找向量 w ∈ R d \mathbf{w}\in \mathbb{R}^d wRd (代表直线方向)使得 D 1 , D 2 \mathbf{D}_1,\mathbf{D}_2 D1,D2 的“平均值”距离最大且“总方差”最小。

20.1 Normal LDA

w ∈ R d , w T w = 1 \mathbf{w} \in \mathbb{R}^d,\mathbf{w}^T\mathbf{w}=1 wRd,wTw=1,则 x i \mathbf{x}_i xi w \mathbf{w} w 方向上的投影为 x i ′ = ( w T x i w T u ) w = a i w , a i = w T x i \mathbf{x}_{i}^{\prime}=\left(\frac{\mathbf{w}^{T} \mathbf{x}_{i}}{\mathbf{w}^{T} \mathbf{u}}\right) \mathbf{w}=a_{i} \mathbf{w},a_{i}=\mathbf{w}^{T} \mathbf{x}_{i} xi=(wTuwTxi)w=aiw,ai=wTxi

D 1 \mathbf{D}_1 D1 中数据在 w \mathbf{w} w 上的投影平均值为:( ∣ D 1 ∣ = n 1 |\mathbf{D}_1|=n_1 D1=n1
m 1 : = 1 n 1 ∑ x i ∈ D 1 a i = μ 1 T w m_1:=\frac{1}{n_1}\sum\limits_{\mathbf{x}_i\in \mathbf{D}_1}a_i=\boldsymbol{\mu}_1^T\mathbf{w} m1:=n11xiD1ai=μ1Tw
投影平均值等于平均值的投影。

类似地: D 2 \mathbf{D}_2 D2 中数据在 w \mathbf{w} w 上的投影平均值为:
m 2 : = 1 n 2 ∑ x i ∈ D 2 a i = μ 2 T w m_2:=\frac{1}{n_2}\sum\limits_{\mathbf{x}_i\in \mathbf{D}_2}a_i=\boldsymbol{\mu}_2^T\mathbf{w} m2:=n21xiD2ai=μ2Tw
目标之一:寻找 w \mathbf{w} w 使得 ( m 1 − m 2 ) 2 (m_1-m_2)^2 (m1m2)2 最大。

对于 D i \mathbf{D}_i Di,定义:
s i 2 = ∑ x k ∈ D i ( a k − m i ) 2 s_i^2=\sum\limits_{\mathbf{x}_k\in \mathbf{D}_i}(a_k-m_i)^2 si2=xkDi(akmi)2
注意: s i 2 = n i σ i 2   ( ∣ D i ∣ = n i ) s_i^2=n_i\sigma^2_i\ (|D_i|=n_i) si2=niσi2 (Di=ni)

Goal:Fisher LDA目标函数:
max ⁡ w J ( w ) = ( m 1 − m 2 ) 2 s 1 2 + s 2 2 \max\limits_{\mathbf{w}}J(\mathbf{w})=\frac{(m_1-m_2)^2}{s_1^2+s_2^2} wmaxJ(w)=s12+s22(m1m2)2
注意: J ( w ) = J ( w 1 , w 2 , ⋯   , w d ) J(\mathbf{w})=J(w_1,w_2,\cdots,w_d) J(w)=J(w1,w2,,wd)
( m 1 − m 2 ) 2 = ( w T ( μ 1 − μ 2 ) ) 2 = w T ( ( μ 1 − μ 2 ) ( μ 1 − μ 2 ) T ) w = w T B w \begin{aligned} \left(m_{1}-m_{2}\right)^{2} &=\left(\mathbf{w}^{T}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)\right)^{2} \\ &=\mathbf{w}^{T}\left(\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)^{T}\right) \mathbf{w} \\ &=\mathbf{w}^{T} \mathbf{B} \mathbf{w} \end{aligned} (m1m2)2=(wT(μ1μ2))2=wT((μ1μ2)(μ1μ2)T)w=wTBw

B \mathbf{B} B 被称为类间扩散矩阵

s 1 2 = ∑ x i ∈ D 1 ( a i − m 1 ) 2 = ∑ x i ∈ D 1 ( w T x i − w T μ 1 ) 2 = ∑ x i ∈ D 1 ( w T ( x i − μ 1 ) ) 2 = w T ( ∑ x i ∈ D 1 ( x i − μ 1 ) ( x i − μ 1 ) T ) w = w T S 1 w \begin{aligned} s_{1}^{2} &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(a_{i}-m_{1}\right)^{2} \\ &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(\mathbf{w}^{T} \mathbf{x}_{i}-\mathbf{w}^{T} \boldsymbol{\mu}_{1}\right)^{2} \\ &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(\mathbf{w}^{T}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{1}\right)\right)^{2} \\ &=\mathbf{w}^{T}\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{1}\right)\left(\mathbf{x}_{i}-\boldsymbol{\mu}_{1}\right)^{T}\right) \mathbf{w} \\ &=\mathbf{w}^{T} \mathbf{S}_{1} \mathbf{w} \end{aligned} s12=xiD1(aim1)2=xiD1(wTxiwTμ1)2=xiD1(wT(xiμ1))2=wT(xiD1(xiμ1)(xiμ1)T)w=wTS1w

S 1 \mathbf{S}_{1} S1 被称为 D 1 \mathbf{D}_1 D1 的扩散矩阵 S 1 = n 1 Σ 1 \mathbf{S}_{1}=n_1\Sigma_1 S1=n1Σ1

类似地, s 2 2 = w T S 2 w s_{2}^{2}=\mathbf{w}^{T} \mathbf{S}_{2} \mathbf{w} s22=wTS2w

S = S 1 + S 2 \mathbf{S}=\mathbf{S}_{1}+\mathbf{S}_{2} S=S1+S2,则
max ⁡ w J ( w ) = ( m 1 − m 2 ) 2 s 1 2 + s 2 2 = w T B w w T S w \max\limits_{\mathbf{w}}J(\mathbf{w})=\frac{(m_1-m_2)^2}{s_1^2+s_2^2}=\frac{\mathbf{w}^{T} \mathbf{B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S} \mathbf{w}} wmaxJ(w)=s12+s22(m1m2)2=wTSwwTBw

注意:
d d w J ( w ) = 2 B w ( w T S w ) − 2 S w ( w T B w ) ( w T S w ) 2 = 0 \frac{d}{d\mathbf{w}}J(\mathbf{w})=\frac{2\mathbf{B}\mathbf{w}(\mathbf{w}^T\mathbf{S}\mathbf{w})-2\mathbf{S}\mathbf{w}(\mathbf{w}^T\mathbf{B}\mathbf{w})}{(\mathbf{w}^T\mathbf{S}\mathbf{w})^2}=\mathbf{0} dwdJ(w)=(wTSw)22Bw(wTSw)2Sw(wTBw)=0
即有:
B w ( w T S w ) = S w ( w T B w ) B w = S w ⋅ w T B w w T S w B w = J ( w ) ⋅ S w ( ∗ ) \begin{aligned} \mathbf{B}\mathbf{w}(\mathbf{w}^T\mathbf{S}\mathbf{w})&=\mathbf{S}\mathbf{w}(\mathbf{w}^T\mathbf{B}\mathbf{w})\\ \mathbf{B}\mathbf{w}&=\mathbf{S}\mathbf{w}\cdot\frac{\mathbf{w}^{T} \mathbf{B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S} \mathbf{w}}\\ \mathbf{B}\mathbf{w}&=J(\mathbf{w})\cdot\mathbf{S} \mathbf{w}\quad (*) \end{aligned} Bw(wTSw)BwBw=Sw(wTBw)=SwwTSwwTBw=J(w)Sw()
S − 1 \mathbf{S}^{-1} S1 存在,则
S − 1 B w = J ( w ) ⋅ w \mathbf{S}^{-1}\mathbf{B}\mathbf{w}=J(\mathbf{w})\cdot\mathbf{w} S1Bw=J(w)w
故求 $J(\mathbf{w})$ 的最大值,只需取 $\mathbf{S}^{-1}\mathbf{B}$ 的最大特征值,$\mathbf{w}$ 为对应的特征向量。

☆ 不求特征向量求出 w \mathbf{w} w 的方法

B = ( μ 1 − μ 2 ) ( μ 1 − μ 2 ) T \mathbf{B}=(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T} B=(μ1μ2)(μ1μ2)T 代入 ( ∗ ) (*) ()
( μ 1 − μ 2 ) ( μ 1 − μ 2 ) T w = J ( w ) ⋅ S w S − 1 ( μ 1 − μ 2 ) [ ( μ 1 − μ 2 ) T w J ( w ) ] = w \begin{aligned} (\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\mathbf{w} &=J(\mathbf{w})\cdot\mathbf{S} \mathbf{w}\\ \mathbf{S}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})[\frac{(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\mathbf{w}}{J(\mathbf{w})}]&=\mathbf{w} \end{aligned} (μ1μ2)(μ1μ2)TwS1(μ1μ2)[J(w)(μ1μ2)Tw]=J(w)Sw=w
故只需计算 S − 1 ( μ 1 − μ 2 ) \mathbf{S}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}) S1(μ1μ2),再单位化。
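该闭式解可以与 $\mathbf{S}^{-1}\mathbf{B}$ 的特征向量解相互验证(示意代码,二维玩具数据为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(3)
D1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
D2 = rng.normal([4.0, 2.0], 1.0, size=(120, 2))

mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
S = (D1 - mu1).T @ (D1 - mu1) + (D2 - mu2).T @ (D2 - mu2)   # S = S₁ + S₂

w = np.linalg.solve(S, mu1 - mu2)      # 闭式解 w ∝ S⁻¹(μ₁ − μ₂)
w /= np.linalg.norm(w)                 # 单位化

# 与 S⁻¹B 的最大特征向量比较
B = np.outer(mu1 - mu2, mu1 - mu2)     # 类间扩散矩阵(秩 1)
vals, vecs = np.linalg.eig(np.linalg.inv(S) @ B)
v = vecs[:, np.argmax(vals.real)].real
v /= np.linalg.norm(v)

assert abs(float(w @ v)) > 0.999       # 两者方向一致(至多差一个符号)
```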

20.2 Kernel LDA:

事实1:如果 ( S ϕ − 1 B ϕ ) w = λ w \left(\mathbf{S}_{\phi}^{-1} \mathbf{B}_{\phi}\right) \mathbf{w}=\lambda \mathbf{w} (Sϕ1Bϕ)w=λw,那么 w = ∑ j = 1 n a j ϕ ( x j ) \mathbf{w}=\sum\limits_{j=1}^na_j\phi(\mathbf{x}_j) w=j=1najϕ(xj),证明见讲稿最后两页。

a = ( a 1 , ⋯   , a n ) T \mathbf{a}=(a_1,\cdots,a_n)^T a=(a1,,an)T 是“事实1”中的向量。

下面将 $\max\limits_{\mathbf{w}}J(\mathbf{w})=\frac{(m_1-m_2)^2}{s_1^2+s_2^2}=\frac{\mathbf{w}^{T} \mathbf{B}_{\phi} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S}_{\phi} \mathbf{w}}$ 的问题转化为 $\max G(\mathbf{a})$,使其仅通过核矩阵 $\mathbf{K}$ 即可求解。

注意到:
$\begin{aligned} m_{i}=\mathbf{w}^{T} \boldsymbol{\mu}_{i}^{\phi} &=\left(\sum_{j=1}^{n} a_{j} \phi\left(\mathbf{x}_{j}\right)\right)^{T}\left(\frac{1}{n_{i}} \sum_{\mathbf{x}_{k} \in \mathbf{D}_{i}} \phi\left(\mathbf{x}_{k}\right)\right) \\ &=\frac{1}{n_{i}} \sum_{j=1}^{n} \sum_{\mathbf{x}_{k} \in \mathbf{D}_{i}} a_{j} \phi\left(\mathbf{x}_{j}\right)^{T} \phi\left(\mathbf{x}_{k}\right) \\ &=\frac{1}{n_{i}} \sum_{j=1}^{n} \sum_{\mathbf{x}_{k} \in \mathbf{D}_{i}} a_{j} K\left(\mathbf{x}_{j}, \mathbf{x}_{k}\right) \\ &=\mathbf{a}^{T} \mathbf{m}_{i} \end{aligned}$
其中,
m i = 1 n i ( ∑ x k ∈ D i K ( x 1 , x k ) ∑ x k ∈ D i K ( x 2 , x k ) ⋮ ∑ x k ∈ D i K ( x n , x k ) ) n × 1 \mathbf{m}_{i}=\frac{1}{n_{i}}\left(\begin{array}{c} \sum\limits_{\mathbf{x}_{k} \in \mathbf{D}_{i}} K\left(\mathbf{x}_{1}, \mathbf{x}_{k}\right) \\ \sum\limits_{\mathbf{x}_{k} \in \mathbf{D}_{i}} K\left(\mathbf{x}_{2}, \mathbf{x}_{k}\right) \\ \vdots \\ \sum\limits_{\mathbf{x}_{k} \in \mathbf{D}_{i}} K\left(\mathbf{x}_{n}, \mathbf{x}_{k}\right) \end{array}\right)_{n\times 1} mi=ni1xkDiK(x1,xk)xkDiK(x2,xk)xkDiK(xn,xk)n×1

( m 1 − m 2 ) 2 = ( w T μ 1 ϕ − w T μ 2 ϕ ) 2 = ( a T m 1 − a T m 2 ) 2 = a T ( m 1 − m 2 ) ( m 1 − m 2 ) T a = a T M a \begin{aligned} \left(m_{1}-m_{2}\right)^{2} &=\left(\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}-\mathbf{w}^{T} \boldsymbol{\mu}_{2}^{\phi}\right)^{2} \\ &=\left(\mathbf{a}^{T} \mathbf{m}_{1}-\mathbf{a}^{T} \mathbf{m}_{2}\right)^{2} \\ &=\mathbf{a}^{T}\left(\mathbf{m}_{1}-\mathbf{m}_{2}\right)\left(\mathbf{m}_{1}-\mathbf{m}_{2}\right)^{T} \mathbf{a} \\ &=\mathbf{a}^{T} \mathbf{M a} \end{aligned} (m1m2)2=(wTμ1ϕwTμ2ϕ)2=(aTm1aTm2)2=aT(m1m2)(m1m2)Ta=aTMa
M \mathbf{M} M 被称为核类间扩散矩阵)
s 1 2 = ∑ x i ∈ D 1 ∥ w T ϕ ( x i ) − w T μ 1 ϕ ∥ 2 = ∑ x i ∈ D 1 ∥ w T ϕ ( x i ) ∥ 2 − 2 ∑ x i ∈ D 1 w T ϕ ( x i ) ⋅ w T μ 1 ϕ + ∑ x i ∈ D 1 ∥ w T μ 1 ϕ ∥ 2 = ( ∑ x i ∈ D 1 ∥ ∑ j = 1 n a j ϕ ( x j ) T ϕ ( x i ) ∥ 2 ) − 2 ⋅ n 1 ⋅ ∥ w T μ 1 ϕ ∥ 2 + n 1 ⋅ ∥ w T μ 1 ϕ ∥ 2 = ( ∑ x i ∈ D 1 a T K i K i T a ) − n 1 ⋅ a T m 1 m 1 T a = a T ( ( ∑ x i ∈ D 1 K i K i T ) − n 1 m 1 m 1 T ) a = a T N 1 a \begin{aligned} s_{1}^{2} &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)-\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2} \\ &=\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right)\right\|^{2}-2 \sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}} \mathbf{w}^{T} \phi\left(\mathbf{x}_{i}\right) \cdot \mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}+\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2} \\ &=\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}}\left\|\sum_{j=1}^{n} a_{j} \phi\left(\mathbf{x}_{j}\right)^{T} \phi\left(\mathbf{x}_{i}\right)\right\|^{2}\right)-2 \cdot n_{1} \cdot\left\|\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2}+n_{1} \cdot\left\|\mathbf{w}^{T} \boldsymbol{\mu}_{1}^{\phi}\right\|^{2}\\ &=\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}} \mathbf{a}^{T} \mathbf{K}_{i} \mathbf{K}_{i}^{T} \mathbf{a}\right)-n_{1} \cdot \mathbf{a}^{T} \mathbf{m}_{1} \mathbf{m}_{1}^{T} \mathbf{a}\\ &=\mathbf{a}^{T}\left(\left(\sum_{\mathbf{x}_{i} \in \mathbf{D}_{1}} \mathbf{K}_{i} \mathbf{K}_{i}^{T}\right)-n_{1} \mathbf{m}_{1} \mathbf{m}_{1}^{T}\right) \mathbf{a} \\ &=\mathbf{a}^{T} \mathbf{N}_{1} \mathbf{a} \end{aligned} s12=xiD1wTϕ(xi)wTμ1ϕ2=xiD1wTϕ(xi)22xiD1wTϕ(xi)wTμ1ϕ+xiD1wTμ1ϕ2=xiD1j=1najϕ(xj)Tϕ(xi)22n1wTμ1ϕ2+n1wTμ1ϕ2=(xiD1aTKiKiTa)n1aTm1m1Ta=aT((xiD1KiKiT)n1m1m1T)a=aTN1a
类似地,令 N 2 = ( ∑ x i ∈ D 2 K i K i T − n 2 m 2 m 2 T ) \mathbf{N}_2=\left(\sum\limits_{\mathbf{x}_{i} \in \mathbf{D}_{2}} \mathbf{K}_{i} \mathbf{K}_{i}^{T}-n_{2} \mathbf{m}_{2} \mathbf{m}_{2}^{T}\right) N2=(xiD2KiKiTn2m2m2T)

s 1 2 + s 2 2 = a T ( N 1 + N 2 ) a = a T N a s_1^2+s_2^2=\mathbf{a}^{T} (\mathbf{N}_{1}+\mathbf{N}_{2}) \mathbf{a}=\mathbf{a}^{T}\mathbf{N} \mathbf{a} s12+s22=aT(N1+N2)a=aTNa

故: J ( w ) = a T M a a T N a : = G ( a ) J(\mathbf{w})=\frac{\mathbf{a}^{T}\mathbf{M} \mathbf{a}}{\mathbf{a}^{T}\mathbf{N} \mathbf{a}}:=G(\mathbf{a}) J(w)=aTNaaTMa:=G(a)

类似 20.1, M a = λ N a \mathbf{M} \mathbf{a}=\lambda\mathbf{N} \mathbf{a} Ma=λNa

  • N − 1 \mathbf{N} ^{-1} N1 存在, N − 1 M a = λ a \mathbf{N}^{-1} \mathbf{M} \mathbf{a}=\lambda \mathbf{a} N1Ma=λa λ \lambda λ N − 1 M \mathbf{N}^{-1} \mathbf{M} N1M 的最大特征值, a \mathbf{a} a 是相应的特征向量。

  • 若 $\mathbf{N}^{-1}$ 不存在,可改用广义逆(如 MATLAB 的 `pinv`)求解。

最后考查 w T w = 1 \mathbf{w}^T\mathbf{w}=1 wTw=1,即

( ∑ j = 1 n a j ϕ ( x j ) ) T ( ∑ i = 1 n a i ϕ ( x i ) ) = 1 ∑ j = 1 n ∑ i = 1 n a j a i ϕ ( x j ) T ϕ ( x i ) = 1 ∑ j = 1 n ∑ i = 1 n a j a i K ( x i , x j ) = 1 a T K a = 1 \begin{aligned} (\sum\limits_{j=1}^na_j\phi(\mathbf{x}_j))^T(\sum\limits_{i=1}^na_i\phi(\mathbf{x}_i))&=1\\ \sum\limits_{j=1}^n\sum\limits_{i=1}^na_ja_i\phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)&=1\\ \sum\limits_{j=1}^n\sum\limits_{i=1}^na_ja_iK(\mathbf{x}_i,\mathbf{x}_j)&=1\\ \mathbf{a}^T\mathbf{K}\mathbf{a}&=1 \end{aligned} (j=1najϕ(xj))T(i=1naiϕ(xi))j=1ni=1najaiϕ(xj)Tϕ(xi)j=1ni=1najaiK(xi,xj)aTKa=1=1=1=1
求出 N − 1 M \mathbf{N}^{-1} \mathbf{M} N1M 的特征向量 a \mathbf{a} a 后, a ← a a T K a \mathbf{a}\leftarrow \frac{\mathbf{a}}{\sqrt{\mathbf{a}^T\mathbf{K}\mathbf{a}}} aaTKa a 以保证 w T w = 1 \mathbf{w}^T\mathbf{w}=1 wTw=1
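Kernel LDA 的整个流程可按上述公式直接实现;取线性核 $K(\mathbf{x},\mathbf{z})=\mathbf{x}^T\mathbf{z}$ 时,其方向应退化为 20.1 节普通 LDA 的解,可据此做正确性检查(示意代码,玩具数据为笔者假设):

```python
import numpy as np

rng = np.random.default_rng(1)
D1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))
D2 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))
X = np.vstack([D1, D2])
n, n1, n2 = len(X), len(D1), len(D2)

K = X @ X.T                                   # 线性核的核矩阵
idx1, idx2 = np.arange(n1), np.arange(n1, n)
m1 = K[:, idx1].sum(axis=1) / n1              # 向量 m₁
m2 = K[:, idx2].sum(axis=1) / n2              # 向量 m₂
M = np.outer(m1 - m2, m1 - m2)                # 核类间扩散矩阵 M

def N_part(idx, m, ni):
    Ksub = K[:, idx]                          # 各列即 K_i
    return Ksub @ Ksub.T - ni * np.outer(m, m)

N = N_part(idx1, m1, n1) + N_part(idx2, m2, n2)

vals, vecs = np.linalg.eig(np.linalg.pinv(N) @ M)   # N 奇异时用广义逆
a = vecs[:, np.argmax(vals.real)].real
w = X.T @ a                                   # 线性核下 w = Σ a_j x_j
w /= np.linalg.norm(w)

# 与 20.1 的闭式解方向比较
mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
S = (D1 - mu1).T @ (D1 - mu1) + (D2 - mu2).T @ (D2 - mu2)
w_lda = np.linalg.solve(S, mu1 - mu2)
w_lda /= np.linalg.norm(w_lda)
assert abs(float(w @ w_lda)) > 0.99
```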

Chapter 21: Support Vector Machines (SVM)

21.1 支撑向量与余量

Set up D = { ( x i , y i ) } i = 1 n , x i ∈ R d , y i ∈ { − 1 , 1 } \mathbf{D}=\{(\mathbf{x}_i,y_i) \}_{i=1}^n,\mathbf{x}_i \in \mathbb{R}^d,y_i \in \{-1,1 \} D={(xi,yi)}i=1n,xiRd,yi{1,1},仅两类数据。

  • 超平面 (hyperplanes, d − 1 d-1 d1 维): h ( x ) : = w T x + b = w 1 x 1 + ⋯ + w d x d + b h(\mathbf{x}):=\mathbf{w}^T\mathbf{x}+b=w_1x_1+ \cdots +w_dx_d+b h(x):=wTx+b=w1x1++wdxd+b

    其中, w \mathbf{w} w 是法向量, − b w i -{b\over w_i} wib x i x_i xi 轴上的截距。

  • D \mathbf{D} D 称为是线性可分的,如果存在 h ( x ) h(\mathbf{x}) h(x) 使得对所有 y i = 1 y_i=1 yi=1 的点 x i \mathbf{x}_i xi h ( x i ) > 0 h(\mathbf{x}_i)>0 h(xi)>0 ,且对所有 y i = − 1 y_i=-1 yi=1 的点 x i \mathbf{x}_i xi h ( x i ) < 0 h(\mathbf{x}_i)<0 h(xi)<0 ,并将此 h ( x ) h(\mathbf{x}) h(x) 称为分离超平面。

    Remark:对于线性可分的 D \mathbf{D} D ,分离超平面有无穷多个。

  • 点到超平面的距离:
    x = x p + x r = x p + r ⋅ w ∣ ∣ w ∣ ∣ h ( x ) = h ( x p + r w ∥ w ∥ ) = w T ( x p + r w ∥ w ∥ ) + b = w T x p + b ⏟ h ( x p ) + r w T w ∥ w ∥ = h ( x p ) ⏟ 0 + r ∥ w ∥ = r ∥ w ∥ \mathbf{x}=\mathbf{x}_p+\mathbf{x}_r=\mathbf{x}_p+r\cdot\frac{\mathbf{w}}{||\mathbf{w}||}\\ \begin{aligned} h(\mathbf{x}) &=h\left(\mathbf{x}_{p}+r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right) \\ &=\mathbf{w}^{T}\left(\mathbf{x}_{p}+r \frac{\mathbf{w}}{\|\mathbf{w}\|}\right)+b \\ &=\underbrace{\mathbf{w}^{T}\mathbf{x}_{p}+b}_{h\left(\mathbf{x}_{p}\right)}+r \frac{\mathbf{w}^{T} \mathbf{w}}{\|\mathbf{w}\|}\\ &=\underbrace{h\left(\mathbf{x}_{p}\right)}_{0}+r\|\mathbf{w}\| \\ &=r\|\mathbf{w}\| \end{aligned} x=xp+xr=xp+rwwh(x)=h(xp+rww)=wT(xp+rww)+b=h(xp) wTxp+b+rwwTw=0 h(xp)+rw=rw

    $\therefore r=\frac{h(\mathbf{x})}{\|\mathbf{w}\|},\quad |r|=\frac{|h(\mathbf{x})|}{\|\mathbf{w}\|}$

∀ x i ∈ D \forall \mathbf{x}_i \in \mathbf{D} xiD h ( x ) h(\mathbf{x}) h(x) 的距离是 y i h ( x i ) ∥ w ∥ y_i\frac{h(\mathbf{x}_i)}{\|\mathbf{w}\|} yiwh(xi)
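上面的带符号距离公式可以用一个小例子检查(示意代码,超平面与点均为笔者假设):

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -5.0      # 超平面 h(x) = 3x₁ + 4x₂ − 5 = 0
x = np.array([2.0, 3.0])

r = (w @ x + b) / np.linalg.norm(w)    # 带符号距离 r = h(x)/‖w‖,此例中为 13/5 = 2.6
x_p = x - r * w / np.linalg.norm(w)    # 投影点 x_p = x − r·w/‖w‖
assert np.isclose(w @ x_p + b, 0.0)    # x_p 确实落在超平面上
```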


  • 给定线性可分的 D \mathbf{D} D ,及分离超平面 h ( x ) h(\mathbf{x}) h(x) ,定义余量:
    δ ∗ = min ⁡ x i { y i ( w T x i + b ) ∥ w ∥ } \delta^*=\min\limits_{\mathbf{x}_i}\{\frac{y_i(\mathbf{w}^T\mathbf{x}_i+b)}{\|\mathbf{w}\|} \} δ=ximin{wyi(wTxi+b)}
    D \mathbf{D} D 中点到 h ( x ) h(\mathbf{x}) h(x) 距离的最小值,使得该 δ ∗ \delta^* δ 取到的数据点 x i \mathbf{x}_i xi 被称为支撑向量(可能不唯一)。

  • 标准超平面:对 ∀ h ( x ) = w T x + b \forall h(\mathbf{x})=\mathbf{w}^T\mathbf{x}+b h(x)=wTx+b,以及任意 s ∈ R ∖ { 0 } s\in \mathbb{R}\setminus \{0\} sR{0} s ( w T x + b ) = 0 s(\mathbf{w}^T\mathbf{x}+b)=0 s(wTx+b)=0 h ( x ) = 0 h(\mathbf{x})=0 h(x)=0 是同一超平面。

    x ∗ \mathbf{x}^* x 是支撑向量,若 s y ∗ ( w T x ∗ + b ) = 1   ( 1 ) sy^*(\mathbf{w}^T\mathbf{x}^*+b)=1\ (1) sy(wTx+b)=1 (1) ,则称 s h ( x ) = 0 sh(\mathbf{x})=0 sh(x)=0 是标准超平面。

    ( 1 ) (1) (1) 可得: s = 1 y ∗ ( w T x ∗ + b ) = 1 y ∗ h ( x ∗ ) s=\frac{1}{y^*(\mathbf{w}^T\mathbf{x}^*+b)}=\frac{1}{y^*h(\mathbf{x}^*)} s=y(wTx+b)1=yh(x)1

    此时,对于 s h ( x ) = 0 sh(\mathbf{x})=0 sh(x)=0 ,余量 δ ∗ = y ∗ h ( x ∗ ) ∥ w ∥ = 1 ∥ w ∥ \delta^*=\frac{y^*h(\mathbf{x}^*)}{\|\mathbf{w}\|}=\frac{1}{\|\mathbf{w}\|} δ=wyh(x)=w1

事实:如果 w T x + b = 0 \mathbf{w}^T\mathbf{x}+b=0 wTx+b=0 是标准超平面,对 ∀ x i \forall \mathbf{x}_i xi ,一定有 y i ( w T x i + b ) ≥ 1 y_i(\mathbf{w}^T\mathbf{x}_i+b)\ge1 yi(wTxi+b)1

21.2 SVM: 线性可分情形

目标:寻找标准分离超平面使得其余量最大,即 $h^*=\arg\max\limits_{\mathbf{w},b}\{\frac{1}{\|\mathbf{w}\|} \}$

转为优化问题:
min ⁡ w , b   { ∥ w ∥ 2 2 } s.t.  y i ( w T x i + b ) ≥ 1 , ∀ ( x i , y i ) ∈ D \begin{aligned} &\min\limits_{\mathbf{w},b}\ \{\frac{\|\mathbf{w}\|^2}{2} \}\\ &\text{s.t.} \ y_i(\mathbf{w}^T\mathbf{x}_i+b)\ge1,\forall(\mathbf{x}_i,y_i)\in \mathbf{D} \end{aligned} w,bmin {2w2}s.t. yi(wTxi+b)1,(xi,yi)D
引入 Lagrange 乘子 α i ≥ 0 \alpha_i\ge0 αi0 与 KKT 条件:
α i ( y i ( w T x i + b ) − 1 ) = 0 \alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i+b)-1)=0 αi(yi(wTxi+b)1)=0
定义:
$L(\mathbf{w},b,\boldsymbol{\alpha})=\frac{1}{2}\|\mathbf{w}\|^2-\sum\limits_{i=1}^{n}\alpha_i(y_i(\mathbf{w}^T\mathbf{x}_i+b)-1)\ (2)$

∂ ∂ w L = w − ∑ i = 1 n α i y i x i = 0   ( 3 ) ∂ ∂ b L = ∑ i = 1 n α i y i = 0   ( 4 ) \begin{array}{l} \frac{\partial}{\partial \mathbf{w}} L=\mathbf{w}-\sum\limits_{i=1}^{n} \alpha_{i} y_{i} \mathbf{x}_{i}=\mathbf{0}\ (3) \\ \frac{\partial}{\partial b} L=\sum\limits_{i=1}^{n} \alpha_{i} y_{i}=0\ (4) \end{array} wL=wi=1nαiyixi=0 (3)bL=i=1nαiyi=0 (4)

将 (3)(4) 代入 (2) 得:
L d u a l = 1 2 w T w − w T ( ∑ i = 1 n α i y i x i ⏟ w ) − b ∑ i = 1 n α i y i ⏟ 0 + ∑ i = 1 n α i = − 1 2 w T w + ∑ i = 1 n α i = ∑ i = 1 n α i − 1 2 ∑ i = 1 n ∑ j = 1 n α i α j y i y j x i T x j \begin{aligned} L_{d u a l} &=\frac{1}{2} \mathbf{w}^{T} \mathbf{w}-\mathbf{w}^{T}(\underbrace{\sum_{i=1}^{n} \alpha_{i} y_{i} \mathbf{x}_{i}}_{\mathbf{w}})-b\underbrace{ \sum_{i=1}^{n} \alpha_{i} y_{i}}_{0}+\sum_{i=1}^{n} \alpha_{i}\\ &=-\frac{1}{2} \mathbf{w}^{T} \mathbf{w}+\sum_{i=1}^{n} \alpha_{i}\\ &=\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j} \end{aligned} Ldual=21wTwwT(w i=1nαiyixi)b0 i=1nαiyi+i=1nαi=21wTw+i=1nαi=i=1nαi21i=1nj=1nαiαjyiyjxiTxj
故对偶问题为:
max ⁡ α L d u a l = ∑ i = 1 n α i − 1 2 ∑ i = 1 n ∑ j = 1 n α i α j y i y j x i T x j s.t.  ∑ i = 1 n α i y i = 0 , α i ≥ 0 , ∀ i \begin{aligned} &\max\limits_{\alpha} L_{dual}=\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{x}_{i}^{T} \mathbf{x}_{j}\\ &\text{s.t.}\ \sum\limits_{i=1}^{n} \alpha_{i} y_{i}=0,\alpha_i \ge0, \forall i \end{aligned} αmaxLdual=i=1nαi21i=1nj=1nαiαjyiyjxiTxjs.t. i=1nαiyi=0,αi0,i
利用二次规划解出 Dual: α 1 , ⋯   , α n \alpha_1,\cdots,\alpha_n α1,,αn

代入 (3) 可得: w = ∑ i = 1 n α i y i x i \mathbf{w}=\sum\limits_{i=1}^{n} \alpha_{i} y_{i} \mathbf{x}_{i} w=i=1nαiyixi

使得 α i > 0 \alpha_i>0 αi>0 的数据 x i \mathbf{x}_i xi 给出支撑向量。

对于每一个支撑向量:$y_i(\mathbf{w}^T\mathbf{x}_i+b)=1 \Rightarrow b_i=\frac{1}{y_i}-\mathbf{w}^T\mathbf{x}_i=y_i-\mathbf{w}^T\mathbf{x}_i$

b = a v g α i > 0 { b i } b=\mathop{avg}_{\alpha_i>0}\{b_i\} b=avgαi>0{bi}
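对偶问题是一个标准二次规划;下面用 scipy 的通用约束优化器(SLSQP)在一个玩具数据集上求解,并按上述公式还原 $\mathbf{w}$ 与 $b$(示意代码;数据与求解器选择均为笔者假设,实际中常用专门的 QP/SMO 求解器):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 2.0],
              [4.0, 4.0], [5.0, 3.0], [3.0, 5.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
n = len(y)
G = (y[:, None] * y[None, :]) * (X @ X.T)        # G_ij = y_i y_j x_iᵀx_j

def neg_dual(a):                                  # 最大化 L_dual ⇔ 最小化 −L_dual
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(n), jac=lambda a: G @ a - np.ones(n),
               method="SLSQP",
               constraints={"type": "eq", "fun": lambda a: a @ y},  # Σ α_i y_i = 0
               bounds=[(0.0, None)] * n)                             # α_i ≥ 0
alpha = res.x
w = (alpha * y) @ X                               # w = Σ α_i y_i x_i
sv = alpha > 1e-6                                 # 支撑向量:α_i > 0
b = np.mean(y[sv] - X[sv] @ w)                    # b 取各支撑向量 b_i 的平均

assert np.all(np.sign(X @ w + b) == y)            # 训练点全部分类正确
```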
