Please credit the source when reposting: SVM Algorithm Derivation and Python Implementation
This article is aimed at readers who already have the SVM prerequisites (e.g. have read 《统计学习方法》) and want a deeper understanding of the details of SMO (Sequential Minimal Optimization). While implementing a simple SVM myself I spent quite some effort working through this part, so I wrote this thoroughly chewed-over article; the goal is that a reader who follows it in order can definitely understand it. My writing habit: never overcomplicate the simple parts, and always spell out the complicated ones.
Part One. Deriving the SVM dual form that SMO solves
For a support vector machine with a soft margin, the primal problem is:
$$
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2} \| w \|^{2} + C \sum_{i=1}^{N} \xi_{i} \\
\text{s.t.} \quad & y_{i} \left( w \cdot x_{i} + b \right) \geq 1 - \xi_{i}, \quad i=1,2,\cdots,N \\
& \xi_{i} \ge 0, \quad i=1,2,\cdots,N
\end{aligned} \tag{1}
$$
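As a quick numerical illustration of (1), the primal objective and slack variables can be evaluated directly in numpy (all the values below are made up purely for illustration):

```python
import numpy as np

# hypothetical toy values, just to evaluate the primal objective of eq. (1)
w = np.array([1.0, -1.0])
b = 0.5
C = 1.0
X = np.array([[2.0, 0.0], [0.0, 2.0]])   # two samples
y = np.array([1.0, -1.0])

# margins y_i (w . x_i + b)
margins = y * (X @ w + b)
# slack needed to satisfy y_i(w . x_i + b) >= 1 - xi_i with xi_i >= 0
xi = np.maximum(0.0, 1.0 - margins)
objective = 0.5 * np.dot(w, w) + C * xi.sum()
print(margins, xi, objective)  # both margins exceed 1, so no slack is needed
```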
To solve this primal problem:
1. Construct the Lagrangian
Introduce Lagrange multipliers $\alpha_{i} \ge 0, \mu_{i} \ge 0, i = 1, 2, \cdots, N$ and construct the Lagrangian:
$$
L(w,b,\xi,\alpha,\mu) = \frac{1}{2} \| w \|^{2} + C \sum_{i=1}^{N} \xi_{i} + \sum_{i=1}^{N} \alpha_{i} \left(- y_{i} ( w \cdot x_{i} + b ) + 1 - \xi_{i} \right) - \sum_{i=1}^{N} \mu_{i} \xi_{i} \tag{2}
$$
where $\alpha = \left( \alpha_{1}, \alpha_{2}, \cdots, \alpha_{N} \right)^{T}$ and $\mu = \left( \mu_{1}, \mu_{2}, \cdots, \mu_{N} \right)^{T}$ are the Lagrange multiplier vectors; every component is non-negative.
2. Convert to the Lagrange dual problem
For the underlying theory and the derivation of the KKT conditions, see my previous post: SVM之拉格朗日对偶问题与KKT条件推导.
The dual problem is:
$$
\max_{\alpha,\mu} \min_{w,b,\xi} L(w, b,\xi,\alpha,\mu) \tag{3}
$$
3. Solve the inner min first
Treating $\alpha, \mu$ as constants, the minimum is found by setting the partial derivatives to zero:
$$
\nabla_{w} L( w, b, \xi, \alpha, \mu) = w - \sum_{i=1}^{N} \alpha_{i} y_{i} x_{i} = 0 \tag{4}
$$
$$
\nabla_{b} L \left( w, b, \xi, \alpha, \mu \right) = -\sum_{i=1}^{N} \alpha_{i} y_{i} = 0 \tag{5}
$$
$$
\nabla_{\xi_{i}} L \left( w, b, \xi, \alpha, \mu \right) = C - \alpha_{i} - \mu_{i} = 0 \tag{6}
$$
which gives:
$$
w = \sum_{i=1}^N \alpha_i y_i x_i \tag{7}
$$
$$
\sum_{i=1}^N \alpha_i y_i = 0 \tag{8}
$$
$$
C - \alpha_i - \mu_i = 0 \tag{9}
$$
Substituting (7)–(9) into (2) yields (a kernel function is introduced here):
$$
\min_{w,b,\xi} L(w,b,\xi,\alpha,\mu) = -\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i \tag{10}
$$
4. Solve the outer max
Maximize (10) over the multipliers:
$$
\max_{\alpha, \mu} \; -\frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i \tag{11}
$$
$$
\text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0 \tag{12}
$$
$$
C-\alpha_i - \mu_i = 0 \tag{13}
$$
$$
\alpha_i \ge 0 \tag{14}
$$
$$
\mu_i \ge 0 \tag{15}
$$
Constraints (13)–(15) can be simplified to:
$$
0 \le \alpha_i \le C \tag{16}
$$
5. Final form: the dual problem of a convex quadratic program
$$
\min_\alpha \quad \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i \tag{17}
$$
$$
\text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0 \tag{18}
$$
$$
0 \le \alpha_i \le C, \quad i=1,2,\cdots,N \tag{19}
$$
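Given a kernel matrix, the dual objective (17) is a couple of lines of numpy. The toy data below are invented and `dual_objective` is just an illustrative helper, not part of the SMO code later in the post:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Objective of eq. (17): 1/2 * sum_ij a_i a_j y_i y_j K_ij - sum_i a_i."""
    u = alpha * y
    return 0.5 * u @ K @ u - alpha.sum()

# toy example: 3 one-dimensional points, linear kernel
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, -1.0, 1.0])
K = X @ X.T
alpha = np.array([0.5, 1.0, 0.5])   # chosen so that sum_i alpha_i y_i = 0, eq. (18)
print(dual_objective(alpha, y, K))  # → -2.0
```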
The KKT conditions that the converged solution must satisfy (these will be useful when deriving SMO):
$$
\nabla_{w} L( w^*, b^*, \xi^*, \alpha^*, \mu^*) = w^* - \sum_{i=1}^{N} \alpha_{i}^* y_{i} x_{i} = 0 \tag{20}
$$
$$
\nabla_{b} L( w^*, b^*, \xi^*, \alpha^*, \mu^*) = -\sum_{i=1}^{N} \alpha_{i}^* y_{i} = 0 \tag{21}
$$
$$
\nabla_{\xi_{i}} L( w^*, b^*, \xi^*, \alpha^*, \mu^*) = C - \alpha_{i}^* - \mu_{i}^* = 0 \tag{22}
$$
$$
\alpha_i^* \ge 0 \tag{23}
$$
$$
1 - \xi_i^* - y_i(w^*\cdot x_i+b^*) \le 0 \tag{24}
$$
$$
\alpha_i^*(1 - \xi_i^* - y_i ( w^*\cdot x_i+b^*)) = 0 \tag{25}
$$
$$
\mu_i^* \ge 0 \tag{26}
$$
$$
\xi_i^* \ge 0 \tag{27}
$$
$$
\mu_i^* \xi_i^* = 0 \tag{28}
$$
In total: 3 zero-gradient conditions plus 2×3 conditions relating the multipliers to the inequality constraints = 9 KKT conditions.
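As a sketch (not part of SMO itself), the nine conditions can be checked numerically for a candidate solution. The `kkt_violations` helper below is hypothetical, and the toy "optimum" it is fed was constructed by hand for a two-point separable problem:

```python
import numpy as np

def kkt_violations(w, b, alpha, mu, xi, X, y, C):
    """How badly each KKT condition (20)-(28) is violated; all ~0 at an optimum.

    The argument names mirror the starred variables above; this is only a
    numerical sanity-check sketch.
    """
    margins = y * (X @ w + b)
    return {
        "(20) grad_w": np.abs(w - (alpha * y) @ X).max(),
        "(21) grad_b": abs(np.sum(alpha * y)),
        "(22) grad_xi": np.abs(C - alpha - mu).max(),
        "(23) alpha>=0": max(0.0, -alpha.min()),
        "(24) primal": max(0.0, (1 - xi - margins).max()),
        "(25) comp_alpha": np.abs(alpha * (1 - xi - margins)).max(),
        "(26) mu>=0": max(0.0, -mu.min()),
        "(27) xi>=0": max(0.0, -xi.min()),
        "(28) comp_mu": np.abs(mu * xi).max(),
    }

# hand-constructed optimum for x in {-1, +1} with labels -1, +1 and C = 1
X = np.array([[-1.0], [1.0]])
y = np.array([-1.0, 1.0])
viol = kkt_violations(w=np.array([1.0]), b=0.0,
                      alpha=np.array([0.5, 0.5]), mu=np.array([0.5, 0.5]),
                      xi=np.zeros(2), X=X, y=y, C=1.0)
print(max(viol.values()))  # → 0.0
```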
Part Two. Getting to the point: the complete derivation of SMO
SMO picks two components of $\alpha$ and solves the subproblem of (17)–(19) restricted to them; by repeatedly solving such subproblems it approaches the solution of the original problem. How the two components are chosen is discussed later.
1. Pick two components of $\alpha$ and solve the subproblem of (17)–(19); the goal is an update rule for these two components
$$
\begin{aligned}
\min_{\alpha_1, \alpha_2} \quad W(\alpha_1,\alpha_2) = \; & \frac{1}{2}\alpha_1^2K_{11} + \frac{1}{2}\alpha_2^2K_{22} + \alpha_1\alpha_2y_1y_2K_{12} \\
& + \alpha_1y_1\sum_{i=3}^N\alpha_iy_iK_{1i} + \alpha_2y_2\sum_{i=3}^N\alpha_iy_iK_{2i} - \alpha_1 - \alpha_2
\end{aligned} \tag{29}
$$
$$
\text{s.t.} \quad \alpha_1y_1 + \alpha_2y_2 = -\sum_{i=3}^Ny_i\alpha_i = \varsigma \tag{30}
$$
$$
0 \le \alpha_i \le C, \quad i=1,2 \tag{31}
$$
1.1 Substitution
With two variables, first use the rearranged form of (30), $\alpha_1 = (\varsigma - \alpha_2y_2)y_1$, and substitute into (29) to obtain an expression in $\alpha_2$ only:
$$
\begin{aligned}
\min_{\alpha_2} W(\alpha_2) = \; & \frac{1}{2}(\varsigma-\alpha_2y_2)^2K_{11} + \frac{1}{2}\alpha_2^2K_{22}+(\varsigma-\alpha_2y_2)\alpha_2y_2K_{12} \\
& + (\varsigma - \alpha_2y_2)\sum_{i=3}^N\alpha_iy_iK_{1i}+\alpha_2y_2\sum_{i=3}^N\alpha_iy_iK_{2i} - (\varsigma - \alpha_2y_2)y_1 - \alpha_2
\end{aligned} \tag{32}
$$
1.2 Find the stationary point
Clearly, take the partial derivative with respect to $\alpha_2$ and set it to zero:
$$
\frac{\partial W}{\partial \alpha_2} = -\varsigma y_2 K_{11} + K_{11}\alpha_2 + K_{22}\alpha_2 + \varsigma y_2K_{12} - 2K_{12}\alpha_2 - y_2 \sum_{i=3}^N\alpha_iy_iK_{1i} +y_2\sum_{i=3}^N\alpha_iy_iK_{2i} + y_1y_2 - 1 \tag{33}
$$
$$
= (K_{11}+K_{22}-2K_{12})\alpha_2 + y_2\left(-\varsigma K_{11} + \varsigma K_{12} -\sum_{i=3}^N\alpha_iy_iK_{1i} + \sum_{i=3}^N\alpha_iy_iK_{2i} + y_1 - y_2\right) \tag{34}
$$
Some convenient notation:
1. The model's prediction for $x$:
$$
g(x) = \sum_{i=1}^N\alpha_iy_iK(x_i, x) + b \tag{35}
$$
2. The prediction minus the true label:
$$
E_i = g(x_i) - y_i = \left(\sum_{j=1}^N\alpha_jy_jK(x_j, x_i) + b\right) - y_i, \quad i=1,2 \tag{36}
$$
3. The hard-to-handle chunk in (34):
$$
v_i = \sum_{j=3}^N\alpha_jy_jK_{ij} = g(x_i) - \sum_{j=1}^2\alpha_jy_jK_{ij} - b, \quad i=1,2 \tag{37}
$$
$$
=E_i +y_i - \sum_{j=1}^2\alpha_jy_jK_{ij} - b, \quad i=1,2 \tag{38}
$$
Also note that $\alpha_1y_1+\alpha_2y_2 = \sum_{i=1}^2\alpha_iy_i = \varsigma$.
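The identity in (37) is easy to sanity-check numerically with a linear kernel (random toy data; the indices shift by one because Python is 0-based, so the sum over $j=3,\dots,N$ becomes a slice from index 2):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
X = rng.normal(size=(N, 2))
y = rng.choice([-1.0, 1.0], size=N)
alpha = rng.uniform(0, 1, size=N)
b = 0.3
K = X @ X.T                                  # linear kernel matrix

def g(i):
    # eq. (35): the model's prediction for x_i
    return np.sum(alpha * y * K[:, i]) + b

for i in (0, 1):
    v_direct = np.sum(alpha[2:] * y[2:] * K[2:, i])         # sum over j = 3..N
    v_via_g = g(i) - np.sum(alpha[:2] * y[:2] * K[:2, i]) - b
    assert np.isclose(v_direct, v_via_g)
print("identity (37) holds")
```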
Setting (34) to zero and substituting the notation above:
$$
(K_{11}+K_{22}-2K_{12})\alpha_2^{new,unc} = y_2\left(\varsigma K_{11} - \varsigma K_{12} +\sum_{i=3}^N\alpha_iy_iK_{1i} - \sum_{i=3}^N\alpha_iy_iK_{2i} - y_1 + y_2\right) \tag{39}
$$
$$
= y_2\left(\sum_{i=1}^2\alpha_iy_iK_{11} - \sum_{i=1}^2\alpha_iy_iK_{12} + v_1 - v_2 - y_1 + y_2\right) \tag{40}
$$
$$
= y_2\left(\sum_{i=1}^2\alpha_iy_iK_{11} - \sum_{i=1}^2\alpha_iy_iK_{12} + E_1 + y_1 -\sum_{i=1}^2\alpha_iy_iK_{1i} - b - E_2 - y_2 +\sum_{i=1}^2\alpha_iy_iK_{2i} + b - y_1 + y_2 \right) \tag{41}
$$
$$
= y_2(E_1 - E_2 + \alpha_2y_2K_{11} - 2\alpha_2y_2K_{12}+\alpha_2y_2K_{22}) \tag{42}
$$
$$
= (K_{11} - 2K_{12} + K_{22})\alpha_2 + y_2(E_1- E_2) \tag{43}
$$
1.3 The update formula for $\alpha_2^{new,unc}$
From (43) (the $\alpha_2$ on its right-hand side is the current value, written $\alpha_2^{old}$):
$$
\alpha_2^{new, unc} = \alpha_2^{old} + \frac{y_2(E_1-E_2)}{K_{11}-2K_{12}+K_{22}} \tag{44}
$$
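Update (44) can be cross-checked against a brute-force grid minimization of $W(\alpha_2)$ from (32) (random toy data, linear kernel; the clipping of the next subsection is deliberately ignored here):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
X = rng.normal(size=(N, 2))
y = rng.choice([-1.0, 1.0], size=N)
alpha = rng.uniform(0.1, 0.9, size=N)
b = 0.0
K = X @ X.T

zeta = alpha[0] * y[0] + alpha[1] * y[1]        # eq. (30)

def W(a2):
    # eq. (32), with alpha_1 eliminated via alpha_1 = (zeta - a2*y2)*y1
    a1 = (zeta - a2 * y[1]) * y[0]
    v1 = np.sum(alpha[2:] * y[2:] * K[2:, 0])
    v2 = np.sum(alpha[2:] * y[2:] * K[2:, 1])
    return (0.5 * a1**2 * K[0, 0] + 0.5 * a2**2 * K[1, 1]
            + a1 * a2 * y[0] * y[1] * K[0, 1]
            + a1 * y[0] * v1 + a2 * y[1] * v2 - a1 - a2)

def g(i):
    return np.sum(alpha * y * K[:, i]) + b      # eq. (35)

E1, E2 = g(0) - y[0], g(1) - y[1]               # eq. (36)
eta = K[0, 0] + K[1, 1] - 2 * K[0, 1]
a2_formula = alpha[1] + y[1] * (E1 - E2) / eta  # eq. (44)

# brute force: scan a fine grid around the closed-form answer
grid = np.linspace(a2_formula - 2, a2_formula + 2, 400001)
a2_brute = grid[np.argmin(W(grid))]
assert abs(a2_formula - a2_brute) < 1e-4
```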
1.4 Clip $\alpha_2^{new,unc}$ to its feasible range to obtain $\alpha_2^{new}$
The superscript unc means unclipped. Let us now work out the feasible range of $\alpha_2$:
The only constraints on $\alpha_1, \alpha_2$ are the two conditions (30) and (31).
We split into cases according to $y_1, y_2$:
- Case $y_1 = y_2$:
From (30), let $\alpha_1 +\alpha_2 = k$. From (31):
$$
\left\{ \begin{aligned} 0 & \le \alpha_2 \le C \\ 0 & \le k-\alpha_2 \le C \end{aligned} \right. \Rightarrow \left\{ \begin{aligned} & 0 \le \alpha_2 \le C \\ & k-C \le \alpha_2 \le k \end{aligned} \right. \Rightarrow \left\{ \begin{aligned} & 0 \le \alpha_2 \le C \\ & \alpha_1^{old} +\alpha_2^{old}-C \le \alpha_2 \le \alpha_1^{old} +\alpha_2^{old} \end{aligned} \right.
$$
Writing $H$ for the upper bound of $\alpha_2$ and $L$ for the lower bound:
$$
L = \max(0, \alpha_1^{old} +\alpha_2^{old}-C) \tag{45}
$$
$$
H = \min(C, \alpha_1^{old} +\alpha_2^{old}) \tag{46}
$$
- Case $y_1 \neq y_2$:
From (30), let $\alpha_1 - \alpha_2 = k$. From (31):
$$
\left\{ \begin{aligned} 0 & \le \alpha_2 \le C \\ 0 & \le \alpha_2+k \le C \end{aligned} \right. \Rightarrow \left\{ \begin{aligned} 0 & \le \alpha_2 \le C \\ -k & \le \alpha_2 \le C-k \end{aligned} \right. \Rightarrow \left\{ \begin{aligned} 0 & \le \alpha_2 \le C \\ \alpha_2^{old} - \alpha_1^{old} & \le \alpha_2 \le C+\alpha_2^{old} - \alpha_1^{old} \end{aligned} \right.
$$
giving the new bounds:
$$
L = \max(0, \alpha_2^{old} - \alpha_1^{old}) \tag{47}
$$
$$
H = \min(C, C+\alpha_2^{old} - \alpha_1^{old}) \tag{48}
$$
The clipping rule:
$$
\alpha_2^{new}=\left\{ \begin{aligned} &H, & \alpha_2^{new,unc} > H \\ &\alpha_2^{new,unc}, & L \le \alpha_2^{new,unc} \le H \\ &L, & \alpha_2^{new,unc} < L \end{aligned} \right.
$$
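In numpy this clipping rule is essentially `np.clip` (a trivial sketch; `clip_alpha2` is a hypothetical helper name):

```python
import numpy as np

def clip_alpha2(alpha2_new_unc, L, H):
    # clamp the unconstrained solution into the feasible interval [L, H]
    return float(np.clip(alpha2_new_unc, L, H))

print(clip_alpha2(1.7, 0.0, 1.0))   # → 1.0
print(clip_alpha2(-0.2, 0.0, 1.0))  # → 0.0
print(clip_alpha2(0.4, 0.0, 1.0))   # → 0.4
```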
1.5 Obtain $\alpha_1^{new}$ from the equality constraint
From (30):
$$
\alpha_1^{old}y_1 + \alpha_2^{old}y_2 = \alpha_1^{new}y_1 + \alpha_2^{new}y_2 \tag{49}
$$
Hence:
$$
\alpha_1^{new} = (\alpha_1^{old}y_1 + \alpha_2^{old}y_2-\alpha_2^{new}y_2)y_1 \tag{50}
$$
$$
= \alpha_1^{old} + (\alpha_2^{old}-\alpha_2^{new})y_1y_2 \tag{51}
$$
2. Use the updates of these two $\alpha$ components to update the other quantities
2.1 Compute the threshold $b$
Principle: use a support vector, i.e. a point exactly on the margin boundary (so $0 < \alpha_i < C$ and $y_ig(x_i) = 1$):
$$
b = y_i -\sum_{j=1}^N\alpha_jy_jK_{ij} \tag{52}
$$
If $\alpha_1$ satisfies this condition, then
$$
b_1^{new} = y_1 - \sum_{i=3}^N\alpha_iy_iK_{1i} - \alpha_1^{new}y_1K_{11}-\alpha_2^{new}y_2K_{12} \tag{53}
$$
Part of this formula can be replaced using $E_1$:
$$
E_1 = g(x_1) - y_1 = \sum_{i=3}^N\alpha_iy_iK_{1i} + \alpha_1^{old}y_1K_{11} +\alpha_2^{old}y_2K_{21} + b^{old} - y_1 \tag{54}
$$
Combining (53) and (54) to introduce $E_1$ gives:
$$
b_1^{new} = -E_1+y_1K_{11}(\alpha_1^{old}-\alpha_1^{new}) + y_2K_{12}(\alpha_2^{old}-\alpha_2^{new}) + b^{old} \tag{55}
$$
Caching $E_i$ at each step greatly simplifies this computation.
Similarly, if $0 < \alpha_2^{new} <C$, then:
$$
b_2^{new} = -E_2 + y_1K_{12}(\alpha_1^{old}-\alpha_1^{new}) + y_2K_{22}(\alpha_2^{old}-\alpha_2^{new}) +b^{old} \tag{56}
$$
The final value of $b^{new}$:
- If $0<\alpha_1<C$ and $0<\alpha_2<C$: $b^{new} = b_1^{new} = b_2^{new}$ (both $x_1$ and $x_2$ lie on the margin boundary).
- If exactly one $0<\alpha_i<C$, $i \in \{1,2\}$: $b^{new} = b_i^{new}$.
- If $\alpha_1,\alpha_2 \in \{0,C\}$: $b^{new} = \frac{1}{2}(b_1^{new}+b_2^{new})$. ($\alpha_i=0$ means $x_i$ is not a support vector and $y_ig(x_i) \ge 1$, i.e. $x_i$ is on the correctly classified side of the margin; $\alpha_i=C$ means $y_ig(x_i) \le 1$. All of this follows from the KKT conditions (20)–(28) and is derived again below.)
2.2 Update $E_i$ to prepare the next computation of $b$
$$
E_i^{new} = \sum_{j \in S}\alpha_jy_jK_{ij} + b^{new} - y_i \tag{57}
$$
where $S$ is the set of indices $j$ with $\alpha_j > 0$, i.e. the support vectors.
3. The strategy for choosing the $\alpha$ components
3.1 Choose $\alpha_1$ by whether the KKT conditions hold
Since the converged optimum satisfies the KKT conditions, first choose the $\alpha$ component that violates them most. From (20)–(28):
- $\alpha_i = 0$:
(1) From $C-\alpha_i^*-\mu_i^*=0$: $\mu_i^*=C > 0$.
(2) From $\mu_i^*\xi_i^* = 0$: $\xi_i^*=0$.
(3) From $y_i(w^* \cdot x_i+b^*) \ge 1 - \xi_i^*$: $y_i(w^* \cdot x_i +b^*) \ge 1$.
(4) In summary, $$\alpha_i = 0 \Leftrightarrow y_ig(x_i) \ge 1 \tag{58}$$
- $0 < \alpha_i <C$:
(1) From $C-\alpha_i^*-\mu_i^*=0$: $\mu_i^*> 0$.
(2) From $\mu_i^*\xi_i^* = 0$: $\xi_i^*=0$.
(3) From $\alpha_i^*(y_i(w^* \cdot x_i+b^*)-1+\xi_i^*) = 0$ and the line above: $y_i(w^* \cdot x_i+b^*) - 1=0$.
(4) In summary, $$0 < \alpha_i < C \Leftrightarrow y_ig(x_i) = 1 \tag{59}$$
- $\alpha_i = C$:
(1) From $C-\alpha_i^*-\mu_i^*=0$: $\mu_i^*= 0$.
(2) From $\mu_i^* = 0$, $\mu_i^*\xi_i^*=0$ and $\xi_i^* \ge 0$: $\xi_i^*$ is only constrained to $\xi_i^* \ge 0$.
(3) From $\alpha_i^*(y_i(w^* \cdot x_i+b^*)-1+\xi_i^*) = 0$ and $\alpha_i=C>0$: $y_i(w^* \cdot x_i + b^*) - 1 +\xi_i^* = 0$.
(4) From (2) and (3): $y_i(w^* \cdot x_i+b^*) \le 1$.
(5) In summary, $$\alpha_i=C \Leftrightarrow y_ig(x_i) \le 1 \tag{60}$$
A note: floating-point precision makes a direct "==" comparison unreliable on a computer, so all the KKT checks above should be performed within a tolerance $\epsilon$.
Selection algorithm: first go through the support-vector points with $0 < \alpha_i < C$ and check whether they satisfy (59). If one violates it, select it. If all satisfy it, traverse the whole training set and check the KKT conditions; if every point satisfies them, the stopping condition is met.
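A sketch of this selection rule; `select_alpha1` and its arguments are hypothetical stand-ins for the trainer's state, and `satisfy_KKT` is assumed to be a callable implementing checks (58)–(60):

```python
def select_alpha1(alpha, C, satisfy_KKT, eps=1e-6):
    """Return the index of the first KKT-violating component, or None if all satisfy.

    First scan the boundary support vectors (0 < alpha_i < C), then the rest.
    """
    sv = [i for i in range(len(alpha)) if eps < alpha[i] < C - eps]
    rest = [i for i in range(len(alpha)) if i not in sv]
    for i in sv + rest:
        if not satisfy_KKT(i):
            return i
    return None  # stopping condition: every point satisfies the KKT conditions

# toy usage: pretend only index 2 violates the KKT conditions
alpha = [0.0, 0.5, 1.0, 0.2]
idx = select_alpha1(alpha, C=1.0, satisfy_KKT=lambda i: i != 2)
print(idx)  # → 2
```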
3.2 Choose $\alpha_2$ to maximize the change in $\alpha_2$
By (44), the change in $\alpha_2$ is linear in $E_1 - E_2$, so the strategy for choosing $\alpha_2$ is:
scan for the $E_2$ that maximizes $|E_1 - E_2|$; the corresponding $\alpha$ component becomes $\alpha_2$.
As you can see, $E$ plays a huge role in updating $\alpha_2$, computing the threshold $b$, and selecting $\alpha_2$.
However, this simple heuristic sometimes fails to find a point that makes the objective (17) decrease enough. What then? We can only traverse the points on the margin boundary in turn and see whether any of them yields sufficient descent.
Still nothing? Then give up on this $\alpha_1$ and reselect.
So at this point I noticed that SMO can involve backtracking.
4. A brief summary of the whole SMO flow
1. Choose a component of $\alpha$ as $\alpha_1$ according to KKT violation; if the stopping condition is met, the algorithm ends.
2. Choose $\alpha_2$ by maximizing the resulting change; backtracking to step 1 may happen here.
3. Use $E_1, E_2$ etc., i.e. (44), to get $\alpha_2^{new, unc}$.
4. Clip $\alpha_2^{new, unc}$ to its feasible range to get $\alpha_2^{new}$.
5. Then get $\alpha_1^{new}$ from (51).
6. Use $E_i$ with (55), (56) to get $b_1^{new}, b_2^{new}$.
7. Set $b^{new}$ according to how $\alpha_1, \alpha_2$ compare with $0$ and $C$.
8. Update $E_1, E_2$ to prepare for the next round.
Loop over these steps until the specified number of rounds is reached or the stopping condition is met.
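Putting the eight steps together, here is a minimal self-contained SMO trainer written from the formulas above (linear kernel, a full refresh of E each round, no backtracking). It is a simplified sketch, not the author's GitHub implementation:

```python
import numpy as np

class TinySMO:
    """Minimal SMO for a soft-margin SVM with a linear kernel (illustrative only)."""

    def __init__(self, C=1.0, eps=1e-6, max_iter=200):
        self.C, self.eps, self.max_iter = C, eps, max_iter

    def _g(self, i):
        return np.sum(self.alpha * self.Y * self.K[:, i]) + self.b  # eq. (35)

    def _satisfy_KKT(self, i):
        t = self.Y[i] * self._g(i)
        if self.alpha[i] < self.eps:              # eq. (58)
            return t >= 1 - self.eps
        if self.alpha[i] > self.C - self.eps:     # eq. (60)
            return t <= 1 + self.eps
        return abs(t - 1) < self.eps              # eq. (59)

    def fit(self, X, Y):
        n = len(Y)
        self.X, self.Y, self.K = X, Y, X @ X.T
        self.alpha, self.b = np.zeros(n), 0.0
        self.E = np.array([self._g(i) - Y[i] for i in range(n)])
        for _ in range(self.max_iter):
            # step 1: pick alpha_1 = a KKT violator, boundary SVs first
            order = sorted(range(n),
                           key=lambda i: not (self.eps < self.alpha[i] < self.C - self.eps))
            i1 = next((i for i in order if not self._satisfy_KKT(i)), None)
            if i1 is None:
                break  # stopping condition met
            # step 2: pick alpha_2 maximizing |E1 - E2|
            i2 = max((j for j in range(n) if j != i1),
                     key=lambda j: abs(self.E[i1] - self.E[j]))
            # steps 3-4: unclipped update (44), then clip by (45)-(48)
            eta = self.K[i1, i1] + self.K[i2, i2] - 2 * self.K[i1, i2]
            if eta < self.eps:
                continue
            a2 = self.alpha[i2] + Y[i2] * (self.E[i1] - self.E[i2]) / eta
            if Y[i1] == Y[i2]:
                L = max(0.0, self.alpha[i2] + self.alpha[i1] - self.C)
                H = min(self.C, self.alpha[i2] + self.alpha[i1])
            else:
                L = max(0.0, self.alpha[i2] - self.alpha[i1])
                H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])
            a2 = min(H, max(L, a2))
            # step 5: eq. (51)
            a1 = self.alpha[i1] + Y[i1] * Y[i2] * (self.alpha[i2] - a2)
            # step 6: thresholds, eqs. (55)-(56), using the old alphas
            b1 = -self.E[i1] - Y[i1] * self.K[i1, i1] * (a1 - self.alpha[i1]) \
                 - Y[i2] * self.K[i2, i1] * (a2 - self.alpha[i2]) + self.b
            b2 = -self.E[i2] - Y[i1] * self.K[i1, i2] * (a1 - self.alpha[i1]) \
                 - Y[i2] * self.K[i2, i2] * (a2 - self.alpha[i2]) + self.b
            self.alpha[i1], self.alpha[i2] = a1, a2
            # step 7: final b by the case analysis after eq. (56)
            if self.eps < a1 < self.C - self.eps:
                self.b = b1
            elif self.eps < a2 < self.C - self.eps:
                self.b = b2
            else:
                self.b = (b1 + b2) / 2
            # step 8: refresh all E (simplest correct choice for a tiny sketch)
            self.E = np.array([self._g(i) - Y[i] for i in range(n)])
        return self

    def predict(self, x):
        return np.sign(np.sum(self.alpha * self.Y * (self.X @ x)) + self.b)

# usage on a tiny linearly separable set
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
model = TinySMO(C=10.0).fit(X, Y)
print([model.predict(x) for x in X])  # should recover the training labels
```

On the four points above this converges in a couple of passes; with a nonlinear kernel you would swap `X @ X.T` and the dot products in `predict` for kernel evaluations.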
Part Three. Implementing the core steps in Python, with code-snippet walkthrough
The walkthrough roughly follows the algorithm flow above.
1. Check the KKT conditions
```python
# check whether alpha[i] satisfies the KKT conditions
def _satisfy_KKT_(self, i):
    tmp = self.Y[i] * self._g_(i)
    # eq. (58): alpha_i = 0
    if abs(self.alpha[i]) < self.epsilon:  # epsilon is the tolerance for float comparison
        return tmp >= 1
    # eq. (60): alpha_i = C
    elif abs(self.alpha[i] - self.C) < self.epsilon:
        return tmp <= 1
    # eq. (59): 0 < alpha_i < C
    else:
        return abs(tmp - 1) < self.epsilon
```
2. Find $\alpha_2$
```python
imax = (0, 0)  # stores (|E1 - E2|, index of alpha_2)
E1 = self.E[i]
alpha_1_index.remove(i)
# find the alpha_2 that maximizes |E1 - E2|
for j in alpha_1_index:
    E2 = self.E[j]
    if abs(E1 - E2) > imax[0]:
        imax = (abs(E1 - E2), j)
return i, imax[1]
```
3. Compute $\alpha_2^{new,unc}$, clip it to get $\alpha_2^{new}$, then get $\alpha_1^{new}$
```python
E1, E2 = self.E[i1], self.E[i2]
# eta is the denominator of eq. (44)  (7.107 in 《统计学习方法》)
eta = self._K_(i1, i1) + self._K_(i2, i2) - 2 * self._K_(i1, i2)
# eq. (44)  (7.106)
alpha2_new_unc = self.alpha[i2] + self.Y[i2] * (E1 - E2) / eta
# clip to [L, H], eqs. (45)-(48)
if self.Y[i1] == self.Y[i2]:
    L = max(0, self.alpha[i2] + self.alpha[i1] - self.C)
    H = min(self.C, self.alpha[i2] + self.alpha[i1])
else:
    L = max(0, self.alpha[i2] - self.alpha[i1])
    H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])
alpha2_new = H if alpha2_new_unc > H else L if alpha2_new_unc < L else alpha2_new_unc  # (7.108)
# eq. (51)
alpha1_new = self.alpha[i1] + self.Y[i1] * self.Y[i2] * (self.alpha[i2] - alpha2_new)
```
4. Compute the new threshold $b^{new}$
```python
# eq. (55)
b1_new = -E1 - self.Y[i1] * self._K_(i1, i1) * (alpha1_new - self.alpha[i1]) \
         - self.Y[i2] * self._K_(i2, i1) * (alpha2_new - self.alpha[i2]) + self.b
# eq. (56)
b2_new = -E2 - self.Y[i1] * self._K_(i1, i2) * (alpha1_new - self.alpha[i1]) \
         - self.Y[i2] * self._K_(i2, i2) * (alpha2_new - self.alpha[i2]) + self.b
# the case analysis of b after eq. (56)
if 0 < alpha1_new < self.C:
    self.b = b1_new
elif 0 < alpha2_new < self.C:
    self.b = b2_new
else:
    self.b = (b1_new + b2_new) / 2
```
5. Update $E_1, E_2$
```python
# eq. (35): the model's prediction for x_i
def _g_(self, i):
    K = np.array([self._K_(j, i) for j in range(self.m)])
    return np.dot(self.alpha * self.Y, K) + self.b

# eq. (36): difference between the prediction for x_i and y_i
def _E_(self, i):
    return self._g_(i) - self.Y[i]

# refresh E_1, E_2 for alpha_1, alpha_2
self.E[i1] = self._E_(i1)
self.E[i2] = self._E_(i2)
```
The complete code is on my GitHub.
I am well aware of my limited learning; please point out anything improper.
References:
《统计学习方法》
《理解SVM的三重境界》
《机器学习》