1. Linear classifier
1.1 Separable situation
(i) SVM model
Given a data set $\{(X^{(i)}, y^{(i)})\}_{i=1}^N$ with $X^{(i)} \in \mathbb{R}^p$ and $y^{(i)} \in \{-1, 1\}$, when the two classes are separable, SVM seeks the linear classifier that creates the largest margin.
Then we can immediately write down the optimization problem of maximizing the geometric margin $\gamma$,
$$\max_{\omega, b}\ \gamma \quad \text{s.t.}\quad y^{(i)}\Big(\tfrac{\omega}{\|\omega\|_2}\cdot X^{(i)}+\tfrac{b}{\|\omega\|_2}\Big)\ \ge\ \gamma,\quad i=1,\dots,N,$$
which can be rewritten in terms of the functional margin $\hat\gamma = \gamma\|\omega\|_2$ as
$$\max_{\omega, b}\ \frac{\hat\gamma}{\|\omega\|_2} \quad \text{s.t.}\quad y^{(i)}(\omega\cdot X^{(i)}+b)\ \ge\ \hat\gamma,\quad i=1,\dots,N.$$
Note that if $(\omega, b)$ is scaled to $(\lambda\omega, \lambda b)$, the functional margin $\hat\gamma$ scales to $\lambda\hat\gamma$, so neither the constraints nor the objective $\hat\gamma/\|\omega\|_2$ changes; the optimization problem is invariant to this scaling. We can therefore choose the scaling that makes the functional margin equal to $1$ (and the corresponding geometric margin equal to $1/\|\omega\|_2$), and the optimization problem becomes
$$\min_{\omega, b}\ \frac{1}{2}\|\omega\|_2^2 \quad \text{s.t.}\quad y^{(i)}(\omega\cdot X^{(i)}+b)\ \ge\ 1,\quad i=1,\dots,N,$$
which is a convex optimization problem.
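This quadratic program can be handed directly to a generic convex solver. Below is a minimal sketch using cvxpy on a tiny made-up separable data set; the solver choice, the data, and the variable names are assumptions for illustration only.

```python
import numpy as np
import cvxpy as cp

# Tiny hypothetical separable data set: rows of X are the points X^(i), y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()
# Hard-margin primal: minimize (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```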
(ii) Algorithm
Now I will summarize the steps for solving the SVM model. The idea is the Lagrange multiplier method. The Lagrange function of the last problem is
$$L(\omega, b, \alpha) = \frac{1}{2}\|\omega\|_2^2 - \sum_{i=1}^N \alpha_i\left[y^{(i)}(\omega\cdot X^{(i)}+b) - 1\right], \qquad \alpha_i \ge 0.$$
The primal problem can be written as
$$\min_{\omega, b}\ \max_{\alpha \ge 0}\ L(\omega, b, \alpha),$$
and the dual problem is
$$\max_{\alpha \ge 0}\ \min_{\omega, b}\ L(\omega, b, \alpha).$$
From now on we focus on solving the dual problem rather than the primal one. First, we need to solve $\min_{\omega, b} L(\omega, b, \alpha)$ for every fixed $\alpha \ge 0$. Taking the derivatives w.r.t. $(\omega, b)$ and setting them to zero, we get
$$\omega = \sum_{i=1}^N \alpha_i y^{(i)} X^{(i)}, \qquad \sum_{i=1}^N \alpha_i y^{(i)} = 0.$$
Substituting these two equations back into the Lagrange function, we get the final dual optimization problem
$$\max_{\alpha}\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)}y^{(j)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y^{(i)} = 0,\ \ \alpha_i \ge 0.$$
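The dual is also a quadratic program, so the same kind of solver applies. Below is a sketch, again assuming cvxpy and the toy data from the previous snippet; the quadratic term is written as a squared norm of $\sum_i \alpha_i y^{(i)} X^{(i)}$ so the solver accepts it directly.

```python
import numpy as np
import cvxpy as cp

# Toy data as before (an assumption, not part of the derivation).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

alpha = cp.Variable(N)
# sum_i alpha_i - (1/2) || sum_i alpha_i y_i X^(i) ||^2 equals the dual objective.
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()
print(alpha.value)
```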
Assume we have obtained the optimal solution $\alpha^*$ of the dual problem (the exact method is given in the SMO section, 2.2). Then the KKT conditions are
- (feasibility) $\alpha_i^* \ge 0$ and $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) \ge 1$;
- (complementary slackness) $\alpha_i^*\left[y^{(i)}(\omega^*\cdot X^{(i)}+b^*) - 1\right] = 0$;
- (stationarity) $(\omega^*, b^*) = \arg\min_{\omega, b} L(\omega, b, \alpha^*)$.
Then from the derivative of $L$ w.r.t. $\omega$ given above, we have
$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)},$$
and from the complementary slackness condition, for any $j$ with $\alpha_j^* > 0$, we have
$$b^* = y^{(j)} - \sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X^{(j)}).$$
Theorem: The optimal solution of the primal problem has the form
$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)}, \qquad b^* = y^{(j)} - \sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{for any } j \text{ with } \alpha_j^* > 0.$$
And the prediction function is given by
$$f(X) = \operatorname{sign}\!\left(\sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X) + b^*\right).$$
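In code, once a dual solver has produced $\alpha^*$, the theorem translates into a few lines. This is a hypothetical helper (the names are assumptions); it picks any index with $\alpha_j^* > 0$ to compute $b^*$.

```python
import numpy as np

def primal_from_dual(alpha, X, y):
    """Recover (w*, b*) and the prediction rule from a dual solution alpha."""
    w = (alpha * y) @ X                 # w* = sum_i alpha_i y_i X^(i)
    j = int(np.argmax(alpha))           # any index with alpha_j > 0 works
    b = y[j] - X[j] @ w                 # b* = y^(j) - sum_i alpha_i y_i (X^(i) . X^(j))
    predict = lambda x: np.sign(x @ w + b)
    return w, b, predict
```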
1.2 Non-separable situation
(i) SVM model
Requiring the data set to be linearly separable is a strict condition. Here I will describe how to handle a data set that is not linearly separable.
We introduce a slack variable $\xi_i \ge 0$ for every example $(X^{(i)}, y^{(i)})$ and define the optimization problem as
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|_2^2 + C\sum_{i=1}^N \xi_i \quad \text{s.t.}\quad y^{(i)}(\omega\cdot X^{(i)}+b)\ \ge\ 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,N,$$
where $C > 0$ is the penalty parameter.
(ii) Algorithm
All steps are the same as in the separable situation, and we get the dual optimization problem
$$\max_{\alpha}\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)}y^{(j)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y^{(i)} = 0,\ \ 0 \le \alpha_i \le C.$$
Assuming we have the optimal solution $\alpha^*$ of this dual problem, we have:
Theorem: The optimal solutions for the non-separable situation are given by
$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)}, \qquad b^* = y^{(j)} - \sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{for any } j \text{ with } 0 < \alpha_j^* < C.$$
And the prediction function is given by
$$f(X) = \operatorname{sign}\!\left(\sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X) + b^*\right).$$
Note that in fact there exists a multiplier $\beta_i$ for every $\xi_i$ such that $\beta_i \ge 0$, $\beta_i\xi_i = 0$ and $\alpha_i + \beta_i = C$. So from the KKT conditions we have (a small helper illustrating these cases follows the list):
- When $0 < \alpha_i^* < C$: $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) = 1$.
- When $\alpha_i^* = C$: $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) \le 1$.
- When $\alpha_i^* = 0$: $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) \ge 1$.
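These three cases are easy to check numerically once $\alpha^*$ is known. The helper below is only a sketch (array names, tolerance, and the category names are assumptions): it splits the training indices according to the conditions listed above.

```python
import numpy as np

def categorize_by_kkt(alpha, C, tol=1e-8):
    """Split training indices by the soft-margin KKT cases above."""
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]   # y f(x) = 1
    inside_or_wrong = np.where(alpha >= C - tol)[0]              # y f(x) <= 1
    outside_margin = np.where(alpha <= tol)[0]                   # y f(x) >= 1
    return on_margin, inside_or_wrong, outside_margin
```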
2. Nonlinear classifier
2.1 SVM model via kernel function
Note that both the optimization problem of SVM and the prediction function in the last section depend on the data only through the inner products $X^{(i)}\cdot X^{(j)}$. So we can use the kernel trick to generalize the SVM model to a nonlinear classifier.
The core idea of the kernel is to transform $X^{(i)}$ into a higher-dimensional space with a feature map $\phi$; if the transformed data set $\phi(X^{(i)})$ is linearly separable in that space, we can build the SVM model on $\phi(X^{(i)})$ as described in the last section. To be more efficient, we use a symmetric function $K(x, y)$ (called a kernel) to replace the high-dimensional inner product $\phi(x)\cdot\phi(y)$.
The following theorem gives the necessary and sufficient condition for a valid kernel function.
Theorem: A symmetric function $K(x, y)$ defined on $S$ is a valid kernel function iff for every $m$ and every $X^{(1)}, \dots, X^{(m)} \in S$, the symmetric matrix with entries $K_{ij} = K(X^{(i)}, X^{(j)})$ is positive semi-definite.
Commonly used kernels (a short code sketch of both follows the list):
- Polynomial kernel function: $K(x, y) = (x\cdot y + \ell)^d$
- Gaussian kernel function: $K(x, y) = \exp\left(-\dfrac{\|x-y\|_2^2}{2\sigma^2}\right)$
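A minimal sketch of these two kernels and of the Gram matrix from the theorem above (the hyperparameter values here are arbitrary examples, not recommendations):

```python
import numpy as np

def polynomial_kernel(x, y, ell=1.0, d=3):
    # K(x, y) = (x . y + ell)^d
    return (x @ y + ell) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    # K_ij = K(X^(i), X^(j)); positive semi-definiteness of this matrix is what
    # the theorem above requires of a valid kernel.
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
```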
Then replacing the inner product $X^{(i)}\cdot X^{(j)}$ by $K_{ij} = K(X^{(i)}, X^{(j)})$, we get the optimization problem for the nonlinear classifier
$$\max_{\alpha}\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)}y^{(j)}\, K(X^{(i)}, X^{(j)}) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y^{(i)} = 0,\ \ 0 \le \alpha_i \le C.$$
And the prediction function is given by
$$f(X) = \operatorname{sign}\!\left(\sum_{i=1}^N \alpha_i^* y^{(i)}\, K(X^{(i)}, X) + b^*\right).$$
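In code, the kernelized prediction rule only needs the training points with $\alpha_i^* > 0$ (the support vectors). A sketch, assuming `alpha` and `b` come from solving the kernelized dual and `kernel` is one of the functions above:

```python
import numpy as np

def kernel_predict(x, X_train, y_train, alpha, b, kernel):
    # f(x) = sign( sum_i alpha_i y_i K(X^(i), x) + b )
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s + b)
```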
2.2 SMO Algorithm
SMO is short for sequential minimal optimization, which is a generalization of the coordinate ascent method.
Generally speaking, SMO maximizes over two variables simultaneously while the others are fixed, whereas coordinate ascent maximizes over only one variable at a time. (Two variables are needed because the equality constraint $\sum_i \alpha_i y^{(i)} = 0$ prevents changing a single $\alpha_i$ on its own.)
If we choose $\alpha_1$ and $\alpha_2$ as the pair to be optimized, we can rewrite the dual problem as an optimization over $\alpha_1$ and $\alpha_2$ only,
$$\max_{\alpha_1, \alpha_2}\ \alpha_1 + \alpha_2 - \frac{1}{2}K_{11}\alpha_1^2 - \frac{1}{2}K_{22}\alpha_2^2 - y^{(1)}y^{(2)}K_{12}\,\alpha_1\alpha_2 - y^{(1)}v_1\alpha_1 - y^{(2)}v_2\alpha_2 \quad \text{s.t.}\quad \alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta,\ \ 0 \le \alpha_1, \alpha_2 \le C,$$
where $v_i = \sum_{j=3}^N \alpha_j y^{(j)} K_{ji}$ and $\zeta = -\sum_{i=3}^N y^{(i)}\alpha_i$.
The constraints restrict $(\alpha_1, \alpha_2)$ to a line segment of slope $\pm 1$ inside the box $[0, C]\times[0, C]$, and the pair stays on that line both before and after the update, which means the following.
If $y^{(1)} = y^{(2)}$, then $\alpha_1^{new} + \alpha_2^{new} = \alpha_1^{old} + \alpha_2^{old}$, so
$$\alpha_2^{new} = \alpha_1^{old} + \alpha_2^{old} - \alpha_1^{new} \in \left[\alpha_1^{old}+\alpha_2^{old}-C,\ \alpha_1^{old}+\alpha_2^{old}\right].$$
Together with $0 \le \alpha_2^{new} \le C$, we have $L \le \alpha_2^{new} \le H$, where
$$L = \max\{0,\ \alpha_1^{old}+\alpha_2^{old}-C\}, \qquad H = \min\{C,\ \alpha_1^{old}+\alpha_2^{old}\}.$$
If $y^{(1)} \ne y^{(2)}$, then $\alpha_2^{new} - \alpha_1^{new} = \alpha_2^{old} - \alpha_1^{old}$, so
$$\alpha_2^{new} = \alpha_2^{old} - \alpha_1^{old} + \alpha_1^{new} \in \left[\alpha_2^{old}-\alpha_1^{old},\ \alpha_2^{old}-\alpha_1^{old}+C\right].$$
Together with $0 \le \alpha_2^{new} \le C$, we have $L \le \alpha_2^{new} \le H$, where
$$L = \max\{0,\ \alpha_2^{old}-\alpha_1^{old}\}, \qquad H = \min\{C,\ C+\alpha_2^{old}-\alpha_1^{old}\}.$$
Then substituting $\alpha_1 = y^{(1)}(\zeta - y^{(2)}\alpha_2)$ (from the equality constraint) into the optimization problem, we get an objective that depends on $\alpha_2$ alone.
First we find the optimal solution without the box constraint. Taking the derivative w.r.t. $\alpha_2$ and setting it to zero, we have
$$(K_{11}+K_{22}-2K_{12})\,\alpha_2 = (K_{11}+K_{22}-2K_{12})\,\alpha_2^{old} + y^{(2)}(E_1 - E_2),$$
where $g(X) = \sum_{i=1}^N \alpha_i^{old} y^{(i)} K(X^{(i)}, X)$ is the prediction value at $X$ and $E_i = g(X^{(i)}) - y^{(i)}$ is the prediction error on $X^{(i)}$.
So, writing $\eta = K_{11}+K_{22}-2K_{12}$, the un-truncated solution is
$$\alpha_2^{new,unc} = \alpha_2^{old} + \frac{y^{(2)}(E_1 - E_2)}{\eta}.$$
Then truncating it to $[L, H]$, we get the optimal solution
$$\alpha_2^{new} = \begin{cases} H, & \alpha_2^{new,unc} > H,\\[2pt] \alpha_2^{new,unc}, & L \le \alpha_2^{new,unc} \le H,\\[2pt] L, & \alpha_2^{new,unc} < L, \end{cases} \qquad \alpha_1^{new} = \alpha_1^{old} + y^{(1)}y^{(2)}\left(\alpha_2^{old} - \alpha_2^{new}\right).$$
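Putting the bounds, the un-truncated step, and the truncation together, a single SMO update on a chosen pair $(i, j)$ can be sketched as below. The pair-selection heuristic and the update of the bias are omitted; the names (`alpha`, `K`, `b`, `C`) are assumptions about the surrounding training loop, and here `g` includes the bias term.

```python
import numpy as np

def smo_step(i, j, alpha, y, K, b, C):
    """One SMO update of the pair (alpha_i, alpha_j); K is the precomputed Gram matrix."""
    g = K @ (alpha * y) + b                    # g(X^(k)) for every training point
    E_i, E_j = g[i] - y[i], g[j] - y[j]        # prediction errors

    # Box bounds L, H for alpha_j, depending on whether the labels agree.
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0 or L >= H:
        return alpha                           # skip degenerate pairs

    alpha = alpha.copy()
    alpha_j_new = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)   # truncate to [L, H]
    alpha[i] += y[i] * y[j] * (alpha[j] - alpha_j_new)                 # keep the equality constraint
    alpha[j] = alpha_j_new
    return alpha
```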
3. Unconstrained SVM - Hinge loss function
Note that, at the optimum of the soft-margin problem,
- $\xi_i = 0 \iff y^{(i)}(\omega\cdot X^{(i)}+b) \ge 1$
- $\xi_i > 0 \iff y^{(i)}(\omega\cdot X^{(i)}+b) < 1$, in which case $\xi_i = 1 - y^{(i)}(\omega\cdot X^{(i)}+b)$

so $\xi_i = \left[1 - y^{(i)}(\omega\cdot X^{(i)}+b)\right]_+$.
We can therefore rewrite SVM as an unconstrained optimization problem:
$$\min_{\omega, b}\ \frac{1}{2}\|\omega\|_2^2 + C\sum_{i=1}^N \left[1 - y^{(i)}(\omega\cdot X^{(i)}+b)\right]_+,$$
which is equivalent to
$$\min_{\omega, b}\ \sum_{i=1}^N \left[1 - y^{(i)}(\omega\cdot X^{(i)}+b)\right]_+ + \lambda\|\omega\|_2^2, \qquad \lambda = \frac{1}{2C}.$$
And the loss function $\ell(z) = [1-z]_+$ is called the hinge loss function; it is a convex surrogate that approximates the 0-1 loss.
This problem can be optimized with the sub-gradient method, as in the sketch below.
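A bare-bones sub-gradient descent sketch for the unconstrained form above (the step size, iteration count, and use of the $C$-weighted objective are assumptions; in practice a decaying step size works better):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=1e-3, n_iter=1000):
    """Minimize (1/2)||w||^2 + C * sum_i [1 - y_i (w . x_i + b)]_+ by sub-gradient descent."""
    N, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                                    # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```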
4. Summary
- SVM is a linear classifier and can be extended to the nonlinear situation using the kernel trick.
- SVM is determined only by the support vectors; changing the position of the other points does not change the SVM.
- From the viewpoint of the dual optimization problem, we use SMO and the KKT conditions to recover the optimal solution of the primal optimization problem.