1. Linear classifier
1.1 Separable situation
(i) SVM model
Given a data set $\{(X^{(i)}, y^{(i)})\}_{i=1}^N$ with $X^{(i)} \in \mathbb{R}^p$ and $y^{(i)} \in \{-1, 1\}$, when the two classes are separable, SVM seeks the linear classifier that creates the largest margin.
Then we can immediately write down the optimization problem of maximizing the geometric margin $\gamma$,
$$\max_{\omega, b}\ \gamma \quad \text{s.t.}\quad y^{(i)}\Big(\tfrac{\omega}{\|\omega\|_2}\cdot X^{(i)}+\tfrac{b}{\|\omega\|_2}\Big)\ \ge\ \gamma,\quad i=1,\dots,N,$$
which can be rewritten in terms of the functional margin $\hat\gamma = \gamma\|\omega\|_2$ as
$$\max_{\omega, b}\ \frac{\hat\gamma}{\|\omega\|_2} \quad \text{s.t.}\quad y^{(i)}(\omega\cdot X^{(i)}+b)\ \ge\ \hat\gamma,\quad i=1,\dots,N.$$
Note that if $(\omega, b)$ is scaled to $(\lambda\omega, \lambda b)$, the functional margin $\hat\gamma$ scales to $\lambda\hat\gamma$, so neither the constraints nor the objective $\hat\gamma/\|\omega\|_2$ changes; the optimization problem is invariant to this scaling. We can therefore choose the scaling that makes the functional margin equal to $1$ (and the corresponding geometric margin equal to $1/\|\omega\|_2$), and the optimization problem becomes
$$\min_{\omega, b}\ \frac{1}{2}\|\omega\|_2^2 \quad \text{s.t.}\quad y^{(i)}(\omega\cdot X^{(i)}+b)\ \ge\ 1,\quad i=1,\dots,N,$$
which is a convex optimization problem.
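This quadratic program can be handed directly to a generic convex solver. Below is a minimal sketch using cvxpy on a tiny made-up separable data set; the solver choice, the data, and the variable names are assumptions for illustration only.

```python
import numpy as np
import cvxpy as cp

# Tiny hypothetical separable data set: rows of X are the points X^(i), y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()
# Hard-margin primal: minimize (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```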
(ii) Algorithm
Now I will summarize the steps for solving the SVM model. The idea is the Lagrange multiplier method. The Lagrange function of the last problem is
$$L(\omega, b, \alpha) = \frac{1}{2}\|\omega\|_2^2 - \sum_{i=1}^N \alpha_i\left[y^{(i)}(\omega\cdot X^{(i)}+b) - 1\right], \qquad \alpha_i \ge 0.$$
The primal problem can be written as
$$\min_{\omega, b}\ \max_{\alpha \ge 0}\ L(\omega, b, \alpha),$$
and the dual problem is
$$\max_{\alpha \ge 0}\ \min_{\omega, b}\ L(\omega, b, \alpha).$$
From now on we focus on solving the dual problem rather than the primal one. First, we need to solve $\min_{\omega, b} L(\omega, b, \alpha)$ for every fixed $\alpha \ge 0$. Taking the derivatives w.r.t. $(\omega, b)$ and setting them to zero, we get
$$\omega = \sum_{i=1}^N \alpha_i y^{(i)} X^{(i)}, \qquad \sum_{i=1}^N \alpha_i y^{(i)} = 0.$$
Substituting these two equations back into the Lagrange function, we get the final dual optimization problem
$$\max_{\alpha}\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)}y^{(j)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y^{(i)} = 0,\ \ \alpha_i \ge 0.$$
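The dual is also a quadratic program, so the same kind of solver applies. Below is a sketch, again assuming cvxpy and the toy data from the previous snippet; the quadratic term is written as a squared norm of $\sum_i \alpha_i y^{(i)} X^{(i)}$ so the solver accepts it directly.

```python
import numpy as np
import cvxpy as cp

# Toy data as before (an assumption, not part of the derivation).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

alpha = cp.Variable(N)
# sum_i alpha_i - (1/2) || sum_i alpha_i y_i X^(i) ||^2 equals the dual objective.
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()
print(alpha.value)
```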
Assume we have obtained the optimal solution $\alpha^*$ of the dual problem (the exact method is given in the SMO section, 2.2). Then the KKT conditions are
- (feasibility) $\alpha_i^* \ge 0$ and $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) \ge 1$;
- (complementary slackness) $\alpha_i^*\left[y^{(i)}(\omega^*\cdot X^{(i)}+b^*) - 1\right] = 0$;
- (stationarity) $(\omega^*, b^*) = \arg\min_{\omega, b} L(\omega, b, \alpha^*)$.
Then from the derivative of $L$ w.r.t. $\omega$ given above, we have
$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)},$$
and from the complementary slackness condition, for any $j$ with $\alpha_j^* > 0$, we have
$$b^* = y^{(j)} - \sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X^{(j)}).$$
Theorem: The optimal solution of the primal problem has the form
$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)}, \qquad b^* = y^{(j)} - \sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{for any } j \text{ with } \alpha_j^* > 0.$$
And the prediction function is given by
$$f(X) = \operatorname{sign}\!\left(\sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X) + b^*\right).$$
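In code, once a dual solver has produced $\alpha^*$, the theorem translates into a few lines. This is a hypothetical helper (the names are assumptions); it picks any index with $\alpha_j^* > 0$ to compute $b^*$.

```python
import numpy as np

def primal_from_dual(alpha, X, y):
    """Recover (w*, b*) and the prediction rule from a dual solution alpha."""
    w = (alpha * y) @ X                 # w* = sum_i alpha_i y_i X^(i)
    j = int(np.argmax(alpha))           # any index with alpha_j > 0 works
    b = y[j] - X[j] @ w                 # b* = y^(j) - sum_i alpha_i y_i (X^(i) . X^(j))
    predict = lambda x: np.sign(x @ w + b)
    return w, b, predict
```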
1.2 Non-separable situation
(i) SVM model
Requiring the data set to be linearly separable is a strict condition. Here I will describe how to handle a data set that is not linearly separable.
We introduce a slack variable $\xi_i \ge 0$ for every example $(X^{(i)}, y^{(i)})$ and define the optimization problem as
$$\min_{\omega, b, \xi}\ \frac{1}{2}\|\omega\|_2^2 + C\sum_{i=1}^N \xi_i \quad \text{s.t.}\quad y^{(i)}(\omega\cdot X^{(i)}+b)\ \ge\ 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,N,$$
where $C > 0$ is the penalty parameter.
(ii) Algorithm
All steps are the same as in the separable situation, and we get the dual optimization problem
$$\max_{\alpha}\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)}y^{(j)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y^{(i)} = 0,\ \ 0 \le \alpha_i \le C.$$
Assuming we have the optimal solution $\alpha^*$ of this dual problem, we have:
Theorem: The optimal solutions for the non-separable situation are given by
$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)}, \qquad b^* = y^{(j)} - \sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X^{(j)}) \quad \text{for any } j \text{ with } 0 < \alpha_j^* < C.$$
And the prediction function is given by
$$f(X) = \operatorname{sign}\!\left(\sum_{i=1}^N \alpha_i^* y^{(i)}\,(X^{(i)}\cdot X) + b^*\right).$$
Note that in fact there exists a multiplier $\beta_i$ for every $\xi_i$ such that $\beta_i \ge 0$, $\beta_i\xi_i = 0$ and $\alpha_i + \beta_i = C$. So from the KKT conditions we have (a small helper illustrating these cases follows the list):
- When $0 < \alpha_i^* < C$: $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) = 1$.
- When $\alpha_i^* = C$: $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) \le 1$.
- When $\alpha_i^* = 0$: $y^{(i)}(\omega^*\cdot X^{(i)}+b^*) \ge 1$.
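These three cases are easy to check numerically once $\alpha^*$ is known. The helper below is only a sketch (array names, tolerance, and the category names are assumptions): it splits the training indices according to the conditions listed above.

```python
import numpy as np

def categorize_by_kkt(alpha, C, tol=1e-8):
    """Split training indices by the soft-margin KKT cases above."""
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]   # y f(x) = 1
    inside_or_wrong = np.where(alpha >= C - tol)[0]              # y f(x) <= 1
    outside_margin = np.where(alpha <= tol)[0]                   # y f(x) >= 1
    return on_margin, inside_or_wrong, outside_margin
```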
2. Nonlinear classifier
2.1 SVM model via kernel function
Note that both the optimization problem of SVM and the prediction function in the last section depend on the data only through the inner products $X^{(i)}\cdot X^{(j)}$. So we can use the kernel trick to generalize the SVM model to a nonlinear classifier.
The core idea of the kernel is to transform $X^{(i)}$ into a higher-dimensional space with a feature map $\phi$; if the transformed data set $\phi(X^{(i)})$ is linearly separable in that space, we can build the SVM model on $\phi(X^{(i)})$ as described in the last section. To be more efficient, we use a symmetric function $K(x, y)$ (called a kernel) to replace the high-dimensional inner product $\phi(x)\cdot\phi(y)$.
The following theorem gives the necessary and sufficient condition for a valid kernel function.
Theorem: A symmetric function $K(x, y)$ defined on $S$ is a valid kernel function iff for every $m$ and every $X^{(1)}, \dots, X^{(m)} \in S$, the symmetric matrix with entries $K_{ij} = K(X^{(i)}, X^{(j)})$ is positive semi-definite.
Commonly used kernels (a short code sketch of both follows the list):
- Polynomial kernel function: $K(x, y) = (x\cdot y + \ell)^d$
- Gaussian kernel function: $K(x, y) = \exp\left(-\dfrac{\|x-y\|_2^2}{2\sigma^2}\right)$
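A minimal sketch of these two kernels and of the Gram matrix from the theorem above (the hyperparameter values here are arbitrary examples, not recommendations):

```python
import numpy as np

def polynomial_kernel(x, y, ell=1.0, d=3):
    # K(x, y) = (x . y + ell)^d
    return (x @ y + ell) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    # K_ij = K(X^(i), X^(j)); positive semi-definiteness of this matrix is what
    # the theorem above requires of a valid kernel.
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
```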
Then replacing the inner product $X^{(i)}\cdot X^{(j)}$ by $K_{ij} = K(X^{(i)}, X^{(j)})$, we get the optimization problem for the nonlinear classifier
$$\max_{\alpha}\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)}y^{(j)}\, K(X^{(i)}, X^{(j)}) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y^{(i)} = 0,\ \ 0 \le \alpha_i \le C.$$
And the prediction function is given by
$$f(X) = \operatorname{sign}\!\left(\sum_{i=1}^N \alpha_i^* y^{(i)}\, K(X^{(i)}, X) + b^*\right).$$
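In code, the kernelized prediction rule only needs the training points with $\alpha_i^* > 0$ (the support vectors). A sketch, assuming `alpha` and `b` come from solving the kernelized dual and `kernel` is one of the functions above:

```python
import numpy as np

def kernel_predict(x, X_train, y_train, alpha, b, kernel):
    # f(x) = sign( sum_i alpha_i y_i K(X^(i), x) + b )
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s + b)
```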
2.2 SMO Algorithm
SMO is short for sequential minimal optimization, which is a generalization of the coordinate ascent method.
Generally speaking, SMO maximizes over two variables simultaneously while the others are fixed, whereas coordinate ascent maximizes over only one variable at a time. (Two variables are needed because the equality constraint $\sum_i \alpha_i y^{(i)} = 0$ prevents changing a single $\alpha_i$ on its own.)
If we choose $\alpha_1$ and $\alpha_2$ as the pair to be optimized, we can rewrite the dual problem as an optimization over $\alpha_1$ and $\alpha_2$ only,
$$\max_{\alpha_1, \alpha_2}\ \alpha_1 + \alpha_2 - \frac{1}{2}K_{11}\alpha_1^2 - \frac{1}{2}K_{22}\alpha_2^2 - y^{(1)}y^{(2)}K_{12}\,\alpha_1\alpha_2 - y^{(1)}v_1\alpha_1 - y^{(2)}v_2\alpha_2 \quad \text{s.t.}\quad \alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta,\ \ 0 \le \alpha_1, \alpha_2 \le C,$$
where $v_i = \sum_{j=3}^N \alpha_j y^{(j)} K_{ji}$ and $\zeta = -\sum_{i=3}^N y^{(i)}\alpha_i$.
The constraints restrict $(\alpha_1, \alpha_2)$ to a line segment of slope $\pm 1$ inside the box $[0, C]\times[0, C]$, and the pair stays on that line both before and after the update, which means the following.
If $y^{(1)} = y^{(2)}$, then $\alpha_1^{new} + \alpha_2^{new} = \alpha_1^{old} + \alpha_2^{old}$, so
$$\alpha_2^{new} = \alpha_1^{old} + \alpha_2^{old} - \alpha_1^{new} \in \left[\alpha_1^{old}+\alpha_2^{old}-C,\ \alpha_1^{old}+\alpha_2^{old}\right].$$
Together with $0 \le \alpha_2^{new} \le C$, we have $L \le \alpha_2^{new} \le H$, where
$$L = \max\{0,\ \alpha_1^{old}+\alpha_2^{old}-C\}, \qquad H = \min\{C,\ \alpha_1^{old}+\alpha_2^{old}\}.$$
If $y^{(1)} \ne y^{(2)}$, then $\alpha_2^{new} - \alpha_1^{new} = \alpha_2^{old} - \alpha_1^{old}$, so
$$\alpha_2^{new} = \alpha_2^{old} - \alpha_1^{old} + \alpha_1^{new} \in \left[\alpha_2^{old}-\alpha_1^{old},\ \alpha_2^{old}-\alpha_1^{old}+C\right].$$
Together with $0 \le \alpha_2^{new} \le C$, we have $L \le \alpha_2^{new} \le H$, where
$$L = \max\{0,\ \alpha_2^{old}-\alpha_1^{old}\}, \qquad H = \min\{C,\ C+\alpha_2^{old}-\alpha_1^{old}\}.$$
Then substituting $\alpha_1 = y^{(1)}(\zeta - y^{(2)}\alpha_2)$ (from the equality constraint) into the optimization problem, we get an objective that depends on $\alpha_2$ alone.
First we find the optimal solution without the box constraint. Taking the derivative w.r.t. $\alpha_2$ and setting it to zero, we have
$$(K_{11}+K_{22}-2K_{12})\,\alpha_2 = (K_{11}+K_{22}-2K_{12})\,\alpha_2^{old} + y^{(2)}(E_1 - E_2),$$
where $g(X) = \sum_{i=1}^N \alpha_i^{old} y^{(i)} K(X^{(i)}, X)$ is the prediction value at $X$ and $E_i = g(X^{(i)}) - y^{(i)}$ is the prediction error on $X^{(i)}$.
So, writing $\eta = K_{11}+K_{22}-2K_{12}$, the un-truncated solution is
$$\alpha_2^{new,unc} = \alpha_2^{old} + \frac{y^{(2)}(E_1 - E_2)}{\eta}.$$
Then truncating it to $[L, H]$, we get the optimal solution
$$\alpha_2^{new} = \begin{cases} H, & \alpha_2^{new,unc} > H,\\[2pt] \alpha_2^{new,unc}, & L \le \alpha_2^{new,unc} \le H,\\[2pt] L, & \alpha_2^{new,unc} < L, \end{cases} \qquad \alpha_1^{new} = \alpha_1^{old} + y^{(1)}y^{(2)}\left(\alpha_2^{old} - \alpha_2^{new}\right).$$
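Putting the bounds, the un-truncated step, and the truncation together, a single SMO update on a chosen pair $(i, j)$ can be sketched as below. The pair-selection heuristic and the update of the bias are omitted; the names (`alpha`, `K`, `b`, `C`) are assumptions about the surrounding training loop, and here `g` includes the bias term.

```python
import numpy as np

def smo_step(i, j, alpha, y, K, b, C):
    """One SMO update of the pair (alpha_i, alpha_j); K is the precomputed Gram matrix."""
    g = K @ (alpha * y) + b                    # g(X^(k)) for every training point
    E_i, E_j = g[i] - y[i], g[j] - y[j]        # prediction errors

    # Box bounds L, H for alpha_j, depending on whether the labels agree.
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0 or L >= H:
        return alpha                           # skip degenerate pairs

    alpha = alpha.copy()
    alpha_j_new = np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H)   # truncate to [L, H]
    alpha[i] += y[i] * y[j] * (alpha[j] - alpha_j_new)                 # keep the equality constraint
    alpha[j] = alpha_j_new
    return alpha
```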
3. Unconstrained SVM - Hinge loss function
Note that, at the optimum of the soft-margin problem,
- $\xi_i = 0 \iff y^{(i)}(\omega\cdot X^{(i)}+b) \ge 1$
- $\xi_i > 0 \iff y^{(i)}(\omega\cdot X^{(i)}+b) < 1$, in which case $\xi_i = 1 - y^{(i)}(\omega\cdot X^{(i)}+b)$

so $\xi_i = \left[1 - y^{(i)}(\omega\cdot X^{(i)}+b)\right]_+$.
We can therefore rewrite SVM as an unconstrained optimization problem:
$$\min_{\omega, b}\ \frac{1}{2}\|\omega\|_2^2 + C\sum_{i=1}^N \left[1 - y^{(i)}(\omega\cdot X^{(i)}+b)\right]_+,$$
which is equivalent to
$$\min_{\omega, b}\ \sum_{i=1}^N \left[1 - y^{(i)}(\omega\cdot X^{(i)}+b)\right]_+ + \lambda\|\omega\|_2^2, \qquad \lambda = \frac{1}{2C}.$$
And the loss function $\ell(z) = [1-z]_+$ is called the hinge loss function; it is a convex surrogate that approximates the 0-1 loss.
This problem can be optimized with the sub-gradient method, as in the sketch below.
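A bare-bones sub-gradient descent sketch for the unconstrained form above (the step size, iteration count, and use of the $C$-weighted objective are assumptions; in practice a decaying step size works better):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=1e-3, n_iter=1000):
    """Minimize (1/2)||w||^2 + C * sum_i [1 - y_i (w . x_i + b)]_+ by sub-gradient descent."""
    N, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                                    # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```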
4. Summary
- SVM is a linear classifier and can be extended to the nonlinear situation using the kernel trick.
- SVM is determined only by the support vectors; changing the position of the other points does not change the SVM.
- From the viewpoint of the dual optimization problem, we use SMO and the KKT conditions to recover the optimal solution of the primal optimization problem.