[Machine Learning] SVM--support vector machine

This post gives a detailed introduction to the principle and use of the support vector machine (SVM), covering model formulation and solution algorithms for both the linear and the nonlinear classification settings (including the SMO algorithm), and discussing the unconstrained SVM with the hinge loss function.


1. Linear classifier

1.1 Separable situation

(i) SVM model

Given a data set $\{(X^{(i)}, y^{(i)})\}_{i=1}^N$ with $X^{(i)} \in \mathbb{R}^p$ and $y^{(i)} \in \{-1, 1\}$, when the two classes are separable, SVM seeks the linear classifier that creates the largest margin.

Then we can immediately get the optimization problem:

$$\max_{\omega, b} \ C \qquad \text{subject to: } \underbrace{\frac{y^{(i)}(\omega^\top X^{(i)} + b)}{\|\omega\|_2}}_{\text{geometric margin}} \geq C \ \text{ for all } i$$

which can be rewritten as
$$\max_{\omega, b} \ C \qquad \text{subject to: } y^{(i)}(\omega^\top X^{(i)} + b) \geq \underbrace{C\,\|\omega\|_2}_{\gamma:\ \text{functional margin}} \ \text{ for all } i$$

which is equivalent to
$$\max_{\omega, b} \ \frac{\gamma}{\|\omega\|_2} \qquad \text{subject to: } y^{(i)}(\omega^\top X^{(i)} + b) \geq \gamma \ \text{ for all } i$$

Note that if $(\omega, b)$ scales to $(\lambda\omega, \lambda b)$, the functional margin $\gamma$ scales to $\lambda\gamma$. So neither the constraint nor the objective function $\frac{\gamma}{\|\omega\|_2}$ changes, which means this optimization problem is invariant to scaling. We can therefore choose a suitable scaling that makes the functional margin equal to 1 (so the corresponding geometric margin is $\frac{1}{\|\omega\|_2}$), and the optimization problem becomes

$$\min_{\omega, b} \ \frac{1}{2}\|\omega\|_2^2 \qquad \text{subject to: } y^{(i)}(\omega^\top X^{(i)} + b) \geq 1 \ \text{ for all } i$$

which is a convex optimization problem.
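
To make this concrete, here is a minimal sketch that solves the hard-margin primal as a convex program. It assumes the `cvxpy` package and a small toy data set, both of which are my additions and not part of the original derivation.

```python
import cvxpy as cp
import numpy as np

# Toy separable data in R^2 (assumed example, two points per class).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||_2^2  subject to  y_i (w^T x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print("w* =", w.value, " b* =", b.value)
print("geometric margin =", 1.0 / np.linalg.norm(w.value))
```

The printed value $1/\|\omega^*\|_2$ is exactly the geometric margin discussed above.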

(ii) Algorithm

Now I will summarize the steps for solving the SVM model, using the method of Lagrange multipliers. The Lagrange function of the above problem is

$$L(\omega, b, \alpha) = \frac{1}{2}\|\omega\|_2^2 + \sum_{i=1}^N \alpha_i - \sum_{i=1}^N \alpha_i\, y^{(i)}(\omega^\top X^{(i)} + b)$$

And the primal problem can be written as
$$\min_{\omega, b} \max_{\alpha \geq 0} L(\omega, b, \alpha)$$

then the dual problem will be
$$\max_{\alpha \geq 0} \min_{\omega, b} L(\omega, b, \alpha)$$

From now on, we will focus on solving the dual problem rather than the primal problem. First, for fixed $\alpha \geq 0$, we need to solve

$$\min_{\omega, b} L(\omega, b, \alpha)$$

Taking the derivatives w.r.t. $(\omega, b)$ and setting them to zero, we get
$$\frac{\partial L(\omega, b, \alpha)}{\partial \omega} = \omega - \sum_{i=1}^N \alpha_i y^{(i)} X^{(i)} = 0, \qquad \frac{\partial L(\omega, b, \alpha)}{\partial b} = -\sum_{i=1}^N \alpha_i y^{(i)} = 0$$

Then substituting these two equations into the Lagrange function, we get the final dual optimization problem

$$\max_{\alpha} \ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)} y^{(j)}\, X^{(i)\top} X^{(j)} \qquad \text{subject to: } \sum_{i=1}^N \alpha_i y^{(i)} = 0, \ \ \alpha_i \geq 0 \ \text{ for all } i$$

Assume we have obtained the optimal solution $\alpha^*$ of the dual problem (the exact method, SMO, is given in Section 2.2). Then, from the KKT conditions:

  • (feasibility) $\alpha_i^* \geq 0$ and $y^{(i)}(\omega^{*\top} X^{(i)} + b^*) \geq 1$.
  • (complementary slackness) $\alpha_i^*\big[y^{(i)}(\omega^{*\top} X^{(i)} + b^*) - 1\big] = 0$.
  • (stationarity) $(\omega^*, b^*) = \arg\min_{\omega, b} L(\omega, b, \alpha^*)$.

Then from the derivative of ω given above, we have

$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)}$$

and from the complementary slackness condition, for any $j$ with $\alpha_j^* > 0$, we have
$$y^{(j)}(\omega^{*\top} X^{(j)} + b^*) = 1 \ \Longrightarrow \ b^* = y^{(j)} - \omega^{*\top} X^{(j)}$$

Theorem: The optimal solution of the primal problem has the following form

$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)}, \qquad b^* = y^{(j)} - \omega^{*\top} X^{(j)} \ \text{ for any } j \text{ with } \alpha_j^* > 0$$

And the prediction function is given by
$$\hat{y} = \omega^{*\top} X + b^* = \sum_{i=1}^N \alpha_i^* y^{(i)}\, X^{(i)\top} X + b^*$$
(the predicted label is the sign of $\hat{y}$).
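
As a sketch of how this theorem is used in practice, the snippet below solves the dual QP (again with `cvxpy`, which is my choice of solver, not the post's), recovers $(\omega^*, b^*)$ from $\alpha^*$, and forms the linear prediction function.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Dual: max sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
Q = (y[:, None] * X) @ (y[:, None] * X).T       # Q_ij = y_i y_j x_i^T x_j (PSD Gram matrix)
Q += 1e-9 * np.eye(N)                           # tiny ridge for numerical PSD-ness
alpha = cp.Variable(N)
problem = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q)),
                     [alpha >= 0, y @ alpha == 0])
problem.solve()

a = alpha.value
w = (a * y) @ X                                 # w* = sum_i alpha_i y_i x_i
j = int(np.argmax(a))                           # any index with alpha_j > 0
b = y[j] - w @ X[j]                             # b* = y_j - w*^T x_j

print("alpha* =", a.round(4))
print("prediction for [1, 1.5]:", np.sign(w @ np.array([1.0, 1.5]) + b))
```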

1.2 Non-separable situation

(i) SVM model

Requiring the data set to be linearly separable is a strict condition. Here I will talk about how to deal with a data set that is not linearly separable.

Introduce a slack variable $\xi_i \geq 0$ for every example $(X^{(i)}, y^{(i)})$ and define the optimization problem as

$$\min_{\omega, b, \xi} \ \frac{1}{2}\|\omega\|_2^2 + \underbrace{C\sum_{i=1}^N \xi_i}_{\text{penalty of } \xi} \qquad \text{subject to: } y^{(i)}(\omega^\top X^{(i)} + b) \geq 1 - \xi_i \ \text{ for all } i, \quad \xi_i \geq 0 \ \text{ for all } i$$
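
A minimal sketch of this soft-margin primal, again assuming `cvxpy` and a toy data set that is deliberately not separable (one label is placed on the "wrong" side); both are my assumptions for illustration.

```python
import cvxpy as cp
import numpy as np

# Toy non-separable data: the last point sits on the wrong side of the boundary.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, p = X.shape
C = 1.0

w, b, xi = cp.Variable(p), cp.Variable(), cp.Variable(N)
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
    [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0])
problem.solve()

print("w* =", w.value, " b* =", b.value)
print("slacks xi* =", xi.value.round(3))   # nonzero only where the margin is violated
```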

(ii) Algorithm

All steps are the same as in the separable situation; we get the dual optimization problem

$$\max_{\alpha} \ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)} y^{(j)}\, X^{(i)\top} X^{(j)} \qquad \text{subject to: } \sum_{i=1}^N \alpha_i y^{(i)} = 0, \ \ 0 \leq \alpha_i \leq C \ \text{ for all } i$$

Assuming we have obtained the optimal solution of the dual problem, we have
Theorem: The optimal solution for the non-separable situation has the following form

$$\omega^* = \sum_{i=1}^N \alpha_i^* y^{(i)} X^{(i)}, \qquad b^* = y^{(j)} - \omega^{*\top} X^{(j)} \ \text{ for any } j \text{ with } 0 < \alpha_j^* < C$$

And the prediction function is given by
$$\hat{y} = \omega^{*\top} X + b^* = \sum_{i=1}^N \alpha_i^* y^{(i)}\, X^{(i)\top} X + b^*$$

Note that in fact there exists a multiplier $\beta_i$ for every $\xi_i$ such that $\beta_i \geq 0$, $\beta_i\xi_i = 0$ and $\alpha_i + \beta_i = C$. So from the KKT conditions we have (see the sketch after this list):

  • When $0 < \alpha_i^* < C$: $y^{(i)}(\omega^{*\top} X^{(i)} + b^*) = 1$.
  • When $\alpha_i^* = C$: $y^{(i)}(\omega^{*\top} X^{(i)} + b^*) \leq 1$.
  • When $\alpha_i^* = 0$: $y^{(i)}(\omega^{*\top} X^{(i)} + b^*) \geq 1$.
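
One way to see these three cases empirically is to fit a soft-margin linear SVM and inspect the dual coefficients. The sketch below uses scikit-learn's `SVC` (my choice of library, not something the post prescribes); its `dual_coef_` attribute stores $y^{(i)}\alpha_i^*$ for the support vectors only, so $\alpha_i^* = 0$ for every point not listed there.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(20, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())                      # alpha_i of the support vectors
margin = y[clf.support_] * clf.decision_function(X[clf.support_])

for a, m in zip(alpha, margin):
    if a < C - 1e-8:
        print(f"0 < alpha = {a:.3f} < C  ->  margin = {m:.3f} (should be ~1)")
    else:
        print(f"alpha = C = {a:.3f}      ->  margin = {m:.3f} (should be <= 1)")
# Points not in clf.support_ have alpha_i = 0 and margin >= 1.
```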

2. Nonlinear classifier

2.1 SVM model via kernel function

Note that both the SVM optimization problem and the prediction function in the last section depend on the data only through the inner products $X^{(i)\top} X^{(j)}$. So we can use the kernel trick to generalize the SVM model to a nonlinear classifier.

The core idea of the kernel is to transform $X^{(i)}$ into a higher dimensional space using a map $\phi$ such that, in that space, the data set $\phi(X^{(i)})$ is linearly separable; we can then build the SVM model on $\phi(X^{(i)})$ as described in the last section. But to be more efficient, we can use a symmetric function $K(x, y)$ (called a kernel) to replace the inner product $\phi(x)^\top\phi(y)$ in the high dimensional space.

The following theorem gives a necessary and sufficient condition for a valid kernel function.
Theorem: A symmetric function $K(x, y)$ defined on $S$ is a valid kernel function iff for any $m$ and any $X^{(1)}, \ldots, X^{(m)} \in S$, the symmetric matrix $K_{ij} = K(X^{(i)}, X^{(j)})$ is positive semi-definite.

Commonly used kernels (implemented in the sketch below):

  • Polynomial kernel function: $K(x, y) = (x^\top y + 1)^d$
  • Gaussian kernel function: $K(x, y) = \exp\!\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)$
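
For concreteness, here is a NumPy sketch of both kernels together with an empirical check of the theorem above on a random sample; the degree, bandwidth, and data are my choices for illustration.

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    """K(x, z) = (x^T z + 1)^d."""
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||_2^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# Empirical check of the kernel-validity theorem: the Gram matrix
# of a valid kernel on any finite sample is positive semi-definite.
rng = np.random.default_rng(0)
S = rng.normal(size=(10, 4))
K = np.array([[gaussian_kernel(a, b) for b in S] for a in S])
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())      # >= 0 up to round-off
```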

Then, replacing the inner product $X^{(i)\top} X^{(j)}$ by $K_{ij} = K(X^{(i)}, X^{(j)})$, we get the optimization problem for the nonlinear classifier

$$\max_{\alpha} \ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i\alpha_j\, y^{(i)} y^{(j)} K_{ij} \qquad \text{subject to: } \sum_{i=1}^N \alpha_i y^{(i)} = 0, \ \ 0 \leq \alpha_i \leq C \ \text{ for all } i$$

And the prediction function is given by
$$\hat{y} = \sum_{i=1}^N \alpha_i^* y^{(i)} K(X^{(i)}, X) + b^*$$
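
A small sketch of this prediction function in NumPy; the inputs (`alpha`, `b`, the training set, and the kernel) are hypothetical placeholders standing for a dual solution obtained elsewhere, e.g. by the SMO procedure below.

```python
import numpy as np

def svm_decision(X_train, y_train, alpha, b, x, kernel):
    """Kernel SVM decision value: sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a_i * y_i * kernel(x_i, x)
               for a_i, y_i, x_i in zip(alpha, y_train, X_train)) + b

def svm_predict(X_train, y_train, alpha, b, x, kernel):
    """The predicted label is the sign of the decision value."""
    return np.sign(svm_decision(X_train, y_train, alpha, b, x, kernel))

# Hypothetical usage with the Gaussian kernel sketched earlier:
# label = svm_predict(X_train, y_train, alpha, b, x_new, gaussian_kernel)
```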

2.2 SMO Algorithm

SMO is short for sequential minimal optimization, which is a generalization of the coordinate ascent method.

Generally speaking, SMO maximizes over two variables simultaneously while the others are fixed, whereas the coordinate ascent method maximizes over only one variable while the others are fixed.

If we assume $\alpha_1$ and $\alpha_2$ are the variables chosen to be optimized, we can rewrite the dual problem as an optimization over $\alpha_1$ and $\alpha_2$:

$$\max_{\alpha_1, \alpha_2} \ \alpha_1 + \alpha_2 - \frac{1}{2}K_{11}\alpha_1^2 - \frac{1}{2}K_{22}\alpha_2^2 - K_{12}\, y^{(1)} y^{(2)}\alpha_1\alpha_2 - v_1 y^{(1)}\alpha_1 - v_2 y^{(2)}\alpha_2$$
$$\text{subject to: } y^{(1)}\alpha_1 + y^{(2)}\alpha_2 = \zeta, \qquad 0 \leq \alpha_i \leq C \ \text{ for } i = 1, 2$$

where $v_i = \sum_{j=3}^N \alpha_j y^{(j)} K_{ji}$ and $\zeta = -\sum_{i=3}^N y^{(i)}\alpha_i$.

The constraints confine $(\alpha_1, \alpha_2)$ to a line segment inside the box $[0, C] \times [0, C]$.

It is clear that $\alpha_1$ and $\alpha_2$ lie on a line constrained by the box both before and after the update, which means

$$y^{(1)}\alpha_1^{old} + y^{(2)}\alpha_2^{old} = \zeta = y^{(1)}\alpha_1^{new} + y^{(2)}\alpha_2^{new}$$

  • If $y^{(1)} = y^{(2)}$, we have

    $$\alpha_2^{new} = \alpha_1^{old} + \alpha_2^{old} - \alpha_1^{new} \in [\alpha_1^{old} + \alpha_2^{old} - C, \ \alpha_1^{old} + \alpha_2^{old}]$$

    Together with $0 \leq \alpha_2^{new} \leq C$, we have
    $$L \leq \alpha_2^{new} \leq H$$

    where
    $$L = \max\{0, \ \alpha_1^{old} + \alpha_2^{old} - C\}, \qquad H = \min\{C, \ \alpha_1^{old} + \alpha_2^{old}\}$$

  • If $y^{(1)} \neq y^{(2)}$, we have

    $$\alpha_2^{new} = \alpha_2^{old} - \alpha_1^{old} + \alpha_1^{new} \in [\alpha_2^{old} - \alpha_1^{old}, \ \alpha_2^{old} - \alpha_1^{old} + C]$$

    Together with $0 \leq \alpha_2^{new} \leq C$, we have

    $$L \leq \alpha_2^{new} \leq H$$

    where
    $$L = \max\{0, \ \alpha_2^{old} - \alpha_1^{old}\}, \qquad H = \min\{C, \ \alpha_2^{old} - \alpha_1^{old} + C\}$$

    (Both cases are collected in the small helper sketched after this list.)
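
The two cases reduce to a small helper that just transcribes the box bounds above (the function name is mine):

```python
def smo_bounds(alpha1_old, alpha2_old, y1, y2, C):
    """Feasible interval [L, H] for alpha2 after the joint SMO update."""
    if y1 == y2:
        L = max(0.0, alpha1_old + alpha2_old - C)
        H = min(C, alpha1_old + alpha2_old)
    else:
        L = max(0.0, alpha2_old - alpha1_old)
        H = min(C, alpha2_old - alpha1_old + C)
    return L, H
```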

Then substituting

$$\alpha_1 = y^{(1)}(\zeta - y^{(2)}\alpha_2)$$

into the optimization problem, we get
$$\max_{\alpha_2} \ y^{(2)}\big(y^{(2)} - y^{(1)} + \zeta K_{11} - \zeta K_{12} + v_1 - v_2\big)\alpha_2 - \frac{1}{2}(K_{11} + K_{22} - 2K_{12})\alpha_2^2 \qquad \text{subject to: } L \leq \alpha_2 \leq H$$

First we find the optimal solution without the box constraint. Taking the derivative w.r.t. $\alpha_2$ and setting it to zero, we have

$$\begin{aligned}
(K_{11} + K_{22} - 2K_{12})\,\alpha_2 &= y^{(2)}\big(y^{(2)} - y^{(1)} + \zeta K_{11} - \zeta K_{12} + v_1 - v_2\big) \\
&= y^{(2)}\Big(y^{(2)} - y^{(1)} + \zeta K_{11} - \zeta K_{12} + \big(g(X^{(1)}) - \textstyle\sum_{i=1}^{2}\alpha_i^{old} y^{(i)} K_{i1} - b\big) - \big(g(X^{(2)}) - \textstyle\sum_{i=1}^{2}\alpha_i^{old} y^{(i)} K_{i2} - b\big)\Big) \\
&= y^{(2)}\Big(E_1 - E_2 + \zeta K_{11} - \zeta K_{12} - \textstyle\sum_{i=1}^{2}\alpha_i^{old} y^{(i)} K_{i1} + \textstyle\sum_{i=1}^{2}\alpha_i^{old} y^{(i)} K_{i2}\Big) \\
&= y^{(2)}\Big(E_1 - E_2 + \big(y^{(1)}\alpha_1^{old} + y^{(2)}\alpha_2^{old}\big)(K_{11} - K_{12}) - \textstyle\sum_{i=1}^{2}\alpha_i^{old} y^{(i)} K_{i1} + \textstyle\sum_{i=1}^{2}\alpha_i^{old} y^{(i)} K_{i2}\Big) \\
&= y^{(2)}\big(E_1 - E_2 + y^{(2)}\alpha_2^{old}(K_{11} + K_{22} - 2K_{12})\big) \\
&= y^{(2)}(E_1 - E_2) + \alpha_2^{old}(K_{11} + K_{22} - 2K_{12})
\end{aligned}$$

where $g(X) = \sum_{i=1}^N \alpha_i^{old} y^{(i)} K(X^{(i)}, X) + b$ is the prediction value at $X$ and $E_i = g(X^{(i)}) - y^{(i)}$ is the prediction error.

So the un-truncated solution is

$$\alpha_2^{new, unc} = \alpha_2^{old} + \frac{y^{(2)}(E_1 - E_2)}{K_{11} + K_{22} - 2K_{12}}$$

Then truncating it, we get the optimal solution

$$\alpha_2^{new} = \begin{cases} H, & \alpha_2^{new, unc} > H \\ \alpha_2^{new, unc}, & L \leq \alpha_2^{new, unc} \leq H \\ L, & \alpha_2^{new, unc} < L \end{cases}$$
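
Putting the pieces together, here is a sketch of one SMO update for a chosen index pair `(i, j)` playing the role of $(\alpha_1, \alpha_2)$. The pair-selection heuristics are omitted, the Gram matrix is recomputed naively for clarity, and the update of $b$ follows the standard SMO rule from Platt's paper, which the post itself does not spell out.

```python
import numpy as np

def smo_pair_update(i, j, alpha, X, y, b, C, kernel):
    """One SMO step on the pair (alpha_i, alpha_j); returns updated (alpha, b)."""
    K = np.array([[kernel(xa, xb) for xb in X] for xa in X])
    g = K @ (alpha * y) + b                    # g(x_k) = sum_l alpha_l y_l K(x_l, x_k) + b
    E = g - y                                  # prediction errors E_k

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if eta <= 0:                               # degenerate pair: skip the update
        return alpha, b

    # Box bounds [L, H] for alpha_j (same logic as smo_bounds above).
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, alpha[j] - alpha[i] + C)
    if L >= H:
        return alpha, b

    # Un-truncated solution for alpha_j, then clip to [L, H].
    aj_new = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    # Keep the equality constraint y_i alpha_i + y_j alpha_j fixed.
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)

    # Standard SMO update of b (Platt's rule).
    b1 = b - E[i] - y[i] * K[i, i] * (ai_new - alpha[i]) - y[j] * K[j, i] * (aj_new - alpha[j])
    b2 = b - E[j] - y[i] * K[i, j] * (ai_new - alpha[i]) - y[j] * K[j, j] * (aj_new - alpha[j])
    if 0 < ai_new < C:
        b = b1
    elif 0 < aj_new < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)

    alpha = alpha.copy()
    alpha[i], alpha[j] = ai_new, aj_new
    return alpha, b
```

Repeating this update over heuristically chosen pairs until the KKT conditions hold (within a tolerance) yields the dual solution used in the prediction function above.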

3. Unconstrained SVM - Hinge loss function

Note that

  • $\xi_i = 0 \ \Longleftrightarrow \ y^{(i)}(\omega^\top X^{(i)} + b) \geq 1$
  • $\xi_i > 0 \ \Longleftrightarrow \ y^{(i)}(\omega^\top X^{(i)} + b) < 1$, in which case $\xi_i = 1 - y^{(i)}(\omega^\top X^{(i)} + b)$

so

$$\xi_i = \big[1 - y^{(i)}(\omega^\top X^{(i)} + b)\big]_+$$

We can rewrite SVM as an unconstrained optimization problem:
$$\min_{\omega, b} \ \frac{1}{2}\|\omega\|_2^2 + C\sum_{i=1}^N \big[1 - y^{(i)}(\omega^\top X^{(i)} + b)\big]_+$$

which is equivalent to
$$\min_{\omega, b} \ \sum_{i=1}^N \big[1 - y^{(i)}(\omega^\top X^{(i)} + b)\big]_+ + \lambda\|\omega\|_2^2$$

And the loss function $f(x) = [1 - x]_+$ is called the hinge loss function, which approximates the 0-1 loss function.
This problem can be optimized by the sub-gradient method.
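
A sketch of plain sub-gradient descent on the $\lambda$-form of the objective; the step size, iteration count, and toy data are my choices rather than anything from the original text.

```python
import numpy as np

def hinge_svm_subgradient(X, y, lam=0.01, lr=0.1, n_iter=2000):
    """Minimize sum_i [1 - y_i (w^T x_i + b)]_+ + lam * ||w||_2^2 by sub-gradient descent."""
    N, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                    # examples with a nonzero hinge term
        # A sub-gradient of the objective at (w, b).
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + 2.0 * lam * w
        grad_b = -y[active].sum()
        w -= (lr / N) * grad_w
        b -= (lr / N) * grad_b
    return w, b

# Toy usage (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(30, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(30, 2))])
y = np.array([1.0] * 30 + [-1.0] * 30)
w, b = hinge_svm_subgradient(X, y)
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```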

4. Summary

  • SVM is a linear classifier and can be extended to the nonlinear situation using the kernel trick.
  • SVM is determined only by the support vectors; changing the positions of the other points does not change the SVM.
  • From the viewpoint of the dual optimization problem, we use SMO and the KKT conditions to obtain the optimal solution of the primal optimization problem.