Machine Learning 06 - Support Vector Machine

I am working through Stanford's machine learning course taught by Andrew Ng, taking notes as I go to review and consolidate the material.
My knowledge is limited, so please bear with any errors or omissions, and feel free to point them out.

6.1 Large Margin Classification

6.1.1 Optimization objective

Here we introduce the last supervised learning algorithm of the course: the Support Vector Machine (SVM).

Hypothesis:

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
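
For concreteness, here is a minimal NumPy sketch of this hypothesis; theta and the inputs are made-up values, with x[0] = 1 as the usual bias term:

```python
import numpy as np

# Minimal sketch of the SVM hypothesis: unlike logistic regression,
# the output is a hard 0/1 decision rather than a probability.
def h(theta, x):
    return 1 if theta @ x >= 0 else 0

theta = np.array([-1.0, 2.0])          # decision boundary: 2*x1 - 1 = 0
print(h(theta, np.array([1.0, 0.2])))  # theta^T x = -0.6 < 0  -> 0
print(h(theta, np.array([1.0, 0.8])))  # theta^T x =  0.6 >= 0 -> 1
```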

Cost function:

$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$

where $\mathrm{cost}_1$ is the cost when $y = 1$ and $\mathrm{cost}_0$ is the cost when $y = 0$. An intuitive illustration is below:

(Figure: cost functions for $y = 1$ and $y = 0$)
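
As a sketch of what the figure shows, the two costs are piecewise linear ("hinge-like"): $\mathrm{cost}_1(z)$ is zero for $z \ge 1$ and grows linearly below, and $\mathrm{cost}_0(z)$ mirrors it. The unit slope below is an assumption; only the kink points at $\pm 1$ matter:

```python
import numpy as np

# Piecewise-linear SVM costs (hinge-like), assuming unit slope:
# cost1 wants z = theta^T x >= 1 when y = 1,
# cost0 wants z <= -1 when y = 0.
def cost1(z):
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-2.0, 2.0, 5)
print(cost1(z))  # [3. 2. 1. 0. 0.] -- zero once z >= 1
print(cost0(z))  # [0. 0. 1. 2. 3.] -- zero once z <= -1
```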

Decision boundary:

(Figure: decision boundary)

The SVM finds a decision boundary with the largest margin from the data. The effect of the regularization parameter $C$ is shown intuitively below:

(Figure: effect of the regularization parameter $C$)

6.1.2 Concept of kernels

In this part, in order to fit a non-linear decision boundary, we adapt the hypothesis to

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta_0 + \theta_1 f_1 + \theta_2 f_2 \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

(1) Polynomial

$$f_i = x_j^k \quad (i, j = 1, 2, \cdots)$$

It can fit the dataset very well, but we don't know in advance which features to add, and the expansion is computationally expensive.
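
To make that cost concrete, here is a sketch of explicit polynomial expansion using scikit-learn's PolynomialFeatures (the toy matrix is illustrative); the number of features grows combinatorially with the degree and the input dimension:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Explicit degree-3 polynomial expansion of two input features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # (2, 9): x1, x2, x1^2, x1*x2, x2^2, x1^3, ...
```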

(2) Gaussian Kernel
First, choose some landmarks $l^{(i)} \; (i = 1, 2, \cdots)$.

Second, define $f_i \; (i = 1, 2, \cdots)$, such as the Gaussian kernel:

$$f_i = \exp\left( -\frac{\| x - l^{(i)} \|^2}{2\sigma^2} \right) = \mathrm{sim}(x, l^{(i)})$$

It measures the similarity of two points:
  • If $x \approx l^{(i)}$: $f_i \approx 1$;
  • If $x$ is far from $l^{(i)}$: $f_i \approx 0$.

And $\sigma$ acts like a scale for the distance between the two points:

(Figure: example of $f_i$ for different values of $\sigma$)
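
A minimal NumPy sketch of this similarity and the scaling role of $\sigma$ (the function name, landmark, and sample values are mine, for illustration):

```python
import numpy as np

# Gaussian similarity: 1 when x coincides with the landmark,
# decaying toward 0 as x moves away.
def gaussian_sim(x, l, sigma):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 1.0])
landmark = np.array([3.0, 2.0])
# Larger sigma stretches the distance scale, so the same point
# looks "closer" to the landmark.
for sigma in (0.5, 1.0, 3.0):
    print(sigma, gaussian_sim(x, landmark, sigma))
```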

Finally, an example of what it predicts:

(Figure: example prediction)

6.1.3 SVM with kernels

(1) Choose landmarks

Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, \; l^{(2)} = x^{(2)}, \; \cdots, \; l^{(m)} = x^{(m)}$.

(2) Define kernels

We define $f$ using the Gaussian kernel:

$$f^{(i)} = \begin{bmatrix} f_0^{(i)} \\ f_1^{(i)} \\ \vdots \\ f_m^{(i)} \end{bmatrix} = \begin{bmatrix} 1 \\ \mathrm{sim}(x^{(i)}, l^{(1)}) \\ \vdots \\ \mathrm{sim}(x^{(i)}, l^{(m)}) \end{bmatrix}, \quad i = 1, 2, \cdots, m$$

with the convention $f_0^{(i)} = 1$ for the bias term.
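
A minimal NumPy sketch of this construction, with every training example used as a landmark (function and variable names are illustrative):

```python
import numpy as np

# Build the kernel feature matrix: F[i, j] = sim(x^(i), l^(j)),
# with a bias column f_0 = 1 prepended.
def kernel_features(X, landmarks, sigma):
    sq_dists = np.sum((X[:, None, :] - landmarks[None, :, :]) ** 2, axis=2)
    F = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
F = kernel_features(X, landmarks=X, sigma=1.0)  # l^(i) = x^(i)
print(F.shape)  # (3, 4): m examples, m + 1 features including f_0
```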

(3) Training

$$\min_\theta \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2$$

Use a minimization algorithm (in practice, an off-the-shelf SVM package) to solve it.
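
As a sketch of what "using a package" looks like, scikit-learn's SVC with kernel='rbf' trains a Gaussian-kernel SVM; its gamma parameter plays the role of $1/(2\sigma^2)$, and the data below is made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy clusters.
X = np.array([[0, 0], [1, 1], [2, 2],
              [8, 8], [9, 9], [10, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

sigma = 1.0
clf = SVC(C=1.0, kernel='rbf', gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)
print(clf.predict([[1.5, 1.5], [9.5, 9.5]]))  # expected: [0 1]
```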

(4) Evaluation

  • Large $C$: lower bias, higher variance.
  • Small $C$: higher bias, lower variance.
  • Large $\sigma^2$: higher bias, lower variance ($f$ is more "smooth").
  • Small $\sigma^2$: lower bias, higher variance.
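
In practice this trade-off is usually navigated by choosing $C$ and $\sigma^2$ with cross-validation. A sketch using scikit-learn's GridSearchCV on a toy dataset (make_moons is a stand-in for real data, and gamma again stands for $1/(2\sigma^2)$):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy non-linear dataset.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Small gamma = large sigma^2 (smoother f); large C = less regularization.
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```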

(5) Notes

  • Perform feature scaling before using the Gaussian kernel (see the sketch after this list).
  • Not all similarity functions make valid kernels; they need to satisfy Mercer's theorem so that SVM packages run correctly.
  • Other kernels: polynomial kernel, string kernel, …
  • Multi-class classification: use the one-vs-all method.
  • If $n \gg m$, use logistic regression or an SVM without a kernel; if $n$ is small and $m$ is intermediate, use an SVM with a Gaussian kernel; if $m \gg n$, create more features and turn to the first case. A neural network is likely to work well in most of these settings.
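
On the feature-scaling note: without scaling, the feature with the largest range dominates $\| x - l^{(i)} \|^2$. A minimal sketch using a scikit-learn pipeline, so the scaling fitted on the training data is reapplied automatically at prediction time (toy data for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# x2 has a range ~1000x larger than x1 and would dominate the
# kernel's distance computation without scaling.
X = np.array([[1.0, 1000.0], [2.0, 2000.0],
              [8.0, 1100.0], [9.0, 2100.0]])
y = np.array([0, 0, 1, 1])

model = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf'))
model.fit(X, y)
print(model.predict([[1.5, 1500.0]]))  # close to the class-0 examples in x1
```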