Perceptron
The perceptron is a basic model underlying both neural networks and SVMs (support vector machines). It is a linear classifier, typically used on linearly separable two-class data.
1. Primal form
(i) Distance between a point and a hyperplane
- Since $x_0$ is on the hyperplane, it satisfies $w \cdot x_0 + b = 0$ (assume $\|w\|_2 = 1$), so $b = -w \cdot x_0$.
- Then project the vector $x_1 - x_0$ onto $w$, which gives the distance between $x_1$ and the hyperplane $w \cdot x + b = 0$:
$$w \cdot (x_1 - x_0) = w \cdot x_1 - w \cdot x_0 = w \cdot x_1 + b.$$
- Note that if $x_1$ is on the other side of the hyperplane, the distance takes the form
$$w \cdot (x_0 - x_1) = w \cdot x_0 - w \cdot x_1 = -(w \cdot x_1 + b).$$
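As a quick numeric sanity check, here is a minimal sketch of the signed-distance computation (the particular `w`, `b`, and `x1` below are made-up values; dividing by $\|w\|_2$ covers the case where $w$ is not a unit vector):

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector of the hyperplane
b = -5.0                   # offset
x1 = np.array([4.0, 3.0])  # query point

# Signed distance from x1 to the hyperplane w . x + b = 0.
signed_dist = (w @ x1 + b) / np.linalg.norm(w)
print(signed_dist)  # (12 + 12 - 5) / 5 = 3.8
```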
(ii) Perceptron model
Assume that we have a data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^n$ and $y_i = +1$ or $y_i = -1$. In addition, assume the data set can be separated by a hyperplane. The perceptron model is $f(x) = \operatorname{sign}(w \cdot x + b)$, and we use the total distance from the misclassified data to the hyperplane as the loss function:
$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b),$$
where $M$ is the set of misclassified points.
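Here is a minimal sketch of this loss in code, assuming NumPy arrays `X` of shape `(N, n)`, labels `y` in `{+1, -1}`, and current parameters `w`, `b` (the function name is just illustrative):

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """L(w, b) = -sum of y_i (w . x_i + b) over misclassified points."""
    margins = y * (X @ w + b)   # y_i (w . x_i + b) for every point
    return -margins[margins <= 0].sum()
```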
(iii) Algorithm of perceptron
To minimize the loss function $L(w, b)$, the stochastic gradient descent method is used. Note that the derivatives of $L(w, b)$ are
$$\nabla_w L(w, b) = -\sum_{x_i \in M} y_i x_i, \qquad \nabla_b L(w, b) = -\sum_{x_i \in M} y_i.$$
- Suppose the current parameters are $(w, b)$.
- Select a data point $(x_i, y_i)$ which is misclassified by the hyperplane $w \cdot x + b = 0$, i.e. $y_i (w \cdot x_i + b) \le 0$; then update the parameters with learning rate $\eta \in (0, 1]$:
$$w \leftarrow w + \eta y_i x_i, \qquad b \leftarrow b + \eta y_i.$$
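Putting the update rule together, a minimal sketch of the primal algorithm might look like the following (assuming linearly separable NumPy data as above; `max_epochs` is just a safety cap, not part of the algorithm itself):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Primal-form perceptron trained with stochastic gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for i in range(len(X)):
            # Misclassified point: y_i (w . x_i + b) <= 0
            if y[i] * (X[i] @ w + b) <= 0:
                w += eta * y[i] * X[i]  # w <- w + eta y_i x_i
                b += eta * y[i]         # b <- b + eta y_i
                updated = True
        if not updated:  # no misclassified points remain
            break
    return w, b
```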
(iv) Convergence of algorithm
If we denote $\hat{w} = (w^T, b)^T$ and $\hat{x} = (x^T, 1)^T$, the hyperplane can be rewritten as $\hat{w} \cdot \hat{x} = 0$.
Theorem (Novikoff): Let $R = \max_{1 \le i \le N} \|\hat{x}_i\|$, and suppose there exist a unit vector $\hat{w}_{\mathrm{opt}}$ ($\|\hat{w}_{\mathrm{opt}}\| = 1$) and a constant $\gamma > 0$ satisfying $y_i (\hat{w}_{\mathrm{opt}} \cdot \hat{x}_i) \ge \gamma$ for all $i$ (such a $\gamma$ exists since the data set is separable). Then the number of update steps $k$ used in the stochastic gradient descent satisfies
$$k \le \left(\frac{R}{\gamma}\right)^2.$$
Proof: Let $\hat{w}_k$ be the parameter vector after the $k$-th update, with $\hat{w}_0 = 0$.
(1) Combining the updates $\hat{w}_k = \hat{w}_{k-1} + \eta y_i \hat{x}_i$, we have
$$\hat{w}_k \cdot \hat{w}_{\mathrm{opt}} = \hat{w}_{k-1} \cdot \hat{w}_{\mathrm{opt}} + \eta y_i (\hat{w}_{\mathrm{opt}} \cdot \hat{x}_i) \ge \hat{w}_{k-1} \cdot \hat{w}_{\mathrm{opt}} + \eta \gamma \ge \cdots \ge k \eta \gamma.$$
(2) Recall that in the update step the data point $(x_i, y_i)$ is misclassified by the hyperplane $\hat{w}_{k-1} \cdot \hat{x} = 0$, which says $y_i (\hat{w}_{k-1} \cdot \hat{x}_i) \le 0$. Then
$$\|\hat{w}_k\|^2 = \|\hat{w}_{k-1}\|^2 + 2 \eta y_i (\hat{w}_{k-1} \cdot \hat{x}_i) + \eta^2 \|\hat{x}_i\|^2 \le \|\hat{w}_{k-1}\|^2 + \eta^2 R^2 \le \cdots \le k \eta^2 R^2.$$
(3) Combining the two results from (1) and (2), we have
$$k \eta \gamma \le \hat{w}_k \cdot \hat{w}_{\mathrm{opt}} \le \|\hat{w}_k\| \, \|\hat{w}_{\mathrm{opt}}\| = \|\hat{w}_k\| \le \sqrt{k} \, \eta R.$$
So $k \le (R/\gamma)^2$. $\blacksquare$
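To see the bound in action, here is a small numeric check on a made-up separable data set; the unit separator $(1, 1, 0)^T/\sqrt{2}$ is chosen by hand, and any valid unit separator gives a usable $\gamma$:

```python
import numpy as np

# Toy 2-D data separable by x1 + x2 = 0 (illustrative values).
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
X_hat = np.hstack([X, np.ones((len(X), 1))])  # x_hat = (x, 1)

# Run SGD on w_hat = (w, b), counting updates k.
eta, k = 1.0, 0
w_hat = np.zeros(X_hat.shape[1])
while True:
    mis = [i for i in range(len(X_hat)) if y[i] * (X_hat[i] @ w_hat) <= 0]
    if not mis:
        break
    i = mis[0]
    w_hat += eta * y[i] * X_hat[i]
    k += 1

# Bound from the theorem, using the hand-picked unit separator.
w_opt = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
gamma = min(y[i] * (w_opt @ X_hat[i]) for i in range(len(X_hat)))
R = max(np.linalg.norm(x) for x in X_hat)
print(k, "updates; bound (R/gamma)^2 =", (R / gamma) ** 2)  # 1 <= 1.33...
```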
2. Dual form
We can use the dual form to reduce the computation, which means using coefficients $\alpha_i$ (together with $b$) to represent $w$ and $b$; the inner products $x_j \cdot x_i$ can then be precomputed once as a Gram matrix.
In the last section, we saw that the update form of stochastic gradient descent is $w \leftarrow w + \eta y_i x_i$ and $b \leftarrow b + \eta y_i$. If we denote the number of updates triggered by the $i$-th data point as $n_i$ and set $\alpha_i = n_i \eta$, then starting from $w = 0$ and $b = 0$ we have
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad b = \sum_{i=1}^{N} \alpha_i y_i.$$
So we can rewrite the model as
$$f(x) = \operatorname{sign}\!\left(\sum_{j=1}^{N} \alpha_j y_j (x_j \cdot x) + b\right).$$
We also use the stochastic gradient descent method to update $\alpha = (\alpha_1, \ldots, \alpha_N)^T$ and $b$.
- Denote the current parameters as $\alpha$ and $b$.
- Select a misclassified data point $(x_i, y_i)$, which satisfies
$$y_i \left(\sum_{j=1}^{N} \alpha_j y_j (x_j \cdot x_i) + b\right) \le 0,$$
then update as
$$\alpha_i \leftarrow \alpha_i + \eta, \qquad b \leftarrow b + \eta y_i.$$
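A matching sketch of the dual algorithm, under the same assumptions as the primal version (the Gram matrix of inner products is computed once up front):

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual-form perceptron: learn alpha_i = n_i * eta instead of w."""
    N = len(X)
    gram = X @ X.T  # precompute all inner products x_j . x_i
    alpha = np.zeros(N)
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for i in range(N):
            # Misclassified: y_i (sum_j alpha_j y_j (x_j . x_i) + b) <= 0
            if y[i] * ((alpha * y) @ gram[:, i] + b) <= 0:
                alpha[i] += eta    # alpha_i <- alpha_i + eta
                b += eta * y[i]    # b <- b + eta y_i
                updated = True
        if not updated:
            break
    w = (alpha * y) @ X  # recover w = sum_i alpha_i y_i x_i if needed
    return alpha, w, b
```

With the Gram matrix in hand, each misclassification test costs $O(N)$ multiplications regardless of the input dimension $n$, which is the appeal of the dual form when $n$ is large.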