Perceptron Model
- Definition: suppose the input space (feature space) is $\mathcal{X} \subseteq \mathbf{R}^n$ and the output space is $\mathcal{Y}=\{+1,-1\}$. An input $x \in \mathcal{X}$ is the feature vector of an instance, corresponding to a point in the input space (feature space); an output $y \in \mathcal{Y}$ is the class of the instance. The function from the input space to the output space
$$f(x)=\operatorname{sign}(w\cdot x+b)$$
is called the perceptron model, where $w$ and $b$ are its parameters: $w \in \mathbf{R}^n$ is called the weight (vector), $b \in \mathbf{R}$ is called the bias, and $w\cdot x$ denotes the inner product of $w$ and $x$. $\operatorname{sign}$ is the sign function:
$$\operatorname{sign}(x)=\begin{cases} +1 & x\ge 0\\ -1 & x<0 \end{cases}$$
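As a minimal sketch, the model can be written in a few lines of NumPy (the names `sign`, `perceptron`, `w`, `b` are illustrative, not fixed by the text):

```python
import numpy as np

def sign(x):
    """sign(x) = +1 if x >= 0, else -1."""
    return np.where(x >= 0, 1, -1)

def perceptron(x, w, b):
    """f(x) = sign(w . x + b)."""
    return sign(np.dot(w, x) + b)
```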
Perceptron Learning Strategy
- If $(x_i,y_i)$ is correctly classified, then $y_i(w\cdot x_i+b)>0$; if $(x_i,y_i)$ is misclassified, then $y_i(w\cdot x_i+b)\le 0$.
Definition: $L(w,b)=-\sum\limits_{x_i \in M} y_i(w\cdot x_i+b)$, where $M$ is the set of misclassified points; up to the factor $1/\lVert w\rVert$, this is the total distance from the misclassified points to the separating hyperplane.
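A minimal NumPy sketch of this loss, assuming the samples are stacked in a matrix `X` with labels `y` (names illustrative):

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """L(w, b) = -sum over misclassified points of y_i * (w . x_i + b)."""
    margins = y * (X @ w + b)   # y_i * (w . x_i + b) for every sample
    mis = margins <= 0          # M: the misclassified set
    return -margins[mis].sum()
```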
Perceptron Learning Algorithm
- Here we use stochastic gradient descent.
$$\nabla_w L(w,b)=-\sum\limits_{x_i \in M} y_i x_i$$
$$\nabla_b L(w,b)=-\sum\limits_{x_i \in M} y_i$$
For a single misclassified point $(x_i,y_i)$, the updates are
$$w \leftarrow w+\eta y_i x_i$$
$$b \leftarrow b+\eta y_i$$
where $\eta$ is the learning rate; a sketch of the gradients follows below.
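A sketch of these two gradients in NumPy, under the same assumed `X`, `y` layout as above:

```python
import numpy as np

def perceptron_gradients(X, y, w, b):
    """Gradients of L(w, b), summed over the misclassified set M."""
    mis = y * (X @ w + b) <= 0                        # membership in M
    grad_w = -(y[mis][:, None] * X[mis]).sum(axis=0)  # -sum_{x_i in M} y_i * x_i
    grad_b = -y[mis].sum()                            # -sum_{x_i in M} y_i
    return grad_w, grad_b
```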
- Algorithm (primal form):
(1) Input: training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i \in \mathcal{X}=\mathbf{R}^n$, $y_i \in \mathcal{Y}=\{-1,+1\}$, $i=1,2,\dots,N$, and learning rate $\eta$ ($0<\eta\le 1$). Output: $w,b$; $f(x)=\operatorname{sign}(w\cdot x+b)$. Initialize $w$ and $b$ (e.g. to $0$).
(2) Pick a data point $(x_i,y_i)$ from the training set.
(3) If $y_i(w\cdot x_i+b)\le 0$, update
$$w \leftarrow w+\eta y_i x_i$$
$$b \leftarrow b+\eta y_i$$
(4) Go back to (2), and repeat until no point in the training set is misclassified; a runnable sketch of the whole loop follows.
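Putting steps (1) to (4) together, a minimal NumPy sketch of the primal algorithm (the `max_epochs` guard is an addition for non-separable data, not part of the algorithm above):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Primal perceptron: repeat steps (2)-(3) until nothing is misclassified."""
    w = np.zeros(X.shape[1])                 # initialize w = 0
    b = 0.0                                  # initialize b = 0
    for _ in range(max_epochs):              # guard for non-separable data
        errors = 0
        for i in range(len(X)):
            if y[i] * (X[i] @ w + b) <= 0:   # step (3): misclassified?
                w += eta * y[i] * X[i]       # w <- w + eta * y_i * x_i
                b += eta * y[i]              # b <- b + eta * y_i
                errors += 1
        if errors == 0:                      # step (4): done
            break
    return w, b
```

On linearly separable data the loop stops after finitely many updates (by the convergence theorem proved next); `max_epochs` only guards against non-separable input.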
- Convergence of the algorithm (Novikoff's theorem): let the training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ be linearly separable, where $x_i \in \mathcal{X}=\mathbf{R}^n$, $y_i \in \mathcal{Y}=\{-1,+1\}$, $i=1,2,\dots,N$. Then:
Writing $\hat{w}=(w^T,b)^T$ and $\hat{x}=(x^T,1)^T$ for the augmented weight and input vectors, there exists a hyperplane $\hat{w}_{opt}\cdot\hat{x}=w_{opt}\cdot x+b_{opt}=0$ with $\lVert\hat{w}_{opt}\rVert=1$ that separates the training set completely correctly, and there exists $\gamma>0$ such that for all $i=1,2,\dots,N$
$$y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot x_i+b_{opt})\ge\gamma$$
Let $R=\max\limits_{1\le i\le N}\lVert\hat{x}_i\rVert$. Then the number of misclassification updates $k$ made by the perceptron algorithm on the training set satisfies
$$k\le\left(\frac{R}{\gamma}\right)^2$$
- Proof:
(1)
Take $\hat{w}_{opt}$ with $\lVert\hat{w}_{opt}\rVert=1$ such that the hyperplane $\hat{w}_{opt}\cdot\hat{x}=w_{opt}\cdot x+b_{opt}=0$ separates the data. Since the training set is finite and linearly separable, for all $i=1,2,\dots,N$
$$y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot x_i+b_{opt})>0$$
so taking $\gamma=\min\limits_i\{y_i(w_{opt}\cdot x_i+b_{opt})\}$ gives
$$y_i(\hat{w}_{opt}\cdot\hat{x}_i)=y_i(w_{opt}\cdot x_i+b_{opt})\ge\gamma$$
(2) The perceptron starts from $\hat{w}_0=0$ and updates the weights whenever a point is misclassified. Let $\hat{w}_{k-1}$ be the augmented weight vector before the $k$-th misclassification,
$$\hat{w}_{k-1}=(w_{k-1}^T,b_{k-1})^T$$
Then the $k$-th misclassified instance $(x_i,y_i)$ satisfies $y_i(\hat{w}_{k-1}\cdot\hat{x}_i)=y_i(w_{k-1}\cdot x_i+b_{k-1})\le 0$, and the update is $\hat{w}_k=\hat{w}_{k-1}+\eta y_i\hat{x}_i$.
We prove two inequalities:
1) $\hat{w}_k\cdot\hat{w}_{opt}\ge k\eta\gamma$:
$$\hat{w}_k\cdot\hat{w}_{opt}=\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta y_i\,\hat{w}_{opt}\cdot\hat{x}_i\ge\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta\gamma$$
Recursing on this inequality,
$$\hat{w}_k\cdot\hat{w}_{opt}\ge\hat{w}_{k-1}\cdot\hat{w}_{opt}+\eta\gamma\ge\hat{w}_{k-2}\cdot\hat{w}_{opt}+2\eta\gamma\ge\dots\ge k\eta\gamma$$
2) $\lVert\hat{w}_k\rVert^2\le k\eta^2 R^2$:
$$\lVert\hat{w}_k\rVert^2=\lVert\hat{w}_{k-1}\rVert^2+2\eta y_i\,\hat{w}_{k-1}\cdot\hat{x}_i+\eta^2\lVert\hat{x}_i\rVert^2\le\lVert\hat{w}_{k-1}\rVert^2+\eta^2\lVert\hat{x}_i\rVert^2\le\lVert\hat{w}_{k-1}\rVert^2+\eta^2 R^2\le\lVert\hat{w}_{k-2}\rVert^2+2\eta^2 R^2\le\dots\le k\eta^2 R^2$$
where the first inequality uses $y_i\,\hat{w}_{k-1}\cdot\hat{x}_i\le 0$ (the point was misclassified).
Both inequalities are proved. Combining them,
$$k\eta\gamma\le\hat{w}_k\cdot\hat{w}_{opt}\le\lVert\hat{w}_k\rVert\,\lVert\hat{w}_{opt}\rVert\le\sqrt{k}\,\eta R$$
so
$$k^2\gamma^2\le kR^2$$
$$k\le\left(\frac{R}{\gamma}\right)^2$$
which completes the proof.
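As an illustrative sanity check (the toy data and every name here are my own, not from the text), one can count the actual number of updates on a small linearly separable set and compare it with the bound $(R/\gamma)^2$ computed from a known separating hyperplane:

```python
import numpy as np

# Toy linearly separable data, labeled by a known hyperplane w_true . x + b_true
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
w_true, b_true = np.array([1.0, -1.0]), 0.3
m = X @ w_true + b_true
keep = np.abs(m) >= 0.2                        # leave a gap so the set is separable
X, y = X[keep], np.where(m[keep] > 0, 1, -1)

# Run the perceptron (eta = 1) and count updates k
w, b, k = np.zeros(2), 0.0, 0
changed = True
while changed:
    changed = False
    for i in range(len(X)):
        if y[i] * (X[i] @ w + b) <= 0:
            w, b, k = w + y[i] * X[i], b + y[i], k + 1
            changed = True

# Novikoff bound: gamma from the known hyperplane, R from the augmented inputs
X_hat = np.hstack([X, np.ones((len(X), 1))])   # x_hat = (x, 1)
w_hat = np.append(w_true, b_true)
w_hat = w_hat / np.linalg.norm(w_hat)          # ||w_hat_opt|| = 1
gamma = np.min(y * (X_hat @ w_hat))
R = np.max(np.linalg.norm(X_hat, axis=1))
print(f"updates k = {k}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")
```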
- Dual form
Starting from $w=0$ and $b=0$, the primal updates
$$w \leftarrow w+\eta y_i x_i$$
$$b \leftarrow b+\eta y_i$$
accumulate into
$$w=\sum\limits_{i=1}^N a_i y_i x_i$$
$$b=\sum\limits_{i=1}^N a_i y_i$$
where $N$ is the number of training samples and $a_i=n_i\eta\ge 0$, with $n_i$ the number of times instance $i$ has been misclassified.
- Algorithm (dual form):
(1) Input: training set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i \in \mathcal{X}=\mathbf{R}^n$, $y_i \in \mathcal{Y}=\{-1,+1\}$, $i=1,2,\dots,N$, and learning rate $\eta$ ($0<\eta\le 1$). Output: $a,b$; $f(x)=\operatorname{sign}\left(\sum\limits_{j=1}^N a_j y_j x_j\cdot x+b\right)$, where $a=(a_1,a_2,\dots,a_N)$. Initialize $a=0$, $b=0$.
(2) Pick a data point $(x_i,y_i)$ from the training set.
(3) If $y_i\left(\sum\limits_{j=1}^N a_j y_j x_j\cdot x_i+b\right)\le 0$, update
$$a_i \leftarrow a_i+\eta$$
$$b \leftarrow b+\eta y_i$$
(4) Go back to (2), and repeat until no point in the training set is misclassified.
- Gram matrix: $G=[x_i\cdot x_j]_{N\times N}$. The dual form touches the inputs only through inner products $x_j\cdot x_i$, so they can be precomputed once and stored in $G$.
Implementation code (a sketch is given below).
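A minimal NumPy sketch of the dual algorithm using the Gram matrix (names illustrative; the `max_epochs` guard is again an addition for non-separable data):

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual perceptron: learn one coefficient a_i per sample, plus the bias b."""
    n = len(X)
    G = X @ X.T                  # Gram matrix, G[j, i] = x_j . x_i
    a, b = np.zeros(n), 0.0
    for _ in range(max_epochs):  # guard for non-separable data
        errors = 0
        for i in range(n):
            # misclassified iff y_i * (sum_j a_j y_j x_j . x_i + b) <= 0
            if y[i] * (np.sum(a * y * G[:, i]) + b) <= 0:
                a[i] += eta      # a_i <- a_i + eta
                b += eta * y[i]  # b   <- b + eta * y_i
                errors += 1
        if errors == 0:
            break
    return a, b

# Recover the primal weights afterwards: w = sum_i a_i * y_i * x_i
# w = (a * y) @ X
```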