In boosting, the ensemble consists of multiple very simple member classifiers that perform only slightly better than random guessing (rough rules of thumb); these are called weak learners. A typical weak classifier is a single-level decision tree, i.e. a decision stump.
AdaBoost trains the weak learners on the entire training set, but the training samples are reassigned weights at every iteration, so that each new learner is built on the mistakes of the previous one, gradually producing a stronger classifier.
The AdaBoost weighted classifier:
F(\mathbf{x})=\sum_{t=1}^{T}\alpha_{t}h(\mathbf{x};\theta_{t}),\quad h(\mathbf{x};\theta_{t})\in\{-1,1\}
Final decision: f(\mathbf{x})=\mathrm{sign}\{F(\mathbf{x})\}
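Concretely, each weak learner h(\mathbf{x};\theta_{t}) can be a decision stump that thresholds a single feature; \theta_{t} then bundles the feature index, the threshold, and the output polarity. A minimal sketch (the names feature, threshold, and polarity are illustrative, not from the original):

import numpy as np

def stump_predict(X, feature, threshold, polarity):
    # h(x; theta): +1 on one side of the threshold, -1 on the other
    return polarity * np.where(X[:, feature] <= threshold, 1, -1)

F(\mathbf{x}) is then just a weighted sum of such stump outputs.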
The error function is the exponential loss:
\arg\min_{\alpha_{t},\theta_{t},\,t=1,2,...,T}\sum_{i=1}^{N}\exp(-y_{i}F(\mathbf{x}_{i}))
When a sample is misclassified, y_{i}F(\mathbf{x}_{i})<0 and its loss term exceeds 1; when it is classified correctly, y_{i}F(\mathbf{x}_{i})>0 and its loss term falls below 1.
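A quick numeric check of this behavior (the margin values below are made up for illustration):

import numpy as np

margins = np.array([2.0, 0.5, -0.5, -2.0])   # y_i * F(x_i); negative = misclassified
print(np.exp(-margins))                      # [0.135 0.607 1.649 7.389]

Misclassified samples dominate the loss, which is exactly what drives the re-weighting below.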
Derivation
At the m-th iteration, the objective function over the first m classifiers is:
J(\alpha,\theta)=\sum_{i=1}^{N}\exp(-y_{i}(F_{m-1}(\mathbf{x}_{i})+\alpha h(\mathbf{x}_{i};\theta)))
(\alpha_{m},\theta_{m})=\arg\min_{\alpha,\theta}J(\alpha,\theta)
Simplifying the objective function:
J(\alpha,\theta)=\sum_{i=1}^{N}\exp(-y_{i}F_{m-1}(\mathbf{x}_{i}))\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta))=\sum_{i=1}^{N}w_{i}^{m}\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta))
where w_{i}^{m}=\exp(-y_{i}F_{m-1}(\mathbf{x}_{i})) is fixed by the previous rounds and plays the role of a sample weight.
First solve for \theta_{m}=\arg\min_{\theta}\sum_{i=1}^{N}w_{i}^{m}\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta)). Setting \alpha aside for the moment, this amounts to minimizing the total weight of the misclassified samples:
\theta_{m}=\arg\min_{\theta}\Big\{P_{m}=\sum_{i=1}^{N}w_{i}^{m}I(1-y_{i}h(\mathbf{x}_{i};\theta))\Big\},\quad I(u)=\begin{cases}0, & u=0\\1, & \text{otherwise}\end{cases}
P_{m} is the total weight of the samples misclassified by the m-th classifier; the classifier is required to satisfy P_{m}<0.5, i.e. to do better than random guessing.
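In code, P_{m} is simply the weight mass on the misclassified samples (the arrays below are illustrative):

import numpy as np

w = np.array([0.1, 0.2, 0.3, 0.4])   # normalized sample weights w_i^m
y = np.array([1, 1, -1, -1])         # true labels
h = np.array([1, -1, -1, -1])        # stump predictions h(x_i; theta)
P_m = w[y * h < 0].sum()             # weight of the misclassified samples
print(P_m)                           # 0.2 < 0.5, so this stump is acceptable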
Next, solve for the classifier weight \alpha_{m}. From
J(\alpha,\theta_{m})=\sum_{i=1}^{N}w_{i}^{m}\cdot\exp(-y_{i}\alpha h(\mathbf{x}_{i};\theta_{m}))
the total weight of the misclassified samples is \sum_{y_{i}h(\mathbf{x}_{i};\theta_{m})<0}w_{i}^{m}=P_{m}, where y_{i}h(\mathbf{x}_{i};\theta_{m})=-1; the total weight of the correctly classified samples is \sum_{y_{i}h(\mathbf{x}_{i};\theta_{m})>0}w_{i}^{m}=1-P_{m}, where y_{i}h(\mathbf{x}_{i};\theta_{m})=1. It follows that:
\alpha_{m}=\arg\min_{\alpha}[\exp(-\alpha)(1-P_{m})+\exp(\alpha)P_{m}]
Differentiating with respect to \alpha and setting the derivative to zero:
[\exp(-\alpha)(1-P_{m})+\exp(\alpha)P_{m}]'=-\exp(-\alpha)(1-P_{m})+\exp(\alpha)P_{m}=0\;\Rightarrow\;e^{2\alpha}=\frac{1-P_{m}}{P_{m}}
Solving gives:
\alpha_{m}=\frac{1}{2}\ln\frac{1-P_{m}}{P_{m}}
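A few sample values show how \alpha_{m} rewards accurate weak learners (the error rates below are illustrative):

import numpy as np

for P in [0.1, 0.3, 0.45, 0.5]:
    print(P, 0.5 * np.log((1 - P) / P))
# 0.1 -> 1.099, 0.3 -> 0.424, 0.45 -> 0.100, 0.5 -> 0.0 (no better than random, zero say)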
Updating the sample weights:
w_{i}^{(m+1)}=\exp(-y_{i}F_{m}(\mathbf{x}_{i}))=w_{i}^{(m)}\exp(-y_{i}\alpha_{m}h(\mathbf{x}_{i};\theta_{m}))
Normalization: w_{i}^{m+1}=w_{i}^{m+1}/\sum_{i=1}^{N}w_{i}^{m+1}
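Putting one boosting round together: weighted error, classifier weight, re-weighting, normalization (a sketch; the arrays are illustrative):

import numpy as np

w = np.full(4, 0.25)                   # uniform initial weights
y = np.array([1, 1, -1, -1])
h = np.array([1, -1, -1, -1])          # stump predictions; the second sample is wrong
P_m = w[y * h < 0].sum()               # weighted error, here 0.25
alpha = 0.5 * np.log((1 - P_m) / P_m)  # classifier weight, here ~0.549
w = w * np.exp(-alpha * y * h)         # up-weight mistakes, down-weight hits
w = w / w.sum()                        # normalize so the weights sum to 1
print(alpha, w)                        # the misclassified sample now carries weight 0.5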
The algorithm in pseudocode:
Input: training data D=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{N},\ y_{i}\in\{-1,1\}
Output: classification model F(\mathbf{x})
1. Initialize the weight vector of the training data P^{1}=\{\frac{1}{N},\frac{1}{N},...,\frac{1}{N}\}, satisfying \sum_{i=1}^{N}P_{i}=1
2. for t=1 to T
3. -------- Train a weak classifier h_{t}(\mathbf{x}) using the weights P^{t}
4. -------- Compute the weak classifier's error rate \epsilon_{t}=\sum_{i=1}^{N}P^{t}_{i}[[h_{t}(\mathbf{x}_{i})\neq y_{i}]]
5. -------- Compute the weak classifier's weight \alpha_{t}=\frac{1}{2}\ln\frac{1-\epsilon_{t}}{\epsilon_{t}}; the smaller \epsilon_{t} is, the larger \alpha_{t} becomes, and \epsilon_{t} must stay below 0.5 for \alpha_{t} to be positive
6. -------- Recompute the weight vector of the training data P^{t+1}_{i}=P^{t}_{i}\exp(-\alpha_{t}y_{i}h_{t}(\mathbf{x}_{i})); misclassified samples gain weight while correctly classified samples lose weight
7. -------- Normalize the weight vector: P^{t+1}_{i}=P^{t+1}_{i}/\sum_{i=1}^{N}P^{t+1}_{i}
8. end for
9. Final decision: F(\mathbf{x})=\mathrm{sgn}[f_{T}(\mathbf{x})]=\mathrm{sgn}(\sum_{t=1}^{T}\alpha_{t}h_{t}(\mathbf{x}))
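The pseudocode translates almost line for line into NumPy. Below is a minimal sketch of binary AdaBoost with decision stumps; the exhaustive stump search is one simple choice of weak learner, and all function and variable names are illustrative:

import numpy as np

def fit_stump(X, y, w):
    # weak learner: exhaustive search for the stump with minimal weighted error
    best = (np.inf, 0, 0.0, 1)                    # (error, feature, threshold, polarity)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, f] <= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost_fit(X, y, T=20):
    # y must take values in {-1, 1}
    N = X.shape[0]
    P = np.full(N, 1.0 / N)                       # step 1: uniform weights
    ensemble = []
    for t in range(T):                            # step 2
        eps, f, thr, pol = fit_stump(X, y, P)     # steps 3-4
        if eps >= 0.5:                            # no better than random: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))  # step 5 (guard a perfect stump)
        pred = pol * np.where(X[:, f] <= thr, 1, -1)
        P = P * np.exp(-alpha * y * pred)         # step 6: re-weight the samples
        P = P / P.sum()                           # step 7: normalize
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def adaboost_predict(X, ensemble):
    # step 9: sign of the weighted vote
    F = np.zeros(X.shape[0])
    for alpha, f, thr, pol in ensemble:
        F += alpha * pol * np.where(X[:, f] <= thr, 1, -1)
    return np.sign(F)

# usage on a toy problem
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
model = adaboost_fit(X, y, T=5)
print(adaboost_predict(X, model))                 # [ 1.  1. -1. -1.]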
Initially, all the '+' and '-' samples carry equal weight. After the first classifier, the '+' samples on the right are misclassified, so the weights of correctly classified samples are decreased and the weights of the misclassified ones are increased. The second classifier then accommodates the heavily weighted '+' samples that were previously misclassified, but now gets the '-' samples on the left wrong, so their weights are increased while the correctly classified samples' weights shrink. Finally, all the classifiers are combined according to their classifier weights.
As shown in the figure, the three classifiers divide the decision region into 6 parts, numbered 1 to 6 from left to right and top to bottom. In regions 1 and 6 the three classifiers all agree. Region 2 is blue because the first two classifiers (weights 0.42 and 0.65) vote blue while the third (weight 0.92) votes non-blue; since 0.42+0.65=1.07>0.92, the region is labeled blue. The other regions follow the same reasoning; a quick check is shown below.
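The region-2 vote as plain arithmetic (votes encoded as +1 for blue, -1 for non-blue):

import numpy as np

alphas = np.array([0.42, 0.65, 0.92])   # classifier weights from the figure
votes = np.array([1, 1, -1])            # region 2: the first two classifiers vote blue
print(np.sign(alphas @ votes))          # 1.0, since 0.42 + 0.65 = 1.07 > 0.92 -> blue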
Below we train an AdaBoost classifier on the Wine dataset, where each base classifier is a single-level decision tree; X and y are assumed to come from sklearn's built-in copy of the dataset, since the original snippet presumes they already exist.
from sklearn.datasets import load_wine   # assumed data source; not shown in the original snippet
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
wine = load_wine()
X, y = wine.data, wine.target
le = LabelEncoder()
y = le.fit_transform(y)                  # encode class labels as consecutive integers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
# a decision stump: a tree of depth 1
tree = DecisionTreeClassifier(criterion='entropy', max_depth=1)
# 500 boosting rounds on top of the stump
ada = AdaBoostClassifier(base_estimator=tree,
                         n_estimators=500,
                         learning_rate=0.1,
                         random_state=0)
# baseline: the single stump on its own
tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)
print('Decision tree train/test accuracies: %.3f/%.3f'
% (accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)))
# the boosted ensemble
ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)
print('Adaboost tree train/test accuracies: %.3f/%.3f'
% (accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)))
Decision tree train/test accuracies: 0.604/0.563
Adaboost tree train/test accuracies: 0.887/0.845
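One way to watch the ensemble improve over the boosting rounds is sklearn's staged_predict, which yields the ensemble's prediction after each iteration (a short sketch reusing the variables above):

# test accuracy after each boosting round
staged_acc = [accuracy_score(y_test, y_pred)
              for y_pred in ada.staged_predict(X_test)]
best = max(staged_acc)
print('best test accuracy %.3f at round %d' % (best, staged_acc.index(best) + 1))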