AdaBoost's objective is to minimize the exponential loss $\mathcal{L}(H|D)=E_{\bm{x}\sim D}[e^{-f(\bm{x})H(\bm{x})}]$,
where $D$ is the initial weight distribution over the samples $\bm{x}$, $f(\bm{x})\in\{-1,+1\}$ is the true label of sample $\bm{x}$, $H(\bm{x})=\sum_{t=1}^{T}\alpha_t h_t(\bm{x})$ is a linear combination of the base learners $h_t(\bm{x})$, and $T$ is the number of base learners.
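As a quick numerical illustration (a minimal sketch; the arrays `y`, `base_preds`, `alphas`, and `D` are hypothetical stand-ins, not notation from the derivation), the loss above can be evaluated directly:

```python
import numpy as np

def exp_loss(y, base_preds, alphas, D):
    """Exponential loss E_{x~D}[exp(-f(x) * H(x))] of the ensemble.

    y          : (n,) true labels f(x) in {-1, +1}
    base_preds : (T, n) base-learner predictions h_t(x) in {-1, +1}
    alphas     : (T,) learner weights alpha_t
    D          : (n,) sample distribution (non-negative, sums to 1)
    """
    H = alphas @ base_preds            # H(x) = sum_t alpha_t * h_t(x)
    return np.sum(D * np.exp(-y * H))  # expectation under D
```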
When training the $t$-th base learner, the first $t-1$ base learners have already been obtained, so with $H_t(\bm{x})=H_{t-1}(\bm{x})+\alpha_t h_t(\bm{x})$ we minimize $\mathcal{L}(H_t|D)$: $(h_t,\alpha_t)=\arg \min \limits_{h_t(\bm{x}),\alpha_t}E_{\bm{x}\sim D}[e^{-f(\bm{x})H_t(\bm{x})}]=\arg \min \limits_{h_t(\bm{x}),\alpha_t}E_{\bm{x}\sim D}[e^{-f(\bm{x})H_{t-1}(\bm{x})-f(\bm{x})\alpha_t h_t(\bm{x})}]$.
In fact, define the round-$t$ distribution $D_t(\bm{x})=\frac{D(\bm{x})e^{-f(\bm{x})H_{t-1}(\bm{x})}}{E_{\bm{x}\sim D}[e^{-f(\bm{x})H_{t-1}(\bm{x})}]}$, where the denominator normalizes $D_t$ into a valid distribution.
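In code, $D_t$ is simply the initial weights reweighted by the current margins and renormalized; a sketch under the same assumed arrays as above:

```python
def round_distribution(y, H_prev, D):
    """D_t(x) = D(x) * exp(-f(x) * H_{t-1}(x)) / normalizer.

    H_prev : (n,) output H_{t-1}(x) of the first t-1 base learners
    """
    w = D * np.exp(-y * H_prev)  # unnormalized weights D(x) * exp(-f * H_{t-1})
    return w / w.sum()           # the sum equals E_{x~D}[exp(-f * H_{t-1})]
```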
Then $E_{\bm{x}\sim D}[e^{-f(\bm{x})H_{t-1}(\bm{x})-f(\bm{x})\alpha_t h_t(\bm{x})}]=\sum_{\bm{x}}D(\bm{x})e^{-f(\bm{x})H_{t-1}(\bm{x})}\cdot e^{-f(\bm{x})\alpha_t h_t(\bm{x})}=E_{\bm{x}\sim D}[e^{-f(\bm{x})H_{t-1}(\bm{x})}]\cdot \sum_{\bm{x}}\frac{D(\bm{x})e^{-f(\bm{x})H_{t-1}(\bm{x})}}{E_{\bm{x}\sim D}[e^{-f(\bm{x})H_{t-1}(\bm{x})}]}\cdot e^{-f(\bm{x})\alpha_t h_t(\bm{x})}=E_{\bm{x}\sim D}[e^{-f(\bm{x})H_{t-1}(\bm{x})}]\cdot E_{\bm{x}\sim D_t}[e^{-f(\bm{x})\alpha_t h_t(\bm{x})}]$. Since $E_{\bm{x}\sim D}[e^{-f(\bm{x})H_{t-1}(\bm{x})}]$ is a constant that does not depend on $h_t$ or $\alpha_t$, $\arg \min \limits_{h_t(\bm{x}),\alpha_t}E_{\bm{x}\sim D}[e^{-f(\bm{x})H_t(\bm{x})}]=\arg \min \limits_{h_t(\bm{x}),\alpha_t}E_{\bm{x}\sim D_t}[e^{-f(\bm{x})\alpha_t h_t(\bm{x})}]$.
That is, the problem reduces to minimizing the exponential loss of the $t$-th base learner under $D_t$. Let $\epsilon_t=E_{\bm{x}\sim D_t}[\mathbb{I}(f(\bm{x})\neq h_t(\bm{x}))]$ denote the error rate of the $t$-th base learner; since $f(\bm{x}),h_t(\bm{x})\in\{-1,+1\}$, we have $f(\bm{x})h_t(\bm{x})=1-2\,\mathbb{I}(f(\bm{x})\neq h_t(\bm{x}))$.
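The identity holds because $f(\bm{x})h_t(\bm{x})$ is $+1$ on correct predictions and $-1$ on mistakes; a two-line numerical check over labels in $\{-1,+1\}$:

```python
for f in (-1, +1):
    for h in (-1, +1):
        assert f * h == 1 - 2 * int(f != h)  # identity f*h = 1 - 2*I(f != h)
```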
Therefore $E_{\bm{x}\sim D_t}[e^{-f(\bm{x})\alpha_t h_t(\bm{x})}]=E_{\bm{x}\sim D_t}[e^{\alpha_t(2\mathbb{I}(f(\bm{x})\neq h_t(\bm{x}))-1)}]=E_{\bm{x}\sim D_t}\big[e^{-\alpha_t}\big(e^{2\alpha_t}\mathbb{I}(f(\bm{x})\neq h_t(\bm{x}))+\mathbb{I}(f(\bm{x})=h_t(\bm{x}))\big)\big]=e^{-\alpha_t}\big(e^{2\alpha_t}\epsilon_t+1-\epsilon_t\big)=(e^{\alpha_t}-e^{-\alpha_t})\epsilon_t+e^{-\alpha_t}$. For any fixed $\alpha_t>0$ the coefficient $e^{\alpha_t}-e^{-\alpha_t}$ is positive, so minimizing over $h_t$ is exactly minimizing the error rate $\epsilon_t$ of the $t$-th base classifier. Given $\epsilon_t$, let $l(\alpha_t)=(e^{\alpha_t}-e^{-\alpha_t})\epsilon_t+e^{-\alpha_t}$; setting $\frac{\partial{l}}{\partial{\alpha_t}}=(e^{\alpha_t}+e^{-\alpha_t})\epsilon_t-e^{-\alpha_t}=0$ yields $\alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
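Putting the pieces together, one boosting round follows directly from these formulas. The sketch below assumes the $t$-th learner has already been fit and we hold its training-set predictions `h_pred` (a hypothetical array); the update $D_{t+1}(\bm{x})\propto D_t(\bm{x})e^{-\alpha_t f(\bm{x})h_t(\bm{x})}$ follows from the definition of $D_t$ above:

```python
def boosting_round(y, h_pred, D_t):
    """Given h_t's predictions, compute eps_t, alpha_t, and D_{t+1}.

    y      : (n,) true labels in {-1, +1}
    h_pred : (n,) predictions h_t(x) in {-1, +1}
    D_t    : (n,) current sample distribution
    """
    eps = np.sum(D_t * (h_pred != y))      # eps_t = E_{x~D_t}[I(f != h_t)]
    alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) ln((1-eps)/eps)
    w = D_t * np.exp(-alpha * y * h_pred)  # reweight by the new margin term
    return eps, alpha, w / w.sum()         # renormalize to obtain D_{t+1}
```

Starting from the uniform distribution $D_1(\bm{x})=1/n$ and iterating this round $T$ times reproduces the standard AdaBoost weight schedule.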