AdaBoost in Ensemble Learning

Latest recommended article published on 2025-08-22 18:36:08

License: CC 4.0 BY-SA

Tags: #ensemble-learning #AdaBoost #python

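This article covers the basic concepts of ensemble learning and introduces its two main families of methods: sequential methods (Boosting) and parallel methods (Bagging and random forests). It focuses on the principle and procedure of the AdaBoost algorithm, including how the training-data weights are adjusted, how the weak classifiers are combined, and a concrete implementation.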

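These are my study notes on the ensemble-method material in 《统计学习方法》 (Statistical Learning Methods), 《机器学习实战》 (Machine Learning in Action), and 《机器学习》 (Machine Learning), kept for later review.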

Introduction to ensemble learning

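The basic idea of ensemble learning: complete a learning task by building and combining multiple learners.
The general structure of ensemble learning: first generate a set of "individual learners", then combine them with some strategy.
In plain terms, picture a decision-making scenario: a decision made jointly by several experts is usually more accurate than one made by a single expert. Each individual learner plays the role of one expert, and weighing the opinions of several experts before deciding corresponds to combining different individual learners to obtain the final result.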
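The "several experts" intuition can be checked with a small calculation. The sketch below assumes that the voters are independent and that each is correct with the same probability `p`; both are idealizations that real learners rarely satisfy. The function name `majority_vote_accuracy` is introduced here for illustration only.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a simple majority of n independent voters is correct.

    Assumes every voter is correct with probability p and that n is odd,
    so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Voters slightly better than chance improve with ensemble size;
# voters worse than chance get worse.
for n in (1, 5, 21):
    print(n, majority_vote_accuracy(0.6, n), majority_vote_accuracy(0.4, n))
```

Under these assumptions the ensemble only helps when each voter is better than random guessing, which is why individual learners are required to be both reasonably accurate and different from one another.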

About individual learners

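An individual learner is also called a base learner. The base learner is usually a weak learner, because an algorithm that produces a weak learner is generally much easier to obtain than one that produces a strong learner.
Individual learners must have the following two properties:
1. Accuracy: each individual learner must be reasonably accurate; it cannot be too bad.
2. Diversity: the learners should differ from one another.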

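Given these properties, the core question of ensemble learning research is how to produce individual learners that are "good and different".
Depending on how the individual learners are generated, current ensemble learning methods fall roughly into two categories:

Sequential methods, in which the individual learners depend strongly on one another and must be generated serially. The representative is Boosting.
Parallel methods, in which the individual learners have no strong dependence on one another and can be generated at the same time. The representatives are Bagging and Random Forest.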

The AdaBoost algorithm

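Boosting has many variants; AdaBoost is the most popular one.
A boosting method has to answer two questions:
1. How to change the weights (or probability distribution) of the training data in each round;
2. How to combine the weak classifiers into a strong classifier.
AdaBoost answers these two questions as follows:
1. It raises the weights of the samples misclassified by the previous round's weak classifier and lowers the weights of the correctly classified samples. Because their weights grow, the misclassified samples receive more attention from the weak classifier in the next round.
2. It combines the weak classifiers by weighted majority voting: a weak classifier with a small classification error rate gets a larger weight and therefore a larger say in the vote, while a weak classifier with a large classification error rate gets a smaller weight and a smaller say.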

The AdaBoost procedure

Input: a training dataset $T=\{(x_{1},y_{1}),(x_{2},y_{2}),...,(x_{N},y_{N})\}$, where $x_{i}\in \mathit{X} \subseteq \boldsymbol{R}^{n}$ and $y_{i}\in \mathit{Y}=\{-1,+1\}$; a weak learning algorithm.
Output: the final classifier $G(x)$.
(1) Initialize the weight distribution over the training data:
$D_{1} = (w_{11},...,w_{1i},...,w_{1N}),\, w_{1i} = \frac{1}{N},\, i=1,2,...,N$
(2) For $m = 1, 2, ..., M$:
(a) Train on the dataset weighted by the distribution $D_{m}$ to obtain the base classifier
$G_{m}(x) : \mathit{X}\rightarrow\{-1,+1\}$
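(b) Compute the classification error rate of $G_{m}(x)$ on the training dataset:
$e_{m} = P(G_{m}(x_{i})\neq y_{i}) = \sum^{N}_{i=1}w_{mi}I(G_{m}(x_{i})\neq y_{i}) = \sum_{G_{m}(x_{i}) \neq y_{i}}w_{mi}$
(Here $w_{mi}$ is the weight of the $i$-th instance in round $m$, and $\sum_{i=1}^{N}w_{mi}=1$. The formula shows that the classification error rate of $G_{m}(x)$ on the weighted dataset is the sum of the weights of the samples that $G_{m}(x)$ misclassifies.)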
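As a concrete check of the weighted error rate, the sketch below evaluates $e_{m}$ for five samples with uniform weights and one misclassification. The helper name `weighted_error` and the use of plain NumPy arrays are choices made for this illustration, not part of the algorithm statement.

```python
import numpy as np


def weighted_error(weights, predictions, labels):
    """Sum of the weights of the misclassified samples (weights sum to 1)."""
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(predictions) != np.asarray(labels)
    return float(weights[misclassified].sum())


labels = np.array([1, 1, -1, -1, 1])
predictions = np.array([-1, 1, -1, -1, 1])  # only the first sample is wrong
weights = np.full(5, 0.2)                   # uniform initial distribution D_1

e_m = weighted_error(weights, predictions, labels)
print(e_m)
```

The same mistake costs more or less depending on the current weights: with weights `[0.5, 0.125, 0.125, 0.125, 0.125]`, misclassifying only the first sample would already give an error rate of 0.5.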
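(c) Compute the coefficient of $G_{m}(x)$ (the logarithm is the natural logarithm):
$\alpha_{m} = \frac{1}{2} \log \frac{1-e_{m}}{e_{m}}$
($\alpha_{m}$ expresses the importance of $G_{m}(x)$ in the final classifier. $\alpha_{m}$ increases as $e_{m}$ decreases, so a base classifier with a smaller classification error rate plays a larger role in the final classifier. Note that $\alpha_{m} \geq 0$ exactly when $e_{m} \leq \frac{1}{2}$.)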
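A few values make the relationship between the error rate and the coefficient visible. This is a minimal sketch; the function name `classifier_coefficient` is made up for the example, and the `eps` guard against division by zero is an assumption that mirrors the `max(error, 1e-16)` used in the training code later in this article.

```python
import math


def classifier_coefficient(error: float, eps: float = 1e-16) -> float:
    """alpha = 0.5 * ln((1 - e) / e), with e clipped away from zero."""
    return 0.5 * math.log((1.0 - error) / max(error, eps))


# Smaller error -> larger coefficient; error 0.5 (random guessing) -> 0.
for e in (0.1, 0.2, 0.4, 0.5):
    print(e, classifier_coefficient(e))
```

An error rate above 0.5 yields a negative coefficient, which amounts to inverting that classifier's vote.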
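(d) Update the weight distribution over the training dataset:
$D_{m+1} = (w_{m+1,1},...,w_{m+1,i},...,w_{m+1,N})$
$w_{m+1,i} = \frac{w_{mi}}{Z_{m}}\exp(-\alpha_{m}y_{i}G_{m}(x_{i})),\, i = 1,2,...,N$
Here $Z_{m}$ is the normalization factor:
$Z_{m} = \sum^{N}_{i=1}w_{mi}\exp(-\alpha_{m}y_{i}G_{m}(x_{i}))$
The update can also be written as:
$w_{m+1,i} = \left\{\begin{matrix} \frac{w_{mi}}{Z_{m}} e^{-\alpha_{m}}& , G_{m}(x_{i}) = y_{i}\\ \frac{w_{mi}}{Z_{m}}e^{\alpha_{m}}& , G_{m}(x_{i}) \neq y_{i} \end{matrix}\right.$
(This form shows that the weights of the samples misclassified by the base classifier $G_{m}(x)$ are increased, while the weights of the correctly classified samples are decreased. The misclassified samples therefore play a larger role in the next round of learning.)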
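One round of the weight update can be traced numerically. The sketch below starts from uniform weights over five samples, takes a base classifier that misclassifies only the first sample (so $e_{m} = 0.2$), and applies the update rule. The function name `update_weights` is illustrative.

```python
import numpy as np


def update_weights(weights, alpha, labels, predictions):
    """w_{m+1,i} = w_{mi} * exp(-alpha * y_i * G_m(x_i)) / Z_m."""
    unnormalized = weights * np.exp(-alpha * labels * predictions)
    return unnormalized / unnormalized.sum()  # dividing by Z_m


labels = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
predictions = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])  # first sample is wrong
weights = np.full(5, 0.2)

alpha = 0.5 * np.log((1 - 0.2) / 0.2)  # coefficient for error rate 0.2
new_weights = update_weights(weights, alpha, labels, predictions)
print(new_weights)
```

The misclassified sample ends up with weight 0.5 and each correctly classified sample with 0.125, so after the update the base classifier's weighted error under the new distribution is exactly 0.5.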
(3) Build the linear combination of the base classifiers:
$f(x) = \sum_{m=1}^{M} \alpha _{m}G_{m}(x)$
The final classifier is:
$G(x) = sign(f(x)) = sign(\sum^{M}_{m=1}\alpha_{m}G_{m}(x))$
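The combination step is just a weighted sum followed by a sign. The sketch below uses three base classifiers on five samples; the per-round predictions match the `classEst` rows printed by the worked example at the end of this article, and the coefficients are derived from weighted error rates of 1/5, 1/8, and 1/7, which is an assumption consistent with that output.

```python
import numpy as np

labels = np.array([1, 1, -1, -1, 1])

# Row m holds G_m(x_i) for every sample i.
base_predictions = np.array([
    [-1, 1, -1, -1,  1],
    [ 1, 1, -1, -1, -1],
    [ 1, 1,  1,  1,  1],
])

# alpha_m = 0.5 * ln((1 - e_m) / e_m) for e_m = 1/5, 1/8, 1/7.
alphas = 0.5 * np.log(np.array([4.0, 7.0, 6.0]))

f = alphas @ base_predictions  # f(x_i) = sum_m alpha_m * G_m(x_i)
G = np.sign(f)                 # final classifier output

print(f)
print(G)
```

No single base classifier labels all five samples correctly, yet the sign of the weighted sum does.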

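AdaBoost worked example

This example implements AdaBoost with a decision stump (single-level decision tree) as the base learner. First we create the following dataset and its labels.

1. Build the initial dataset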

import numpy as np


def loadSimpData():
    """Build the initial dataset: five 2-D points and their +1/-1 labels."""
    datMat = np.matrix([[ 1. ,  2.1],
        [ 2. ,  1.1],
        [ 1.3,  1. ],
        [ 1. ,  1. ],
        [ 2. ,  1. ]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels

datMat,classLabels = loadSimpData()

2. Functions that build the decision stump

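def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    '''
    Classify the data by comparing one feature against a threshold.
    dataMatrix: np.matrix of shape (m, n), one sample per row
    dimen: index of the feature to compare; 0 means a line x = a, 1 means a line y = b
    threshVal: float, the threshold
    threshIneq: 'lt' or 'gt'. With 'lt', samples whose feature value is <= threshVal
                get -1 and the rest get +1; with 'gt', samples whose feature value
                is > threshVal get -1 and the rest get +1.
    '''
    # Column vector of shape (m, 1), initialized to +1 for every sample
    retArray = np.ones((np.shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':  # values at or below the threshold become -1
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:  # values above the threshold become -1
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray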

def buildStump(dataArr, classLabels, D):
    '''
    Find the decision stump with the lowest weighted error.
    dataArr: the data points to classify
    classLabels: the label of each data point
    D: weight vector over the data, shape (m, 1)
    Returns the best stump (dict), its weighted error, and its predictions.
    '''
    dataMatrix = np.mat(dataArr)
    labelMat = np.mat(classLabels).T
    m, n = np.shape(dataMatrix)
    numSteps = 10.0
    bestStump = {}
    bestClasEst = np.mat(np.zeros((m, 1)))
    minError = float('inf')
    for i in range(n):  # first loop: iterate over every feature
        rangeMin = dataMatrix[:, i].min()
        rangeMax = dataMatrix[:, i].max()
        # the step size follows from the feature's minimum and maximum
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):  # second loop: iterate over the candidate thresholds
            for inequal in ['lt', 'gt']:  # third loop: switch between the two inequalities
                threshVal = (rangeMin + float(j) * stepSize)
                predictedVals = stumpClassify(dataMatrix, i, threshVal, inequal)
                errArr = np.mat(np.ones((m, 1)))  # 1 marks a misclassified sample
                errArr[predictedVals == labelMat] = 0
                # weighted error: sum of the weights of the misclassified samples,
                # extracted from the 1x1 matrix as a plain scalar
                weightedError = (D.T * errArr)[0, 0]
#                print("split: dim %d, thresh %.2f, thresh inequal: %s,\
#                      the weighted error is %.3f" % (i,threshVal,inequal, weightedError))
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst

3. AdaBoost training with decision stumps

def adaBoostTrainDS(dataArr, classLabels, numIt = 40):
    '''
    dataArr: training dataset
    classLabels: labels of the data
    numIt: maximum number of iterations
    '''
    weakClassArr = []  # list of weak classifiers
    m = np.shape(dataArr)[0]
    D = np.mat(np.ones((m,1))/m)  # initial weights: 1/m for every sample
    aggClassEst = np.mat(np.zeros((m,1)))  # running value of the weighted sum f(x)
    for i in range(numIt):
        bestStump, error ,classEst = buildStump(dataArr, classLabels, D)
        print("D:",D.T)
        # Compute the coefficient of the weak classifier;
        # max(error, 1e-16) avoids division by zero when the error is 0
        alpha = float(0.5 * np.log((1.0 - error)/ max(error,1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        print("classEst:",classEst.T)
        # Compute D, the weights of the training samples, for the next iteration
        expon = np.multiply(-1*alpha*np.mat(classLabels).T,classEst)
        D = np.multiply(D,np.exp(expon))
        D = D/D.sum()
        aggClassEst += alpha * classEst
        print("aggClassEst:",aggClassEst.T)
        # Compute the training error rate of the combined classifier
        aggErrors = np.multiply(np.sign(aggClassEst) != np.mat(classLabels).T, np.ones((m,1)))
        errorRate = aggErrors.sum() /m
        print("total error:",errorRate,"\n")
        if errorRate == 0.0:  # stop early once every training sample is classified correctly
            break
    return weakClassArr

# The training error reaches zero after three iterations
classifierArr = adaBoostTrainDS(datMat, classLabels)

D: [[ 0.2  0.2  0.2  0.2  0.2]]
classEst: [[-1.  1. -1. -1.  1.]]
aggClassEst: [[-0.69314718  0.69314718 -0.69314718 -0.69314718  0.69314718]]
total error: 0.2 

D: [[ 0.5    0.125  0.125  0.125  0.125]]
classEst: [[ 1.  1. -1. -1. -1.]]
aggClassEst: [[ 0.27980789  1.66610226 -1.66610226 -1.66610226 -0.27980789]]
total error: 0.2 

D: [[ 0.28571429  0.07142857  0.07142857  0.07142857  0.5       ]]
classEst: [[ 1.  1.  1.  1.  1.]]
aggClassEst: [[ 1.17568763  2.56198199 -0.77022252 -0.77022252  0.61607184]]
total error: 0.0
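Once training has finished, the stored stumps can classify new points by evaluating each stump and taking the sign of the alpha-weighted sum. The block below is a self-contained sketch that uses plain NumPy arrays instead of `np.matrix`. The helper names `stump_predict` and `ada_classify` are introduced here, and the three hard-coded stumps are assumed values chosen to reproduce the `classEst` rows printed above; they are not read from an actual `classifierArr`.

```python
import numpy as np


def stump_predict(X, dim, thresh, ineq):
    """Decision stump: -1 on one side of the threshold, +1 on the other."""
    X = np.asarray(X, dtype=float)
    pred = np.ones(X.shape[0])
    if ineq == 'lt':
        pred[X[:, dim] <= thresh] = -1.0
    else:
        pred[X[:, dim] > thresh] = -1.0
    return pred


def ada_classify(X, stumps):
    """Sign of the alpha-weighted sum of all stump predictions."""
    X = np.asarray(X, dtype=float)
    agg = np.zeros(X.shape[0])
    for s in stumps:
        agg += s['alpha'] * stump_predict(X, s['dim'], s['thresh'], s['ineq'])
    return np.sign(agg)


# Assumed stumps, consistent with the three rounds of output above.
stumps = [
    {'dim': 0, 'thresh': 1.3, 'ineq': 'lt', 'alpha': 0.5 * np.log(4.0)},
    {'dim': 1, 'thresh': 1.0, 'ineq': 'lt', 'alpha': 0.5 * np.log(7.0)},
    {'dim': 0, 'thresh': 0.9, 'ineq': 'lt', 'alpha': 0.5 * np.log(6.0)},
]

X_train = [[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]]
print(ada_classify(X_train, stumps))             # predictions on the training points
print(ada_classify([[0.0, 0.0], [5.0, 5.0]], stumps))  # two unseen points
```

With the real training result, the same function would be called as `ada_classify(points, classifierArr)`, since each trained stump dictionary carries the `dim`, `thresh`, `ineq`, and `alpha` keys.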