Machine Learning, Chapter 8: Ensemble Learning - 优快云 Blog

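Boosting is a family of algorithms that can turn weak learners into strong learners. It first trains a base learner on the initial training set, then adjusts the distribution of the training samples according to that learner's performance, so that the samples it got wrong receive more attention in later rounds; the next base learner is then trained on the adjusted distribution. This is repeated until the number of base learners reaches a pre-specified value T, and finally the T base learners are combined with weights.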

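AdaBoost is the most representative Boosting algorithm. One of the easier ways to derive it is to view the ensemble as a linear combination of base learners: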

$H(\boldsymbol{x})=\sum_{t=1}^{T} \alpha_{t} h_{t}(\boldsymbol{x})$
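The reweighting loop described above can be sketched from scratch. This is a minimal illustration rather than a reference implementation: it assumes labels in {-1, +1}, uses scikit-learn decision stumps as base learners, and takes the classical weight $\alpha_{t}=\frac{1}{2} \ln \frac{1-\epsilon_{t}}{\epsilon_{t}}$, none of which is fixed by the text above. The function names `adaboost_fit` and `adaboost_predict` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    """Train up to T decision stumps; y must use labels -1 and +1."""
    m = len(y)
    D = np.full(m, 1.0 / m)                 # initial sample distribution: uniform
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)    # train on the current distribution
        pred = stump.predict(X)
        eps = np.sum(D[pred != y])          # weighted error under D
        if eps >= 0.5:                      # no better than random guessing: stop
            break
        eps = max(eps, 1e-10)               # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # raise the weight of misclassified samples
        D = D / D.sum()                     # renormalize to a distribution
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    """Sign of H(x) = sum_t alpha_t * h_t(x)."""
    H = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(H)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels in {-1, +1}

learners, alphas = adaboost_fit(X, y, T=20)
train_acc = np.mean(adaboost_predict(learners, alphas, X) == y)
print(f"stumps: {len(learners)}, training accuracy: {train_acc:.3f}")
```

Each stump can only split on one attribute, so a single stump cannot represent the diagonal boundary used here; the weighted sum of stumps is what gets closer to it.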

The following simple example shows how to use AdaBoost to classify a data set:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a sample data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the AdaBoost classifier
adaboost_clf = AdaBoostClassifier(n_estimators=50, random_state=42)

# Train the AdaBoost classifier
adaboost_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = adaboost_clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Adaboost Classifier: {accuracy}")

8.3 Bagging and Random Forest

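To obtain an ensemble with strong generalization performance, the individual learners in the ensemble should be as independent of one another as possible. Full independence cannot be achieved in practice, but we can try to make the base learners differ as much as possible, for example by training each one on a different sampled subset of the training data. At the same time, each base learner should still be reasonably good: if the subsets were completely disjoint, each learner would see only a small part of the data. We can therefore consider using sampled subsets that overlap with one another.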

8.3.1 Bagging

Bagging is the best-known representative of parallel ensemble learning methods. The Bagging algorithm is described in the figure below:

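Assuming the computational complexity of a base learner is O(m), the complexity of Bagging is roughly T(O(m) + O(s)), where O(s) is the cost of the sampling and voting/averaging steps. Since O(s) is small and T is usually a modest constant, training a Bagging ensemble has the same order of complexity as training a single base learner, which shows that Bagging is a very efficient ensemble learning algorithm.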
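The Bagging procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: base learners are scikit-learn decision trees, each trained on a bootstrap sample of the same size as the training set, and predictions are combined by simple voting on 0/1 labels. The helper names `bagging_fit` and `bagging_predict` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Train T trees, each on a bootstrap sample of size m drawn with replacement."""
    rng = np.random.default_rng(seed)
    m = len(y)
    learners, samples = [], []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)      # bootstrap sampling (the cheap O(s) part)
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[idx], y[idx])              # base learner training (the O(m) part)
        learners.append(tree)
        samples.append(idx)
    return learners, samples

def bagging_predict(learners, X):
    """Combine the base learners by simple voting on labels 0/1."""
    votes = np.stack([h.predict(X) for h in learners])   # shape (T, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

learners, samples = bagging_fit(X, y)
acc = np.mean(bagging_predict(learners, X) == y)
unique_frac = np.mean([len(np.unique(idx)) / len(y) for idx in samples])
print(f"training accuracy: {acc:.3f}, unique samples per bootstrap: {unique_frac:.3f}")
```

Each bootstrap sample contains roughly 63% of the distinct training examples, so the subsets overlap while still differing from one another. The loop iterations are independent of each other, which is why Bagging parallelizes easily.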

8.3.2 Random Forest (RF)

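Random Forest is an extended variant of Bagging. It further introduces random attribute selection into the training of the decision trees. A traditional decision tree chooses the split attribute as the optimal one among all attributes of the current node (say there are d of them). In Random Forest, for each node of a base decision tree, a subset of k attributes is first selected at random from that node's attribute set, and the optimal split attribute is then chosen from this subset. In general, the recommended value is $k=\log_{2} d$.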
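The node-level attribute selection can be illustrated with a short sketch. It assumes Gini impurity as the split criterion and axis-aligned threshold splits, which the text does not prescribe; `best_split_on_random_subset` is a hypothetical helper, not a library function.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_on_random_subset(X, y, rng):
    """Pick k = log2(d) random attributes, then the best threshold split among them."""
    d = X.shape[1]
    k = max(1, int(np.log2(d)))
    candidates = rng.choice(d, size=k, replace=False)   # random attribute subset
    best = (None, None, np.inf)                         # (attribute, threshold, impurity)
    for j in candidates:
        for thr in np.unique(X[:, j])[:-1]:             # skip the max so both sides are non-empty
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (int(j), float(thr), float(score))
    return candidates, best

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                  # d = 16 attributes, so k = 4
y = (X[:, 3] > 0).astype(int)

candidates, (attr, thr, score) = best_split_on_random_subset(X, y, rng)
print(f"candidates: {sorted(candidates.tolist())}, chosen: {attr}, impurity: {score:.3f}")
```

If the truly informative attribute is not among the k candidates, the node has to settle for a weaker split. This is the attribute perturbation that makes individual trees differ from one another.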

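Random Forest is simple, easy to implement, and computationally cheap. However, its initial performance is often relatively poor, especially when the ensemble contains only one base learner. This is easy to understand: introducing attribute perturbation tends to lower the performance of the individual learners in a Random Forest.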

The following code uses the Random Forest algorithm for a classification task:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a sample data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest classifier
random_forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Random Forest classifier
random_forest_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = random_forest_clf.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Random Forest Classifier: {accuracy}")

The experimental results are as follows:

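8.4 Combination Strategies

Combining learners can bring benefits in three respects:

1. Statistical: several hypotheses may perform equally well on the training set, so a single learner may generalize poorly because the wrong one was chosen. Combining multiple learners reduces this risk.

2. Computational: learning algorithms often get stuck in local minima. Combining the results of multiple runs lowers the risk of ending up in a poor local minimum.

3. Representational: combining multiple learners enlarges the corresponding hypothesis space, so a better approximation may be learned.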

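8.4.1 Averaging

Simple averaging: $H(\boldsymbol{x})=\frac{1}{T} \sum_{i=1}^{T} h_{i}(\boldsymbol{x})$

Weighted averaging: $H(\boldsymbol{x})=\sum_{i=1}^{T} w_{i} h_{i}(\boldsymbol{x})$, where the weights usually satisfy $w_{i} \geq 0$ and $\sum_{i=1}^{T} w_{i}=1$.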
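A small numeric example may make the two averaging rules concrete. The learner outputs and the weights below are made up for illustration; only the two formulas come from the text.

```python
import numpy as np

# Outputs of T = 3 hypothetical base regressors on 4 examples: shape (T, n_samples)
h = np.array([[2.0, 0.5, 1.0, 3.0],
              [2.4, 0.7, 0.8, 2.6],
              [1.6, 0.3, 1.2, 3.4]])

simple_avg = h.mean(axis=0)            # H(x) = (1/T) * sum_i h_i(x)

w = np.array([0.5, 0.3, 0.2])          # assumed weights: non-negative, summing to 1
weighted_avg = w @ h                   # H(x) = sum_i w_i * h_i(x)

print("simple:  ", simple_avg)
print("weighted:", weighted_avg)
```

Simple averaging is the special case of weighted averaging with $w_{i}=1/T$.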

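8.4.2 Voting

Majority voting: $H(\boldsymbol{x})=\left\{\begin{array}{ll} c_{j}, & \text { if } \sum_{i=1}^{T} h_{i}^{j}(\boldsymbol{x})>0.5 \sum_{k=1}^{N} \sum_{i=1}^{T} h_{i}^{k}(\boldsymbol{x}) ; \\ \text { reject, } & \text { otherwise. } \end{array}\right.$

Here $h_{i}^{j}(\boldsymbol{x})$ is the output of $h_{i}$ on class label $c_{j}$, and N is the number of class labels. That is, if a label receives more than half of the votes, it is the prediction; otherwise the prediction is rejected.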

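Plurality voting: $H(\boldsymbol{x})=c_{\underset{j}{\arg \max } \sum_{i=1}^{T} h_{i}^{j}(\boldsymbol{x})} .$

That is, the prediction is the label with the most votes; if several labels tie for the most votes, one of them is chosen at random.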

Weighted voting: $H(\boldsymbol{x})=c_{\underset{j}{\arg \max } \sum_{i=1}^{T} w_{i} h_{i}^{j}(\boldsymbol{x})} .$
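The three voting rules can be compared on one set of votes. The sketch below assumes hard (class-label) votes encoded as integers 0..N-1; the function names, the votes, and the weights are hypothetical.

```python
import numpy as np

def majority_vote(votes, n_classes):
    """Return the label with more than half of the votes, or None to reject."""
    counts = np.bincount(votes, minlength=n_classes)
    j = int(np.argmax(counts))
    return j if counts[j] > 0.5 * counts.sum() else None

def plurality_vote(votes, n_classes, rng):
    """Return the label with the most votes, breaking ties at random."""
    counts = np.bincount(votes, minlength=n_classes)
    winners = np.flatnonzero(counts == counts.max())
    return int(rng.choice(winners))

def weighted_vote(votes, weights, n_classes):
    """Return the label with the largest total weight."""
    totals = np.bincount(votes, weights=weights, minlength=n_classes)
    return int(np.argmax(totals))

rng = np.random.default_rng(0)
votes = np.array([0, 1, 1, 2, 0])                  # labels predicted by T = 5 classifiers
weights = np.array([0.1, 0.2, 0.2, 0.45, 0.05])    # assumed classifier weights

print(majority_vote(votes, 3))          # no label has more than half of the votes
print(plurality_vote(votes, 3, rng))    # labels 0 and 1 tie, one is picked at random
print(weighted_vote(votes, weights, 3)) # label 2 wins on weight despite a single vote
```

The example is chosen so that the three rules disagree: majority voting rejects, plurality voting has to break a tie, and weighted voting follows the single most trusted classifier.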

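8.4.3 Learning-Based Combination

When a lot of training data is available, a more powerful combination strategy is the "learning method", in which the combination itself is learned by another learner. Stacking is a typical representative.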
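As a hedged illustration of Stacking, the sketch below uses scikit-learn's `StackingClassifier`. The choice of first-level learners (a random forest and k-nearest neighbors), the logistic-regression meta-learner, and the synthetic data set are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# First-level learners; their cross-validated outputs become the input
# features of the second-level (meta) learner.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
stacking_clf = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),   # meta-learner that does the combining
    cv=5,                                   # out-of-fold predictions train the meta-learner
)
stacking_clf.fit(X_train, y_train)
stacking_acc = stacking_clf.score(X_test, y_test)
print(f"Accuracy of stacking classifier: {stacking_acc:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for samples the first-level learners were fitted on, is the usual way to limit overfitting in Stacking.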

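8.5 Diversity

8.5.1 Error-Ambiguity Decomposition

Suppose the ensemble H is the weighted average of the individual learners on a regression task. For an example x, the "ambiguity" of learner $h_{i}$ is defined as: $A\left(h_{i} \mid \boldsymbol{x}\right)=\left(h_{i}(\boldsymbol{x})-H(\boldsymbol{x})\right)^{2},$

The ambiguity of the ensemble is then: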

$\begin{aligned} \bar{A}(h \mid \boldsymbol{x}) & =\sum_{i=1}^{T} w_{i} A\left(h_{i} \mid \boldsymbol{x}\right) \\ & =\sum_{i=1}^{T} w_{i}\left(h_{i}(\boldsymbol{x})-H(\boldsymbol{x})\right)^{2} \end{aligned}$

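Similarly, with $E\left(h_{i} \mid \boldsymbol{x}\right)=\left(f(\boldsymbol{x})-h_{i}(\boldsymbol{x})\right)^{2}$ denoting the squared error of $h_{i}$ on x with respect to the target f, the generalization error and the ambiguity of individual learner $h_{i}$ over the whole sample space are, respectively: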

$\begin{aligned} E_{i} & =\int E\left(h_{i} \mid \boldsymbol{x}\right) p(\boldsymbol{x}) d \boldsymbol{x}, \\ A_{i} & =\int A\left(h_{i} \mid \boldsymbol{x}\right) p(\boldsymbol{x}) d \boldsymbol{x} . \end{aligned}$

The generalization error of the ensemble is:

$E=\int E(H \mid \boldsymbol{x}) p(\boldsymbol{x}) d \boldsymbol{x}$
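These quantities are tied together by the standard error-ambiguity decomposition $E=\bar{E}-\bar{A}$, where $\bar{E}=\sum_{i} w_{i} E_{i}$ and $\bar{A}=\sum_{i} w_{i} A_{i}$. The text above does not state this result, so the sketch below is only a numerical check under explicit assumptions: squared error as the loss, weights summing to 1, integrals replaced by averages over a finite sample, and made-up learners (the true function plus noise).

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 1000, 5
f = np.sin(np.linspace(0, 3, n))                 # assumed true function values f(x)
h = f + rng.normal(scale=0.3, size=(T, n))       # T hypothetical noisy learners h_i(x)
w = np.full(T, 1.0 / T)                          # weights: non-negative, summing to 1

H = w @ h                                        # weighted-average ensemble H(x)

E_i = np.mean((f - h) ** 2, axis=1)              # individual errors E_i (sample average)
A_i = np.mean((h - H) ** 2, axis=1)              # individual ambiguities A_i
E_bar = w @ E_i                                  # weighted mean individual error
A_bar = w @ A_i                                  # weighted mean ambiguity
E = np.mean((f - H) ** 2)                        # ensemble error

print(f"E = {E:.4f}, E_bar - A_bar = {E_bar - A_bar:.4f}")
```

The two printed numbers agree up to floating-point rounding. Since $\bar{A} \geq 0$, the ensemble error never exceeds the weighted mean individual error, and larger ambiguity (more diversity) lowers it further for a fixed $\bar{E}$.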

The following experiment shows how to carry out an EDDA-like error analysis in Python for a single linear regression model, to help understand where the model's prediction error comes from:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Generate a simple linear regression data set
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the absolute prediction errors
errors = np.abs(y_pred - y_test)

# Print error statistics
print(f"Mean absolute error: {np.mean(errors)}")
print(f"Max absolute error: {np.max(errors)}")

# Plot the error distribution
plt.figure(figsize=(10, 6))
plt.scatter(X_test, errors, color='blue', alpha=0.6)
plt.title('Error Distribution')
plt.xlabel('X_test')
plt.ylabel('Absolute Error')
plt.show()

The experimental results are as follows:

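8.5.2 Diversity Measures

Diversity measures are used to quantify the diversity of the individual classifiers in an ensemble, i.e., to estimate how diverse the individual learners are.

A typical approach is to consider the pairwise similarity/dissimilarity of individual classifiers. The contingency table of prediction results is as follows: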

The following experiment computes the voting diversity of several classifiers in an ensemble:

import numpy as np

# Suppose we have the predictions of three classifiers
classifier1_predictions = np.array([1, 0, 1, 1, 0])
classifier2_predictions = np.array([0, 0, 1, 1, 1])
classifier3_predictions = np.array([1, 1, 0, 0, 1])

# Fraction of samples on which two classifiers disagree
def pairwise_disagreement(predictions1, predictions2):
    return np.mean(predictions1 != predictions2)

# Voting diversity: mean disagreement over all pairs of classifiers
def ensemble_diversity(predictions):
    n_classifiers = len(predictions)
    diversity_sum = 0.0
    count = 0

    for i in range(n_classifiers):
        for j in range(i + 1, n_classifiers):
            diversity_sum += pairwise_disagreement(predictions[i], predictions[j])
            count += 1

    return diversity_sum / count

# Collect the predictions
predictions = [classifier1_predictions, classifier2_predictions, classifier3_predictions]

# Compute the voting diversity of the ensemble
diversity = ensemble_diversity(predictions)

print(f"Ensemble diversity: {diversity}")

The experimental results are as follows:
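Beyond the average disagreement, several common pairwise measures can be read off the contingency table of two classifiers. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative) and uses the usual counts a, b, c, d; the function names are hypothetical, and the formulas are the textbook definitions of the disagreement measure, the Q-statistic, and the kappa statistic.

```python
import numpy as np

def contingency(pred_i, pred_j):
    """Counts a, b, c, d for two binary classifiers with labels 1 and 0."""
    a = int(np.sum((pred_i == 1) & (pred_j == 1)))   # both predict positive
    b = int(np.sum((pred_i == 1) & (pred_j == 0)))   # only h_i predicts positive
    c = int(np.sum((pred_i == 0) & (pred_j == 1)))   # only h_j predicts positive
    d = int(np.sum((pred_i == 0) & (pred_j == 0)))   # both predict negative
    return a, b, c, d

def disagreement(a, b, c, d):
    """Fraction of samples on which the two classifiers differ."""
    return (b + c) / (a + b + c + d)

def q_statistic(a, b, c, d):
    """Q-statistic; undefined when a*d + b*c == 0."""
    return (a * d - b * c) / (a * d + b * c)

def kappa(a, b, c, d):
    """Kappa statistic: agreement corrected for chance agreement."""
    m = a + b + c + d
    p1 = (a + d) / m                                      # observed agreement
    p2 = ((a + b) * (a + c) + (c + d) * (b + d)) / m**2   # agreement expected by chance
    return (p1 - p2) / (1 - p2)

h1 = np.array([1, 0, 1, 1, 0])
h2 = np.array([0, 0, 1, 1, 1])
a, b, c, d = contingency(h1, h2)
dis, q, kap = disagreement(a, b, c, d), q_statistic(a, b, c, d), kappa(a, b, c, d)
print(f"a={a}, b={b}, c={c}, d={d}")
print(f"disagreement={dis:.3f}, Q={q:.3f}, kappa={kap:.3f}")
```

A larger disagreement value indicates more diversity, while a Q-statistic or kappa value near zero indicates that the two classifiers agree about as often as chance would predict.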

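8.5.3 Diversity Enhancement

Common approaches perturb the data samples, the input attributes, the output representation, or the algorithm parameters.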

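Data sample perturbation: given an initial data set, different data subsets can be generated from it and used to train different individual learners. Data sample perturbation is usually based on sampling.

Input attribute perturbation: a well-known example is the random subspace algorithm, which draws several attribute subsets from the initial attribute set and trains one base learner on each subset.

Output representation perturbation: the basic idea is to manipulate the output representation to increase diversity. The class labels of training samples can be perturbed, as in the Flipping Output method, or the output representation can be transformed, as in the Output Smearing method.

Algorithm parameter perturbation: base learning algorithms generally have parameters that need to be set, such as the number of hidden neurons and the initial connection weights of a neural network. Setting these parameters randomly often yields individual learners that differ considerably.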
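Of the four kinds of perturbation, input attribute perturbation is perhaps the easiest to sketch. The code below is a minimal random-subspace-style ensemble under stated assumptions: shallow scikit-learn decision trees as base learners, three attributes per learner, and simple voting on 0/1 labels. The helper names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_fit(X, y, T=15, n_sub=3, seed=0):
    """Train T trees, each on a random subset of n_sub attributes."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(T):
        feats = rng.choice(X.shape[1], size=n_sub, replace=False)  # attribute subset
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[:, feats], y)               # the learner only sees its own subset
        ensemble.append((feats, tree))
    return ensemble

def random_subspace_predict(ensemble, X):
    """Simple vote over trees, each predicting from its own attribute subset."""
    votes = np.stack([tree.predict(X[:, feats]) for feats, tree in ensemble])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] + X[:, 2] > 0).astype(int)

ensemble = random_subspace_fit(X, y)
subspace_acc = np.mean(random_subspace_predict(ensemble, X) == y)
print(f"training accuracy of the random subspace ensemble: {subspace_acc:.3f}")
```

Because each tree is trained on fewer attributes, training is also cheaper per learner. The approach is less suitable when the data has only a few attributes to begin with.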

The following experiment is a simple data augmentation example for an image classification task, using Python's tensorflow library and the ImageDataGenerator class:

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt

# Load the data set (assumed to be CIFAR-10)
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.cifar10.load_data()

# Data augmentation settings
datagen = ImageDataGenerator(
    rotation_range=20,      # range of random rotation, in degrees
    width_shift_range=0.2,  # range of random horizontal shift
    height_shift_range=0.2, # range of random vertical shift
    horizontal_flip=True,   # random horizontal flip
    zoom_range=0.1          # range of random zoom
)

# Pick one sample image to demonstrate the augmentation
img = train_images[0]
img = img.reshape((1,) + img.shape)  # add a batch dimension, as flow() expects 4-D input

# Create an iterator that yields augmented image batches
augmented_images = datagen.flow(img, batch_size=1)

# Show the original image and nine augmented versions in a 2 x 5 grid
plt.figure(figsize=(10, 4))

# Original image in the first cell
plt.subplot(2, 5, 1)
plt.imshow(img[0])
plt.title('Original Image')
plt.axis('off')

# Augmented images in the remaining nine cells
for i in range(9):
    augmented_image = next(augmented_images)[0].astype('uint8')
    plt.subplot(2, 5, i + 2)
    plt.imshow(augmented_image)
    plt.title(f'Augmented {i + 1}')
    plt.axis('off')

plt.tight_layout()
plt.show()

The experimental results are shown below: