Machine Learning with Scikit-Learn and TensorFlow 7.2 Bagging and Pasting

Book information
Hands-On Machine Learning with Scikit-Learn and TensorFlow
Publisher: O’Reilly Media, Inc, USA
Paperback: 566 pages
Language: English
ISBN: 1491962291
Barcode: 9781491962299
Product dimensions: 18 x 2.9 x 23.3 cm
ASIN: 1491962291

This series of posts is a translation of the book.
Code and data download: https://github.com/ageron/handson-ml

One way to get a diverse set of classifiers is to use different algorithms. Another approach is to use the same algorithm but train it on different random subsets of the training data. When the sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating); when the sampling is performed without replacement, it is called pasting. In other words, both bagging and pasting allow a given training instance to be sampled several times across different predictors, but only bagging allows a given training instance to be sampled several times by the same predictor.
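To make the with/without-replacement distinction concrete, here is a minimal illustrative sketch (not from the book) using NumPy's random sampling; the toy index array and sample size are arbitrary:

import numpy as np

rng = np.random.RandomState(42)
indices = np.arange(10)          # indices of a toy training set

# Bagging: sample WITH replacement -- the same index can be drawn
# several times for one predictor's training subset
bagging_sample = rng.choice(indices, size=8, replace=True)

# Pasting: sample WITHOUT replacement -- each index is drawn at most once
pasting_sample = rng.choice(indices, size=8, replace=False)

print(bagging_sample)   # may contain repeated indices
print(pasting_sample)   # all indices distinct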

Once all predictors are trained, the ensemble makes a prediction by aggregating the individual predictions: the most frequent prediction (the statistical mode) for classification, or the average for regression. Each individual predictor has a higher bias than one trained on the full training set, but aggregation reduces both bias and variance. In general, the net result is that the ensemble has a similar bias but a lower variance than a single model trained on the original training set.
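A rough sketch of the aggregation step, using hypothetical predictions (not from the book):

import numpy as np

# Hypothetical class predictions from 3 predictors on 4 test instances
class_preds = np.array([[0, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 1, 0]])

# Classification: majority vote (the statistical mode) per instance
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(),
                               axis=0, arr=class_preds)
print(majority)                  # -> [0 1 1 0]

# Regression: average the predictors' outputs per instance
reg_preds = np.array([[2.1, 3.0],
                      [1.9, 3.4],
                      [2.0, 3.2]])
print(reg_preds.mean(axis=0))    # -> [2.0 3.2]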

One reason bagging and pasting are so popular is that the predictors can be trained, and can make predictions, in parallel across multiple CPU cores or even multiple servers.

scikit-learn implements bagging and pasting through BaggingClassifier (for classification) and BaggingRegressor (for regression). Bagging is the default; to use pasting instead, set bootstrap=False. The following example trains an ensemble of 500 decision trees, each on 100 training instances sampled with replacement (n_jobs=-1 tells scikit-learn to use all available cores).

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
bag_clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=500,
        max_samples=100, bootstrap=True, n_jobs=-1, random_state=42
    )
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
# output
# 0.904

Note:
If the base classifiers can estimate class probabilities (i.e., they have a predict_proba() method), BaggingClassifier automatically performs soft voting instead of hard voting.
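Continuing the example above, you can check this by calling predict_proba() on the trained ensemble (the slice of three test instances is just for illustration):

# The decision trees expose predict_proba(), so the ensemble averages
# class probabilities (soft voting) rather than counting hard votes
print(bag_clf.predict_proba(X_test[:3]))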

The following example compares the decision boundary of a single decision tree with that of a BaggingClassifier built on decision trees. The two models achieve similar results on the training set (in fact, the single decision tree fits the training data better because it overfits), but the BaggingClassifier's decision boundary is smoother and it performs better on the test set. This illustrates that the ensemble has a similar bias but a lower variance.

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
# output
# 0.856

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5):
    # Evaluate the classifier on a grid covering the plotting area
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0', '#9898ff'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.show()

[Figure: decision boundary of a single Decision Tree (left) vs. Decision Trees with Bagging (right)]

Translator's note:
Reference for plotting decision boundaries: http://blog.youkuaiyun.com/qinhanmin2010/article/details/65692760

Compared with pasting, bagging's sampling with replacement slightly increases bias, but it also increases the diversity of the predictors, which reduces variance. Overall, bagging usually yields better models, which is why it is generally preferred. However, if you have the time, you can compare bagging and pasting and pick whichever works better.
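For instance, a minimal sketch of such a comparison, reusing the data from the example above and simply switching bootstrap=False to get pasting:

# Same ensemble as before, but with pasting (sampling without replacement)
paste_clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=500,
        max_samples=100, bootstrap=False, n_jobs=-1, random_state=42
    )
paste_clf.fit(X_train, y_train)
print(accuracy_score(y_test, paste_clf.predict(X_test)))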
