Machine Learning with Scikit-Learn and TensorFlow 7.2 Bagging and Pasting

Book information
Hands-On Machine Learning with Scikit-Learn and TensorFlow
Publisher: O’Reilly Media, Inc, USA
Paperback: 566 pages
Language: English
ISBN: 1491962291
Barcode: 9781491962299
Product dimensions: 18 x 2.9 x 23.3 cm
ASIN: 1491962291

This series of posts is a translation of the book.
Code and data download: https://github.com/ageron/handson-ml

One way to get a diverse set of classifiers is to use different algorithms. Another approach is to use the same algorithm but train it on different random subsets of the training data. When the sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating); when the sampling is performed without replacement, it is called pasting. In other words, both bagging and pasting allow a given training instance to be sampled several times across different predictors, but only bagging allows a given training instance to be sampled several times by the same predictor.
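To make the with/without-replacement distinction concrete, here is a minimal illustrative sketch (not from the book) using NumPy's random sampling; the toy index array and sample size are arbitrary:

import numpy as np

rng = np.random.RandomState(42)
indices = np.arange(10)          # indices of a toy training set

# Bagging: sample WITH replacement -- the same index can be drawn
# several times for one predictor's training subset
bagging_sample = rng.choice(indices, size=8, replace=True)

# Pasting: sample WITHOUT replacement -- each index is drawn at most once
pasting_sample = rng.choice(indices, size=8, replace=False)

print(bagging_sample)   # may contain repeated indices
print(pasting_sample)   # all indices distinct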

Once all predictors are trained, the ensemble makes a prediction by aggregating the individual predictions: the most frequent prediction (the statistical mode) for classification, or the average for regression. Each individual predictor has a higher bias than one trained on the full training set, but aggregation reduces both bias and variance. In general, the net result is that the ensemble has a similar bias but a lower variance than a single model trained on the original training set.
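A rough sketch of the aggregation step, using hypothetical predictions (not from the book):

import numpy as np

# Hypothetical class predictions from 3 predictors on 4 test instances
class_preds = np.array([[0, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 1, 0]])

# Classification: majority vote (the statistical mode) per instance
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(),
                               axis=0, arr=class_preds)
print(majority)                  # -> [0 1 1 0]

# Regression: average the predictors' outputs per instance
reg_preds = np.array([[2.1, 3.0],
                      [1.9, 3.4],
                      [2.0, 3.2]])
print(reg_preds.mean(axis=0))    # -> [2.0 3.2]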

One reason bagging and pasting are so popular is that the predictors can be trained, and can make predictions, in parallel across multiple CPU cores or even multiple servers.

scikit-learn implements bagging and pasting through BaggingClassifier (for classification) and BaggingRegressor (for regression). Bagging is the default; to use pasting instead, set bootstrap=False. The following example trains an ensemble of 500 decision trees, each on 100 training instances sampled with replacement (n_jobs=-1 tells scikit-learn to use all available cores).

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
bag_clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=500,
        max_samples=100, bootstrap=True, n_jobs=-1, random_state=42
    )
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
# output
# 0.904

Note:
If the base classifiers can estimate class probabilities (i.e., they have a predict_proba() method), BaggingClassifier automatically performs soft voting instead of hard voting.
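Continuing the example above, you can check this by calling predict_proba() on the trained ensemble (the slice of three test instances is just for illustration):

# The decision trees expose predict_proba(), so the ensemble averages
# class probabilities (soft voting) rather than counting hard votes
print(bag_clf.predict_proba(X_test[:3]))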

The following example compares the decision boundary of a single decision tree with that of a BaggingClassifier built on decision trees. The two models achieve similar results on the training set (in fact, the single decision tree fits the training data better because it overfits), but the BaggingClassifier's decision boundary is smoother and it performs better on the test set. This illustrates that the ensemble has a similar bias but a lower variance.

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
# output
# 0.856

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.5, -1, 1.5], alpha=0.5):
    # Evaluate the classifier on a grid covering the plotting area
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0', '#9898ff'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

plt.figure(figsize=(11,4))
plt.subplot(121)
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.subplot(122)
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.show()

[Figure: decision boundary of a single Decision Tree (left) vs. Decision Trees with Bagging (right)]

Translator's note:
Reference for plotting decision boundaries: http://blog.youkuaiyun.com/qinhanmin2010/article/details/65692760

Compared with pasting, bagging's sampling with replacement slightly increases bias, but it also increases the diversity of the predictors, which reduces variance. Overall, bagging usually yields better models, which is why it is generally preferred. However, if you have the time, you can compare bagging and pasting and pick whichever works better.
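For instance, a minimal sketch of such a comparison, reusing the data from the example above and simply switching bootstrap=False to get pasting:

# Same ensemble as before, but with pasting (sampling without replacement)
paste_clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=500,
        max_samples=100, bootstrap=False, n_jobs=-1, random_state=42
    )
paste_clf.fit(X_train, y_train)
print(accuracy_score(y_test, paste_clf.predict(X_test)))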
