Hands-On Machine Learning with Scikit-Learn and TensorFlow 6.2 Making Predictions

This post describes how to classify iris flowers with a decision tree, explains how the tree uses feature values to make its classification decisions, and provides example code for plotting the decision boundaries.


Book information
Hands-On Machine Learning with Scikit-Learn and TensorFlow
Publisher: O’Reilly Media, Inc, USA
Paperback: 566 pages
Language: English
ISBN: 1491962291
Barcode: 9781491962299
Dimensions: 18 x 2.9 x 23.3 cm
ASIN: 1491962291

This series of blog posts is a translation of the book.
Code and data download: https://github.com/ageron/handson-ml

Below is the decision tree obtained in Section 6.1.

[Figure: decision tree trained in Section 6.1]
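
For reference, here is a minimal sketch of how the tree in the figure might have been trained and exported, following the setup from Section 6.1 (the DecisionTreeClassifier settings and the export_graphviz call are assumptions based on that section, not something introduced in 6.2):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Assumed setup from Section 6.1: train on petal length and petal width only.
iris = load_iris()
X = iris.data[:, 2:]  # petal length, petal width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Export the tree to a .dot file, which can be rendered into the figure above.
export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)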

To classify a new iris flower, we start at the root node, which asks for the flower's petal length. If petal length <= 2.45 cm, we move to the root's left child. That node is a leaf node, so the flower is classified as setosa. If petal length > 2.45 cm, we move to the root's right child. That node is not a leaf, so it asks for the flower's petal width: if petal width <= 1.75 cm, the flower is classified as versicolor; otherwise it is classified as virginica.
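
The same traversal can be reproduced in code. A short sketch, assuming the tree_clf trained above (the sample flower with petal length 5 cm and petal width 1.5 cm is just an illustrative value):

tree_clf.predict([[5, 1.5]])        # class 1, i.e. versicolor (petal length > 2.45, petal width <= 1.75)
tree_clf.predict_proba([[5, 1.5]])  # per-class ratios of the leaf node the flower falls into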

Note:
Decision trees require no special data preprocessing, such as feature scaling or feature centering.

A node's attributes include the number of training samples at that node. For example, 100 training instances have petal length > 2.45 cm, and among those, 54 have petal width <= 1.75 cm. A node's attributes also include the number of samples of each class: taking the bottom-right node as an example, 0 training instances are setosa, 1 is versicolor, and 45 are virginica. Based on these per-class sample counts, we can compute the node's Gini impurity.
Gini impurity formula: $G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$
where $p_{i,k}$ is the ratio of class-$k$ instances among the training instances at node $i$.
Taking the bottom-left node as an example: $1 - (0/54)^2 - (49/54)^2 - (5/54)^2 \approx 0.168$, which matches the value shown in the figure.
If a node contains only instances of a single class, its Gini impurity is 0.
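
As a quick check, the Gini impurity of that bottom-left node can be computed directly from its class counts (a small illustrative snippet):

# Class counts at the depth-2 left node: 0 setosa, 49 versicolor, 5 virginica
counts = [0, 49, 5]
total = sum(counts)
gini = 1 - sum((c / total) ** 2 for c in counts)
print(round(gini, 3))  # 0.168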

Note:
scikit-learn uses the CART algorithm to train decision trees, which produces only binary trees. Some other decision tree algorithms (such as ID3) can produce trees whose nodes have more than two children.

We can plot the decision tree's decision boundaries.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# X, y and the trained tree_clf come from Section 6.1 (petal length and petal width).
plt.figure(figsize=(9, 4))

# Build a grid over the feature space and predict the class of every grid point.
x1s = np.linspace(0, 7.5, 100)
x2s = np.linspace(0, 3, 100)
x1, x2 = np.meshgrid(x1s, x2s)
X_new = np.c_[x1.ravel(), x2.ravel()]
y_pred = tree_clf.predict(X_new).reshape(x1.shape)

# Shade each predicted region, then overlay the training instances.
custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
plt.axis([0, 7.5, 0, 3])
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)

# Decision thresholds: solid line = depth-0 split, dashed line = depth-1 split,
# dotted lines = the splits a depth-3 tree would add.
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)
plt.text(1.40, 1.0, "Depth=0", fontsize=15)
plt.text(3.2, 1.80, "Depth=1", fontsize=13)
plt.text(4.05, 0.5, "(Depth=2)", fontsize=11)
plt.show()

Translator's note:
Reference for plotting decision boundaries: http://blog.youkuaiyun.com/qinhanmin2010/article/details/65692760

[Figure: decision boundaries of the decision tree]

The vertical line on the left is the root node's decision boundary: petal length = 2.45 cm. The instances on its left all belong to the same class, so no further splitting is needed there. The instances on its right belong to different classes, so they need to be split again. The horizontal line is the decision boundary of the root's right child: petal width = 1.75 cm. Since we set max_depth=2, training stops there. If we set max_depth=3 instead, each of the two depth-2 nodes would get its own decision boundary, shown by the dotted vertical lines on the right (note that the upper and lower dotted segments are at slightly different positions).
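
To actually see those extra boundaries, the tree could be retrained with a larger depth. A minimal sketch, assuming the imports and X, y from the earlier snippet (the variable name tree_clf_depth3 is just illustrative):

# With max_depth=3 each depth-2 node gets one more split, adding boundaries
# like the dotted vertical lines drawn at petal length 4.95 cm and 4.85 cm above.
tree_clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf_depth3.fit(X, y)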

Note:
White box models: the prediction process is easy to interpret, e.g. decision trees.
Black box models: the prediction process is hard to interpret, e.g. random forests and neural networks.
