Supervised Machine Learning Example 01: Classifying the Iris Dataset


Edword_adra


Article link: https://blog.youkuaiyun.com/Edword_adra/article/details/107307772


Introduction to the Iris dataset:

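The Iris Data Set first appeared in the 1936 paper "The use of multiple measurements in taxonomic problems" by the renowned British statistician and biologist Ronald Fisher, where it was used to introduce linear discriminant analysis. The dataset covers three species of iris: Iris Setosa, Iris Versicolour, and Iris Virginica. 50 samples were collected for each species, so the dataset contains 150 samples in total.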

**Features:** each of the 150 samples has 4 features (unit: cm): sepal length, sepal width, petal length, and petal width. By convention, $m$ denotes the number of samples and $n$ the number of features per sample, so here $m = 150$ and $n = 4$.
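As a quick check of this notation, the sketch below reads $m$ and $n$ off the shape of the feature matrix. It assumes scikit-learn is installed; the variable names `m` and `n` are chosen here only to mirror the symbols above.

```python
from sklearn import datasets

iris = datasets.load_iris()

# iris.data is a 2-D array: one row per sample, one column per feature
m, n = iris.data.shape
print(m, n)

# Feature names as shipped with scikit-learn (all measured in cm)
print(iris.feature_names)
```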

1. Importing the dataset from sklearn

from sklearn import datasets
iris = datasets.load_iris()

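This loads all the data and metadata of the Iris dataset into the variable iris, whose data and target attributes we use next.

iris.data  # inspect the feature data

The output is an array of 150 elements, each holding four values: the sepal and petal measurements.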

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       ···

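Next, look at the species label of each of the 150 samples. There are three species: setosa, versicolor, and virginica.

iris.target

The output contains 150 values, where 0, 1, and 2 stand for setosa, versicolor, and virginica, respectively.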

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The names of the three iris species are available as well:

iris.target_names  # access the target_names attribute of iris

The output lists the three iris classes:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
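To tie the integer labels back to the class names, the following sketch counts the samples per species. It only uses `np.bincount` and the `target` and `target_names` attributes shown above; the dictionary name `per_species` is made up for this illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# np.bincount counts how often each integer label (0, 1, 2) occurs
counts = np.bincount(iris.target)

# Pair each class name with its sample count
per_species = {str(name): int(c) for name, c in zip(iris.target_names, counts)}
print(per_species)
```

Each species should come out with 50 samples, matching the description of the dataset in the introduction.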

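2. Visualizing the Iris dataset

Using the matplotlib library, we draw a scatter plot with three colors for the three species: blue, green, and red stand for setosa, versicolor, and virginica, respectively. The x-axis shows sepal length and the y-axis shows sepal width.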

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, 0]  # x-axis: sepal length
y = iris.data[:, 1]  # y-axis: sepal width
species = iris.target
# pad the axis limits by 0.5 cm on each side
x_min, x_max = x.min() - 0.5, x.max() + 0.5
y_min, y_max = y.min() - 0.5, y.max() + 0.5

plt.figure()
plt.title('Iris Dataset - Classification By Sepal Sizes')
# map classes 0, 1, 2 to blue, green, red
plt.scatter(x, y, c=species, cmap=ListedColormap(['blue', 'green', 'red']))
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

Different colors denote the different iris species.
Modifying the code to plot petal length against petal width instead:

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, 2]  # x-axis: petal length
y = iris.data[:, 3]  # y-axis: petal width
species = iris.target
# pad the axis limits by 0.5 cm on each side
x_min, x_max = x.min() - .5, x.max() + .5
y_min, y_max = y.min() - .5, y.max() + .5

plt.figure()
plt.title('Iris Dataset - Classification By Petal Sizes', size=14)
# map classes 0, 1, 2 to blue, green, red
plt.scatter(x, y, c=species, cmap=ListedColormap(['blue', 'green', 'red']))
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
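Since the plots rely on color alone to tell the species apart, a legend makes the mapping explicit. The sketch below is one possible way to add it with `matplotlib.patches.Patch`; the blue/green/red palette and the output file name `iris_petal_legend.png` are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, 2]  # petal length
y = iris.data[:, 3]  # petal width

# Assumed palette: one color per class index 0, 1, 2
colors = ['blue', 'green', 'red']

fig, ax = plt.subplots()
ax.scatter(x, y, c=iris.target, cmap=ListedColormap(colors))
ax.set_xlabel('Petal length')
ax.set_ylabel('Petal width')

# One legend entry per species, colored to match the scatter points
handles = [mpatches.Patch(color=c, label=str(name))
           for c, name in zip(colors, iris.target_names)]
legend = ax.legend(handles=handles, loc='upper left')

fig.savefig('iris_petal_legend.png')  # hypothetical output file name
```

In an interactive session, `plt.show()` can replace the `savefig` call.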

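3. Principal component analysis (PCA)

Principal Component Analysis reduces the number of dimensions of a system while keeping enough information to characterize each data point; the newly generated dimensions are called principal components. Here, the four sepal and petal measurements describe the three species, and PCA combines these four measurements into three components that we can draw as a 3D scatter plot.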

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
species = iris.target
# project the four measurements onto the first three principal components
x_reduced = PCA(n_components=3).fit_transform(iris.data)

# SCATTERPLOT 3D
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.set_title('Iris Dataset by PCA', size=14)
ax.scatter(x_reduced[:, 0], x_reduced[:, 1], x_reduced[:, 2], c=species)

ax.set_xlabel('First eigenvector')
ax.set_ylabel('Second eigenvector')
ax.set_zlabel('Third eigenvector')
# hide the tick labels on all three axes
ax.xaxis.set_ticklabels(())
ax.yaxis.set_ticklabels(())
ax.zaxis.set_ticklabels(())
plt.show()

In the resulting 3D scatter plot, the three iris species each form their own cluster.
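To judge how much information the three components retain, one can inspect the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below is an addition for illustration, not part of the original walkthrough; the variable name `ratios` is arbitrary.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

pca = PCA(n_components=3)
x_reduced = pca.fit_transform(iris.data)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
print(ratios)

# Combined share of variance kept by the three components
print(ratios.sum())
```

For the Iris data the first component carries most of the variance, which is why the clusters already separate well along the first axis.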

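Classifying the Iris dataset

In the model testing phase, we check whether the model built from the previously collected data is valid. The collected data is split into a training set and a validation set: the data used to build the model is the training set, and the data used to verify it is the validation set. The best-known technique is cross-validation. Its basic operation is to divide the training set into several parts; each part takes a turn as the validation set while the remaining parts serve as the training set, and iterating in this way leads to the best model.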
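The cross-validation procedure described above can be sketched with scikit-learn's `cross_val_score`. The choice of 5 folds and of the k-nearest neighbors classifier (the model used in the next section) are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier()

# Split the data into 5 parts; each part serves once as the validation set
# while the other 4 parts are used for training
scores = cross_val_score(knn, iris.data, iris.target, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

The mean of the fold scores gives a steadier estimate of model quality than a single train/validation split.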

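K-nearest neighbors classifier

1. Split the data into a training set and a validation set: 140 samples are used to train the model and 10 samples form the validation set.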

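# K-nearest neighbors classifier
import numpy as np
from sklearn import datasets

np.random.seed(0)  # fix the seed so the shuffle is reproducible
iris = datasets.load_iris()
x = iris.data
y = iris.target
# shuffle the sample indices first, then split
i = np.random.permutation(len(iris.data))
# training set: all but the last 10 shuffled samples
x_train = x[i[:-10]]
y_train = y[i[:-10]]
# validation set: the last 10 shuffled samples
x_test = x[i[-10:]]
y_test = y[i[-10:]]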
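An alternative to shuffling the indices by hand is scikit-learn's `train_test_split`. The sketch below keeps the same 140/10 proportions; note that `random_state=0` is an assumption and does not reproduce the exact split obtained with `np.random.permutation` above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# test_size=10 holds out exactly 10 samples; the remaining 140 are for training
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0)

print(x_train.shape, x_test.shape)
```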

2. Call the constructor of the KNN classifier, then train it with the fit() function. Training the knn classifier on the 140 observations yields the predictive model.

In[ ]:  from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier()
        # call the classifier's constructor, then train it with fit()
        knn.fit(x_train, y_train)
Out[ ]: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
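To show what the classifier does with the defaults listed in the output (`n_neighbors=5`, Euclidean distance, uniform weights), here is a minimal hand-written version. It is an illustrative sketch only, not how scikit-learn implements the search internally, and the function name `knn_predict` is made up for this example.

```python
import numpy as np
from sklearn import datasets

def knn_predict(x_train, y_train, x_query, k=5):
    """Label each query row by majority vote among its k nearest training rows."""
    preds = []
    for q in x_query:
        # Euclidean distance from the query point to every training sample
        dists = np.sqrt(((x_train - q) ** 2).sum(axis=1))
        # Indices of the k closest training samples
        nearest = np.argsort(dists)[:k]
        # Count the votes per class and pick the most frequent label
        votes = np.bincount(y_train[nearest])
        preds.append(votes.argmax())
    return np.array(preds)

# Same shuffle-and-split as in step 1
np.random.seed(0)
iris = datasets.load_iris()
i = np.random.permutation(len(iris.data))
x_train, y_train = iris.data[i[:-10]], iris.target[i[:-10]]
x_test, y_test = iris.data[i[-10:]], iris.target[i[-10:]]

preds = knn_predict(x_train, y_train, x_test)
print(preds)
```

Ties in distance or in the vote may be resolved differently than in scikit-learn, so the two implementations are not guaranteed to agree on every sample.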

3. Check the model against the validation set. To get the predictions, call the predict() function directly on the trained model knn.

In[1]:  knn.predict(x_test)
Out[1]: array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
        # compare the predictions with the actual values in y_test
In[2]:  y_test
Out[2]: array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])

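Comparing the two arrays, one of the 10 predictions is wrong (the second sample), so the error rate is 10%. To make the classifier's decisions easier to interpret, we plot the decision boundaries for the sepal measurements and for the petal measurements in turn.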
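Rather than comparing the arrays by eye, the error rate can be computed directly. The sketch below repeats the split and training from the steps above and uses `accuracy_score` from `sklearn.metrics`; the names `accuracy`, `error_rate`, and `n_wrong` are chosen for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Same shuffle-and-split as in step 1
np.random.seed(0)
iris = datasets.load_iris()
i = np.random.permutation(len(iris.data))
x_train, y_train = iris.data[i[:-10]], iris.target[i[:-10]]
x_test, y_test = iris.data[i[-10:]], iris.target[i[-10:]]

knn = KNeighborsClassifier().fit(x_train, y_train)
y_pred = knn.predict(x_test)

# Fraction of validation samples predicted correctly
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy

# Number of misclassified validation samples
n_wrong = int((y_pred != y_test).sum())
print(accuracy, error_rate, n_wrong)
```

With only 10 validation samples, each misclassification shifts the error rate by 10 percentage points, so this figure is a rough indication rather than a precise estimate.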

Decision boundaries for the sepal measurements

# plot the decision boundaries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data[:, :2]  # sepal length and sepal width
y = iris.target
x_min, x_max = x[:, 0].min() - .5, x[:, 0].max() + .5
y_min, y_max = x[:, 1].min() - .5, x[:, 1].max() + .5

# MESH
# three light colors, one per decision region
cmap_light = ListedColormap(['#AAAAFF', '#AAFFAA', '#FFAAAA'])
# matching bold colors for the data points
cmap_bold = ListedColormap(['#0000FF', '#00FF00', '#FF0000'])
h = 0.02  # mesh step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
knn = KNeighborsClassifier()
knn.fit(x, y)
# predict a class for every mesh point, then reshape back to the grid
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
# plot the training data
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold)
plt.title('Sepal Length Decision Boundary', size=14)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

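The plot shows clearly that a small patch of the red region reaches into the neighboring decision region: classes 1 and 2, versicolor and virginica, overlap. This overlap biases the trained model's predictions and is consistent with the 10% error found when comparing the predictions for x_test against y_test.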

Decision boundaries for the petal measurements

# plot the decision boundaries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data[:, 2:4]  # petal length and petal width
y = iris.target

x_min, x_max = x[:, 0].min() - .5, x[:, 0].max() + .5
y_min, y_max = x[:, 1].min() - .5, x[:, 1].max() + .5
# MESH
# three light colors, one per decision region
cmap_light = ListedColormap(['#AAAAFF', '#AAFFAA', '#FFAAAA'])
# matching bold colors for the data points
cmap_bold = ListedColormap(['#0000FF', '#00FF00', '#FF0000'])
h = 0.02  # mesh step size
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
knn = KNeighborsClassifier()
knn.fit(x, y)
# predict a class for every mesh point, then reshape back to the grid
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

# plot the training data
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold)
plt.title('Petal Length Decision Boundary', size=14)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

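Likewise, the plot shows that the red and green regions overlap: class 1, versicolor (green), and class 2, virginica (red), share a boundary zone. This overlap biases the trained model's predictions, which explains the misclassification found when comparing the predictions for x_test against y_test.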
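The visual impression from the two decision-boundary plots can also be quantified. As a rough sketch, the code below cross-validates the same default KNN classifier once on the sepal features and once on the petal features; the 5-fold setup is an assumption made for this comparison.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
y = iris.target

# Columns 0-1 are the sepal measurements, columns 2-3 the petal measurements
sepal_scores = cross_val_score(KNeighborsClassifier(), iris.data[:, :2], y, cv=5)
petal_scores = cross_val_score(KNeighborsClassifier(), iris.data[:, 2:4], y, cv=5)

print('sepal features:', sepal_scores.mean())
print('petal features:', petal_scores.mean())
```

The petal features should score noticeably higher, in line with the smaller overlap between versicolor and virginica in the petal plot.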