[Python] Machine Learning Notes 02: Model Evaluation and Feature Engineering

Reference for this article: *Python Data Science Handbook*

Python packages used in this article:

%matplotlib inline
import pandas as pd
import seaborn as sns
from matplotlib.pylab import *

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import validation_curve, learning_curve

sns.set()
plt.rc('font', family='SimHei')
plt.rc('axes', unicode_minus=False)

Model Validation

Model validation is an important part of machine learning: it means comparing a model's predictions against the actual values. To do this, we need to reserve part of the available data as a validation set. A common practice is to split the data into several parts and let each part take a turn as the validation set; this is called cross-validation. For example, a five-fold cross-validation process looks like this:

[Figure: schematic of five-fold cross-validation]
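Before turning to cross-validation, the simplest form of model validation is a single hold-out split. Below is a minimal sketch using train_test_split (already imported above); it reuses the same local iris CSV and KNN model as the next example, and the 30% test fraction is just an illustrative choice.

iris = pd.read_csv('./seaborn_dataset/iris.csv')
iris_x = iris.drop('species', axis=1)
iris_y = iris['species']

# reserve 30% of the data as a hold-out validation set (illustrative choice)
x_tr, x_val, y_tr, y_val = train_test_split(iris_x, iris_y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_tr, y_tr)
print(f'留出法验证准确率:{model.score(x_val, y_val) * 100:0.3f}%')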

In sklearn, cross_val_score implements the cross-validation workflow and computes the model's accuracy automatically. It is used as follows (k-nearest-neighbor classification of the iris dataset as an example):

iris = pd.read_csv('./seaborn_dataset/iris.csv')
iris_x = iris.drop('species', axis=1)
iris_y = iris['species']
model = KNeighborsClassifier(n_neighbors=1)

res = cross_val_score(model, iris_x, iris_y, cv=5)
print('五轮交叉检验结果分别为:')
for i, r in enumerate(res):
    print(f'{r*100:0.3f}%', end=(', ' if i + 1 < res.size else '\n'))
print(f'模型准确率的均值为:{res.mean()*100:0.3f}%,标准差为:{res.std():0.3f}')

Validation results:

五轮交叉检验结果分别为:
96.667%, 96.667%, 93.333%, 93.333%, 100.000%
模型准确率的均值为:96.000%,标准差为:0.025
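As a side note, the extreme case of "each part takes a turn as the validation set" is leave-one-out cross-validation. A minimal sketch, reusing the model and data defined above (LeaveOneOut is not among the imports at the top and needs to be imported separately):

from sklearn.model_selection import LeaveOneOut

# each round leaves out exactly one sample, so every individual score is 0 or 1
loo_scores = cross_val_score(model, iris_x, iris_y, cv=LeaveOneOut())
print(f'留一法交叉检验的平均准确率:{loo_scores.mean() * 100:0.3f}%')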

Validation Curves

During machine learning, if the model is not complex enough, it underfits: the model fails to capture all the features of the data, and its bias is too high.
If the model is too complex, it overfits: the model learns too much of the noise in the data, and its variance is too high.
Examples of underfitting and overfitting:

[Figure: examples of underfitting and overfitting]

For an underfitting (high-bias) model, performance on the validation set is similar to performance on the training set;
For an overfitting (high-variance) model, performance on the validation set is far worse than on the training set;

sklearn provides a function, validation_curve, that makes it easy to plot the relationship between model complexity and model performance. Here is a polynomial-fitting example:

x_train = np.random.rand(100)
y_train = 10 - 1 / (x_train ** 2 + 0.1) + 2 * np.random.rand(100)
x_test = linspace(-0.1, 1.1, 1000)
x_train, y_train, x_test = x_train[:, np.newaxis], y_train[:, np.newaxis], x_test[:, np.newaxis]

plt.figure(figsize=(10, 10))
plt.plot(x_train, y_train, linestyle='', marker='o', markersize=5, alpha=0.8)

for d in [1, 2, 6]:
    lin = LinearRegression()
    pol = PolynomialFeatures(degree=d)
    lin.fit(pol.fit_transform(x_train), y_train)
    res = lin.predict(pol.fit_transform(x_test))
    plt.plot(x_test, res, label=f'拟合阶数:{d}')
plt.title('使用不同阶数的多项式模型拟合曲线')
plt.legend()

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
train_score, val_score = validation_curve(
    estimator=model,
    X=x_train,
    y=y_train,
    param_name='polynomialfeatures__degree',
    param_range=range(0, 20),
    cv=7,
)

plt.figure(figsize=(10, 10))
plt.plot(range(0, 20), np.mean(train_score, axis=1), label='训练得分')
plt.plot(range(0, 20), np.mean(val_score, axis=1), label='验证得分')
plt.xlabel('多项式阶数')
plt.ylabel('判定系数')
plt.title('模型验证曲线')
plt.legend()

[Figure: polynomial fits of degree 1, 2, and 6 to the training data]

[Figure: validation curve of training and validation score vs. polynomial degree]

In the validation-curve plot, the Y axis is the coefficient of determination ($R^2$) obtained by the model on the training set and on the validation set. When $R^2 = 1$, the model's output matches the true values exactly; when $R^2 < 1$, the smaller the value, the larger the gap between the model's output and the true values;

From the plot we can see that when model complexity is very low, both the training score and the validation score are low and the model is underfitting. As complexity increases, the training score keeps rising toward 1, but beyond a certain complexity the validation score drops sharply, meaning the model is overfitting.
What we need to do is find the point where the training score and the validation score are in balance;
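A straightforward way to pick that point is to take the degree with the highest mean validation score from the validation curve computed above; a minimal sketch:

# degrees searched above were range(0, 20); pick the one with the best mean validation score
degrees = np.arange(0, 20)
best_degree = degrees[np.argmax(np.mean(val_score, axis=1))]
print(f'验证得分最高的多项式阶数:{best_degree}')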

Learning Curves

In machine learning, a learning curve shows how the training/validation score of a model of fixed complexity changes with the size of the training set;
For a model of fixed complexity:
When the training set is small, the model tends to overfit: the training score is high and the validation score is low;
When the training set is large, the model tends toward underfitting: the training score drops, but the validation score improves;
The validation score is never higher than the training score, and as the training set grows the two converge toward a fixed value;

In sklearn, the learning_curve function makes it easy to plot a learning curve for a given model;

Here is an example of plotting learning curves for two polynomial models of different degree (the result I got here is not ideal):

x_train = np.random.rand(150) * 10
y_train = (x_train + 1 * np.random.rand(150)) ** 0.2
x_train, y_train = x_train[:, np.newaxis], y_train[:, np.newaxis]

plt.figure(figsize=(7, 7))
plt.plot(x_train, y_train, linestyle='', marker='o', markersize=2, alpha=0.8)
plt.title('训练数据集')

fig, axs = plt.subplots(1, 2, figsize=(20, 10))
for i, d in enumerate([5, 8]):
    N, train_score, val_score = learning_curve(
        estimator=make_pipeline(PolynomialFeatures(degree=d), LinearRegression()),
        X=x_train,
        y=y_train,
        train_sizes=linspace(0.2, 1, 20),
        cv=7,
    )

    ax = axs[i]  # type: plt.Axes
    ax.plot(N, np.mean(train_score, axis=1), label='训练得分')
    ax.plot(N, np.mean(val_score, axis=1), label='验证得分')
    ax.hlines(np.mean([train_score[-1], val_score[-1]]), N[0], N[-1], linestyle='--', color='gray')
    ax.set_xlabel('训练集大小')
    ax.set_ylabel('判定系数')
    ax.set_title(f'多项式阶数={d}')
    ax.legend()

[Figure: learning curves for polynomial degrees 5 and 8]

Grid Search

Since a model's performance depends on both the size of the training set and the model's complexity, finding the best model parameters mathematically means locating the highest point on a three-dimensional surface. sklearn provides a tool for exactly this task, GridSearchCV, which helps us find the best model parameters;

Taking polynomial fitting as an example, use the grid-search tool to find the best model parameters:

x_train = np.random.rand(100)
y_train = 10 - 1 / (x_train ** 2 + 0.1) + 2 * np.random.rand(100)
x_train, y_train = x_train[:, np.newaxis], y_train[:, np.newaxis]
x_test = linspace(-0.1, 1.1, 1000)[:, np.newaxis]

grid_param = {
    'polynomialfeatures__degree': np.arange(21),
    'linearregression__fit_intercept': [True, False],
    'linearregression__normalize': [True, False]}  # note: normalize was removed from LinearRegression in newer scikit-learn versions
grid = GridSearchCV(make_pipeline(PolynomialFeatures(degree=2), LinearRegression()), grid_param, cv=7)
grid.fit(x_train, y_train)

print('最佳模型参数:')
for k in grid.best_params_.keys():
    print(f'{k:<40} -> {grid.best_params_[k]}')

print(f'\n最佳判定系数:{grid.best_score_:0.3f}')

model = grid.best_estimator_
res = model.predict(x_test)

plt.figure(figsize=(10, 10))
plt.plot(x_train, y_train, linestyle='', marker='o', markersize=4, alpha=0.8)
plt.plot(x_test, res)
plt.title('使用网格搜索寻找模型最佳参数并拟合')

[Figure: best model found by grid search plotted against the training data]
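Besides best_params_ and best_score_, GridSearchCV also records the score of every parameter combination it tried in cv_results_; a minimal sketch of inspecting it as a DataFrame (the per-parameter columns follow the param_<name> naming convention):

# mean cross-validated score of every searched parameter combination
cv_res = pd.DataFrame(grid.cv_results_)
print(cv_res[['param_polynomialfeatures__degree',
              'param_linearregression__fit_intercept',
              'mean_test_score']].head())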

Feature Engineering

In practice, the raw data we collect can rarely be fed to a machine learning model directly; it has to be converted into numerical features the model can accept. This is feature engineering;

Categorical Features

Categorical features are features such as city, gender, or day of the week. For categorical features we usually use one-hot encoding; in sklearn this transformation is done by DictVectorizer:

data = [
    dict(a=300, b=200, c='A'),
    dict(a=500, b=600, c='B'),
    dict(a=100, b=900, c='C'),
]

vec = DictVectorizer(dtype=np.int64, sparse=False)
data = vec.fit_transform(data)

print(data)
print(vec.get_feature_names())

Output:

[[300 200   1   0   0]
 [500 600   0   1   0]
 [100 900   0   0   1]]
['a', 'b', 'c=A', 'c=B', 'c=C']

You can see the effect of the categorical attribute c after it has been converted into a one-hot encoding;

Because one-hot encoding significantly increases the dimensionality of the data, the DictVectorizer class can return a sparse matrix to save space;
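A minimal sketch of the sparse output, using the same three records as above; toarray() converts back to a dense array when needed:

vec_sparse = DictVectorizer(dtype=np.int64, sparse=True)
data_sparse = vec_sparse.fit_transform([
    dict(a=300, b=200, c='A'),
    dict(a=500, b=600, c='B'),
    dict(a=100, b=900, c='C'),
])
print(type(data_sparse))      # a scipy sparse matrix
print(data_sparse.toarray())  # dense view, identical to the result above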

Text Features

The most common ways to compute text features are word counts and term frequency-inverse document frequency (TF-IDF); compared with the former, the latter keeps common words from receiving too much weight.
In sklearn, word counts are computed with CountVectorizer, and TF-IDF with TfidfVectorizer;
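For reference, with TfidfVectorizer's default settings (smooth_idf=True, norm='l2') the weighting is, as I understand it,

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)$$

where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$; each document vector is then L2-normalized. The smoothed inverse document frequency is what keeps very common words from dominating.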

sample = [
    'Beautiful is better than ugly.',
    'Explicit is better than implicit.',
    'Simple is better than complex.',
    'Complex is better than complicated.',
    'Flat is better than nested.',
    'Sparse is better than dense.',
    'Readability counts.',
    'Special cases aren\'t special enough to break the rules.',
    'Although practicality beats purity.',
    'Errors should never pass silently.',
    'Unless explicitly silenced.',
    'In the face of ambiguity, refuse the temptation to guess.',
    'There should be one-- and preferably only one --obvious way to do it.',
    'Although that way may not be obvious at first unless you\'re Dutch.',
    'Now is better than never.',
    'Although never is often better than *right* now.',
    'If the implementation is hard to explain, it\'s a bad idea.',
    'If the implementation is easy to explain, it may be a good idea.',
    'Namespaces are one honking great idea -- let\'s do more of those!',
]

vec = CountVectorizer()
res = vec.fit_transform(sample)
print('CountVectorizer结果:')
print(pd.DataFrame(data=res.toarray(), columns=vec.get_feature_names()).head())

vec = TfidfVectorizer()
res = vec.fit_transform(sample)
print('TfidfVectorizer结果:')
print(pd.DataFrame(data=res.toarray(), columns=vec.get_feature_names()).head())

Output:

CountVectorizer结果:
   although  ambiguity  and  are  aren  at  bad  be  beats  beautiful  ...  \
0         0          0    0    0     0   0    0   0      0          1  ...   
1         0          0    0    0     0   0    0   0      0          0  ...   
2         0          0    0    0     0   0    0   0      0          0  ...   
3         0          0    0    0     0   0    0   0      0          0  ...   
4         0          0    0    0     0   0    0   0      0          0  ...   

   than  that  the  there  those  to  ugly  unless  way  you  
0     1     0    0      0      0   0     1       0    0    0  
1     1     0    0      0      0   0     0       0    0    0  
2     1     0    0      0      0   0     0       0    0    0  
3     1     0    0      0      0   0     0       0    0    0  
4     1     0    0      0      0   0     0       0    0    0  

[5 rows x 79 columns]

TfidfVectorizer结果:
   although  ambiguity  and  are  aren   at  bad   be  beats  beautiful  ...  \
0       0.0        0.0  0.0  0.0   0.0  0.0  0.0  0.0    0.0   0.594732  ...   
1       0.0        0.0  0.0  0.0   0.0  0.0  0.0  0.0    0.0   0.000000  ...   
2       0.0        0.0  0.0  0.0   0.0  0.0  0.0  0.0    0.0   0.000000  ...   
3       0.0        0.0  0.0  0.0   0.0  0.0  0.0  0.0    0.0   0.000000  ...   
4       0.0        0.0  0.0  0.0   0.0  0.0  0.0  0.0    0.0   0.000000  ...   

       than  that  the  there  those   to      ugly  unless  way  you  
0  0.323877   0.0  0.0    0.0    0.0  0.0  0.594732     0.0  0.0  0.0  
1  0.323877   0.0  0.0    0.0    0.0  0.0  0.000000     0.0  0.0  0.0  
2  0.337944   0.0  0.0    0.0    0.0  0.0  0.000000     0.0  0.0  0.0  
3  0.337944   0.0  0.0    0.0    0.0  0.0  0.000000     0.0  0.0  0.0  
4  0.323877   0.0  0.0    0.0    0.0  0.0  0.000000     0.0  0.0  0.0  

[5 rows x 79 columns]

Image Features

More tools for image feature extraction can be found in the Scikit-Image project; they are not covered in detail here (the book's author took a shortcut);
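Purely as a pointer, here is a minimal sketch of extracting HOG (histogram of oriented gradients) features with scikit-image; the sample image and the HOG parameters below are illustrative assumptions on my part, not an example from the book:

# minimal HOG feature sketch (assumes the scikit-image package is installed)
from skimage import color, data
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())  # built-in sample image, converted to grayscale
features = hog(image, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(3, 3))
print(features.shape)  # one long feature vector describing the whole image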

Complete Code

#%%

%matplotlib inline
import pandas as pd
import seaborn as sns
from matplotlib.pylab import *

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import validation_curve, learning_curve

sns.set()
plt.rc('font', family='SimHei')
plt.rc('axes', unicode_minus=False)

#%% md

# Cross-validating a model with sklearn's built-in functions

Use the cross_val_score function to cross-validate a k-nearest-neighbor classifier on the iris dataset;

#%%

iris = pd.read_csv('./seaborn_dataset/iris.csv')
iris_x = iris.drop('species', axis=1)
iris_y = iris['species']
model = KNeighborsClassifier(n_neighbors=1)

res = cross_val_score(model, iris_x, iris_y, cv=5)
print('五轮交叉检验结果分别为:')
for i, r in enumerate(res):
    print(f'{r*100:0.3f}%', end=(', ' if i + 1 < res.size else '\n'))
print(f'模型准确率的均值为:{res.mean()*100:0.3f}%,标准差为:{res.std():0.3f}')

#%% md

# Validation Curves

Underfitting = high bias: the model fails to capture all the features of the data;
Overfitting = high variance: the model learns too much of the noise in the data;

A high-bias model performs similarly on the training set and the validation set;
A high-variance model performs far worse on the validation set than on the training set;

Example of computing a model's validation curve, using a polynomial regression model;
The validation_curve function computes the validation curve (the relationship between model complexity and the coefficient of determination) directly;

#%%

model1 = make_pipeline(PolynomialFeatures(1), LinearRegression())
model2 = make_pipeline(PolynomialFeatures(10), LinearRegression())

x = np.random.rand(50)[:, np.newaxis] * 100
y = (x + 5 * np.random.rand(50)[:, np.newaxis]) ** 0.2

x_test = np.linspace(0, 100, 100)[:, np.newaxis]
model1.fit(x, y)
model2.fit(x, y)
res1 = model1.predict(x_test)
res2 = model2.predict(x_test)

fig, axs = plt.subplots(1, 2, figsize=(20, 10))
axs[0].plot(x, y, 'o')
axs[0].plot(x_test, res1)
axs[0].set_title('多项式拟合,阶数1,欠拟合')
axs[1].plot(x, y, 'o')
axs[1].plot(x_test, res2)
axs[1].set_title('多项式拟合,阶数10,过拟合')

#%%

x_train = np.random.rand(100)
y_train = 10 - 1 / (x_train ** 2 + 0.1) + 2 * np.random.rand(100)
x_test = linspace(-0.1, 1.1, 1000)
x_train, y_train, x_test = x_train[:, np.newaxis], y_train[:, np.newaxis], x_test[:, np.newaxis]

plt.figure(figsize=(10, 10))
plt.plot(x_train, y_train, linestyle='', marker='o', markersize=5, alpha=0.8)

for d in [1, 2, 6]:
    lin = LinearRegression()
    pol = PolynomialFeatures(degree=d)
    lin.fit(pol.fit_transform(x_train), y_train)
    res = lin.predict(pol.fit_transform(x_test))
    plt.plot(x_test, res, label=f'拟合阶数:{d}')
plt.title('使用不同阶数的多项式模型拟合曲线')
plt.legend()

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
train_score, val_score = validation_curve(
    estimator=model,
    X=x_train,
    y=y_train,
    param_name='polynomialfeatures__degree',
    param_range=range(0, 20),
    cv=7,
)

plt.figure(figsize=(10, 10))
plt.plot(range(0, 20), np.mean(train_score, axis=1), label='训练得分')
plt.plot(range(0, 20), np.mean(val_score, axis=1), label='验证得分')
plt.xlabel('多项式阶数')
plt.ylabel('判定系数')
plt.title('模型验证曲线')
plt.legend()

#%% md

# Learning Curves

A learning curve is the curve of validation score against training-set size;
For a model of fixed complexity, a small training set tends to cause overfitting and a large one tends toward underfitting;
As the training set grows, the training score and the validation score gradually converge toward a fixed value;
In sklearn, learning curves can be computed directly with the learning_curve function;

The example shows that, as the training set grows, the more complex model converges to a higher coefficient of determination.

#%%

x_train = np.random.rand(150) * 10
y_train = (x_train + 1 * np.random.rand(150)) ** 0.2
x_train, y_train = x_train[:, np.newaxis], y_train[:, np.newaxis]

plt.figure(figsize=(7, 7))
plt.plot(x_train, y_train, linestyle='', marker='o', markersize=2, alpha=0.8)
plt.title('训练数据集')

fig, axs = plt.subplots(1, 2, figsize=(20, 10))
for i, d in enumerate([5, 8]):
    N, train_score, val_score = learning_curve(
        estimator=make_pipeline(PolynomialFeatures(degree=d), LinearRegression()),
        X=x_train,
        y=y_train,
        train_sizes=linspace(0.2, 1, 20),
        cv=7,
    )

    ax = axs[i]  # type: plt.Axes
    ax.plot(N, np.mean(train_score, axis=1), label='训练得分')
    ax.plot(N, np.mean(val_score, axis=1), label='验证得分')
    ax.hlines(np.mean([train_score[-1], val_score[-1]]), N[0], N[-1], linestyle='--', color='gray')
    ax.set_xlabel('训练集大小')
    ax.set_ylabel('判定系数')
    ax.set_title(f'多项式阶数={d}')
    ax.legend()

#%% md

# Grid Search

Because we need to consider training-set size, model complexity, and the coefficient of determination together, obtaining the best coefficient of determination means finding the highest point on a three-dimensional surface;
sklearn ships an automated parameter-search tool, GridSearchCV;

#%%

x_train = np.random.rand(100)
y_train = 10 - 1 / (x_train ** 2 + 0.1) + 2 * np.random.rand(100)
x_train, y_train = x_train[:, np.newaxis], y_train[:, np.newaxis]
x_test = linspace(-0.1, 1.1, 1000)[:, np.newaxis]

grid_param = {
    'polynomialfeatures__degree': np.arange(21),
    'linearregression__fit_intercept': [True, False],
    'linearregression__normalize': [True, False]}  # note: normalize was removed from LinearRegression in newer scikit-learn versions
grid = GridSearchCV(make_pipeline(PolynomialFeatures(degree=2), LinearRegression()), grid_param, cv=7)
grid.fit(x_train, y_train)

print('最佳模型参数:')
for k in grid.best_params_.keys():
    print(f'{k:<40} -> {grid.best_params_[k]}')

print(f'\n最佳判定系数:{grid.best_score_:0.3f}')

model = grid.best_estimator_
res = model.predict(x_test)

plt.figure(figsize=(10, 10))
plt.plot(x_train, y_train, linestyle='', marker='o', markersize=4, alpha=0.8)
plt.plot(x_test, res)
plt.title('使用网格搜索寻找模型最佳参数并拟合')

#%% md

# Categorical Features

Categorical features are usually handled with **one-hot encoding**, implemented in sklearn by DictVectorizer;
The drawback of one-hot encoding is that it significantly increases the dimensionality of the data; for storage efficiency, a sparse-matrix representation can be used;

#%%

data = [
    dict(a=300, b=200, c='A'),
    dict(a=500, b=600, c='B'),
    dict(a=100, b=900, c='C'),
]

vec = DictVectorizer(dtype=np.int64, sparse=False)
data = vec.fit_transform(data)

print(data)
print(vec.get_feature_names())

#%% md

# Text Features

The most common ways to compute text features are word counts and term frequency-inverse document frequency (TF-IDF); compared with the former, the latter keeps common words from receiving too much weight.
In sklearn, word counts are computed with CountVectorizer, and TF-IDF with TfidfVectorizer;

#%%

sample = [
    'Beautiful is better than ugly.',
    'Explicit is better than implicit.',
    'Simple is better than complex.',
    'Complex is better than complicated.',
    'Flat is better than nested.',
    'Sparse is better than dense.',
    'Readability counts.',
    'Special cases aren\'t special enough to break the rules.',
    'Although practicality beats purity.',
    'Errors should never pass silently.',
    'Unless explicitly silenced.',
    'In the face of ambiguity, refuse the temptation to guess.',
    'There should be one-- and preferably only one --obvious way to do it.',
    'Although that way may not be obvious at first unless you\'re Dutch.',
    'Now is better than never.',
    'Although never is often better than *right* now.',
    'If the implementation is hard to explain, it\'s a bad idea.',
    'If the implementation is easy to explain, it may be a good idea.',
    'Namespaces are one honking great idea -- let\'s do more of those!',
]

vec = CountVectorizer()
res = vec.fit_transform(sample)
print('CountVectorizer结果:')
print(pd.DataFrame(data=res.toarray(), columns=vec.get_feature_names()).head())

vec = TfidfVectorizer()
res = vec.fit_transform(sample)
print('TfidfVectorizer结果:')
print(pd.DataFrame(data=res.toarray(), columns=vec.get_feature_names()).head())

#%% md

# Image Features

More tools for image feature extraction can be found in the `Scikit-Image` project; they are not covered in detail here;