Using AWS SageMaker for a Machine Learning Project

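This article walks through the full process of a machine learning project on Amazon SageMaker, including data preprocessing, feature engineering, model training, and deployment, and uses ensemble learning to improve prediction quality.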

This article describes how to use AWS SageMaker for a machine learning project.

1. Problem

The problem comes from the Alibaba Tianchi competition "Industrial Steam Volume Prediction" (工业蒸汽量预测), available at:

https://tianchi.aliyun.com/competition/entrance/231693/introduction

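Data provided: anonymized boiler sensor readings, sampled at minute-level frequency.

Prediction target: the amount of steam produced, given the boiler's operating conditions.

Data description: the data is split into a training set (train.txt) and a test set (test.txt). The 38 fields "V0" through "V37" are the feature variables, and "target" is the target variable. Participants train a model on the training data and predict the target variable for the test data; the ranking is based on the MSE (mean squared error) of the predictions.

Evaluation: predictions are scored by mean squared error.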
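
As a concrete reading of the metric, MSE is the average of the squared differences between the true and predicted target values: MSE = (1/n) * sum((y_i - yhat_i)^2). A minimal sketch with NumPy follows; the function name and the sample values are made up for illustration and are not part of the competition material.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Return the mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean((y_true - y_pred) ** 2))


# Made-up values, for illustration only (not competition data)
example_true = [0.5, -0.2, 1.0]
example_pred = [0.4, 0.0, 0.7]
example_mse = mean_squared_error(example_true, example_pred)
print(example_mse)
```

A lower value is better, and because the errors are squared, a few large misses weigh more heavily than many small ones.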

2. AWS SageMaker

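AWS SageMaker is the machine learning service offered by Amazon Web Services. It brings together a set of purpose-built ML capabilities that help data scientists and developers quickly prepare, build, train, and deploy high-quality machine learning models.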

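We start with a SageMaker Notebook Instance for data exploration, cleaning, and preparation. A Notebook Instance runs a Jupyter notebook server on which code can be written and tested.

Creating a conda_python3 notebook in Jupyter is enough to start exploring and processing the data.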
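
The loading code in section 3.1 reads the two files through the variables train_data_uri and test_data_uri, whose definitions the article does not show. One way they might be defined, assuming the files were uploaded to S3, is sketched here; the bucket name, the prefix, and the helper build_s3_uri are all hypothetical.

```python
def build_s3_uri(bucket: str, key: str) -> str:
    """Join a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket.strip('/')}/{key.lstrip('/')}"


# Hypothetical bucket and prefix; replace them with your own
bucket = "my-example-bucket"
prefix = "steam-prediction/data"

train_data_uri = build_s3_uri(bucket, f"{prefix}/train.txt")
test_data_uri = build_s3_uri(bucket, f"{prefix}/test.txt")
print(train_data_uri)
print(test_data_uri)
```

pandas can read s3:// URIs directly when the s3fs package is installed, which is presumably why the article imports it.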

3. Data Exploration

3.1. Initial Exploration

First, take a quick look at the data:

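import pandas as pd
import s3fs  # not called directly; pandas needs it installed to read s3:// URIs
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

# On matplotlib >= 3.6 this style is named 'seaborn-v0_8'
plt.style.use('seaborn')
%matplotlib inline

# train_data_uri and test_data_uri are the locations of train.txt and test.txt
# (their definitions are not shown in the original article)
train_raw = pd.read_csv(train_data_uri, sep='\t', encoding='utf-8')
test_raw = pd.read_csv(test_data_uri, sep='\t', encoding='utf-8')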

train_raw.head()

train_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 …
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB

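From the training set's info() output we learn that:

The training set has 2888 samples, 38 feature fields (V0 - V37), and 1 target
All features are continuous
The label is continuous, so a regression model is needed for prediction
No feature has missing values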
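
The same conclusions can also be checked programmatically rather than read off the info() listing. The sketch below is one way to do that; summarize_frame is a hypothetical helper, and the small DataFrame is synthetic stand-in data, not the competition data.

```python
import numpy as np
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect the facts read from info(): shape, dtypes, and missing values."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "all_float": all(pd.api.types.is_float_dtype(t) for t in df.dtypes),
        "n_missing": int(df.isna().sum().sum()),
    }


# Synthetic stand-in for train_raw
rng = np.random.default_rng(0)
info_demo = pd.DataFrame(rng.normal(size=(5, 3)), columns=["V0", "V1", "target"])
info_summary = summarize_frame(info_demo)
print(info_summary)
```

Calling summarize_frame(train_raw) and summarize_frame(test_raw) in the notebook would give the corresponding figures for the real data.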

info() for the test set:

test_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 2   V2      1925 non-null   float64
 …
 36  V36     1925 non-null   float64
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB

From the test set's info() output we learn that:

The test set has 1925 samples and 38 fields (V0 - V37)
All features are continuous
There are no missing values

Calling describe() on the DataFrame would produce summary statistics for all 39 fields, which is hard to take in by eye, so the next step is to visualize the data.

3.2. Data Visualization

3.2.1. Box Plots

We first use box plots to look for outliers, taking feature V1 as an example:

fig = plt.figure(figsize=(4, 6))
# Pass the frame as data= so the call is unambiguous across seaborn versions
sns.boxplot(data=train_raw[['V1']], orient='v', width=0.5, palette="Set3")

This feature clearly has a large number of outliers. Next, we draw box plots for all features:

# boxplot for all features
columns = train_raw.columns[:-1]  # every column except the trailing 'target'

fig = plt.figure(figsize=(80, 100), dpi=75)
for i, col in enumerate(columns):
    plt.subplot(7, 6, i + 1)  # a 7 x 6 grid has room for all 38 features
    sns.boxplot(y=train_raw[col], width=0.5, palette="Set3")
    plt.ylabel(col)
plt.show()

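Part of the output is shown below:

These plots show that most features contain at least some outliers, which will need further handling during feature engineering.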
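
To put numbers on what the box plots show, one can count, per feature, the points that fall outside the whiskers. The sketch below assumes the common 1.5 x IQR whisker rule; count_iqr_outliers is a hypothetical helper, and the demo frame is made-up data.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Count values beyond Q1 - whis*IQR or Q3 + whis*IQR in each column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    # Comparing a DataFrame with a Series aligns the Series on the column labels
    outside = (df < lower) | (df > upper)
    return outside.sum().sort_values(ascending=False)


# Made-up data: column "a" contains one extreme value, column "b" none
outlier_demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0],
})
outlier_counts = count_iqr_outliers(outlier_demo)
print(outlier_counts)
```

In the notebook, count_iqr_outliers(train_raw[columns]) would rank the features by how many such points they contain.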

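3.2.2. Histograms and Q-Q Plots

Next we examine how the data is distributed, and in particular whether it is normal, using histograms and Q-Q plots.

Starting with feature V0 as an example: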

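plt.figure(figsize=(10, 5))

# Left: histogram with a fitted normal curve
# (sns.distplot is deprecated in recent seaborn releases; kept as in the original)
ax1 = plt.subplot(121)
sns.distplot(train_raw['V0'], fit=stats.norm)

# Right: Q-Q plot against the normal distribution
ax2 = plt.subplot(122)
res = stats.probplot(train_raw['V0'], plot=plt)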

In the training set, V0 is clearly not normally distributed. Next, we draw the histogram and Q-Q plot for every feature:

import warnings

# Silence all warnings (for example, the deprecation warning from sns.distplot)
warnings.filterwarnings("ignore")

plt.figure(figsize=(80, 190))

# 19 rows x 4 columns: two features per row, each with a histogram and a Q-Q plot
ax_index = 1

for col in columns:
    ax = plt.subplot(19, 4, ax_index)
    sns.distplot(train_raw[col], fit=stats.norm)
    ax_index += 1

    ax = plt.subplot(19, 4, ax_index)
    res = stats.probplot(train_raw[col], plot=plt)
    ax_index += 1

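Part of the output is shown below:

Some features follow a normal distribution, but most do not: their points do not lie along the diagonal of the Q-Q plot. A data transformation can be applied later to address this.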
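
With 38 features, reading every Q-Q plot by eye is tedious, so a numeric summary can help decide which features to look at first. The sketch below reports, per column, the sample skewness and the correlation coefficient r of the Q-Q fit returned by scipy.stats.probplot (values close to 1 mean the points lie near the line). This is a complement to the plots, not something the article itself does; normality_summary is a hypothetical helper and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def normality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return skewness and Q-Q correlation against the normal distribution per column."""
    rows = {}
    for col in df.columns:
        values = df[col].to_numpy()
        # probplot returns ((theoretical, ordered), (slope, intercept, r))
        _, (_, _, r) = stats.probplot(values, dist="norm")
        rows[col] = {"skew": float(stats.skew(values)), "qq_r": float(r)}
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data: one roughly normal column and one clearly skewed column
rng = np.random.default_rng(0)
normality_demo = pd.DataFrame({
    "normal_like": rng.normal(size=500),
    "skewed": rng.exponential(size=500),
})
normality_report = normality_summary(normality_demo)
print(normality_report)
```

Features with a large absolute skew or a noticeably lower qq_r are the natural candidates for the data transformation mentioned above.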

3.2.3. KDE Plots

KDE (Kernel Density Estimation) can be thought of as a windowed smoothing of the histogram. The plot gives a fairly direct view of how the data itself is distributed.
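
To make the idea concrete, the sketch below estimates a density from a sample with scipy.stats.gaussian_kde, assuming a Gaussian kernel and SciPy's default bandwidth, and checks that the estimated curve integrates to roughly 1. The sample is synthetic and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for one feature column
rng = np.random.default_rng(0)
kde_sample = rng.normal(loc=0.0, scale=1.0, size=1000)

# Fit the estimator, then evaluate the smoothed density on a regular grid
kde = stats.gaussian_kde(kde_sample)
kde_grid = np.linspace(-4.0, 4.0, 401)
kde_density = kde(kde_grid)

# Trapezoidal rule: a density should integrate to about 1 over its support
kde_area = float(np.sum((kde_density[1:] + kde_density[:-1]) / 2 * np.diff(kde_grid)))
kde_peak = float(kde_grid[np.argmax(kde_density)])
print(kde_area, kde_peak)
```

Unlike a histogram, the result does not depend on where bin edges happen to fall, only on the kernel and its bandwidth.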

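Here we draw KDE plots to inspect and compare the distribution of each feature in the training and test sets, in order to find features whose distributions differ between the two.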
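
The article makes this comparison visually. As a numeric complement, and not the author's method, a two-sample Kolmogorov-Smirnov test can rank features by how much their train and test distributions differ. The sketch below uses scipy.stats.ks_2samp; compare_distributions is a hypothetical helper and the two demo frames are synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def compare_distributions(train_df, test_df, columns):
    """Run a two-sample KS test per column and sort by the KS statistic."""
    rows = []
    for col in columns:
        result = stats.ks_2samp(train_df[col], test_df[col])
        rows.append({
            "feature": col,
            "ks_stat": float(result.statistic),
            "p_value": float(result.pvalue),
        })
    report = pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
    return report.reset_index(drop=True)


# Synthetic data: "same" shares a distribution, "shifted" has a different mean in test
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({"same": rng.normal(size=500), "shifted": rng.normal(size=500)})
demo_test = pd.DataFrame({
    "same": rng.normal(size=500),
    "shifted": rng.normal(loc=1.0, size=500),
})
ks_report = compare_distributions(demo_train, demo_test, ["same", "shifted"])
print(ks_report)
```

A large KS statistic with a small p-value points to a feature worth inspecting in the KDE plots; the test only flags a difference and says nothing about its shape.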

Again taking feature V0 as an example:

plt.figure(figsize=(10, 8))
# fill=True replaces the deprecated shade=True in recent seaborn releases
ax = sns.kdeplot(train_raw['V0'], color="Red", fill=True)
ax = sns.kdeplot(test_raw['V0'], color="Blue", fill=True)
ax.set_xlabel("V0")
ax.set_ylabel("Frequency")
ax.legend(['train', 'test'])

V0 has essentially the same distribution in both datasets. Next, we draw the training and test KDE curves for every feature:

# all features' kde plots

plt.figure(figsize=(40, 100))
ax_index = 1

for i in range(len(columns)):
    ax = plt.subplot(10, 4, ax_index)
    ax = sns.kdeplot(train_raw[columns[i]], color="Red"