19. Text, Multiclass Classification, and Neural Network Applications in Machine Learning

Article link: https://blog.youkuaiyun.com/a1b2c3d/article/details/154556604

1. LabelSpreading and LabelPropagation

LabelSpreading is a sister class of LabelPropagation, and the two are very similar. The following example uses LabelSpreading:

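from sklearn import semi_supervised

# X, y, and the dataset object d are assumed to be defined earlier in this series
ls = semi_supervised.LabelSpreading()
ls.fit(X, y)
# Model parameters echoed by the fit call:
# LabelSpreading(alpha=0.2, gamma=20, kernel='rbf', max_iter=30, n_jobs=1, n_neighbors=7, tol=0.001)
# Measure accuracy against the true labels
(ls.predict(X) == d.target).mean()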

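Because of the way it works, LabelSpreading is the more robust of the two when the labels are noisy. Both classes build a graph over the data points, with edge weights set by a specific formula (an RBF kernel by default). Labeled points propagate their labels to unlabeled points, and the edge weights determine how much of each label is passed along. The weights can be arranged in a transition probability matrix, and iterating with that matrix yields a good estimate of the actual labels.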
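
To make the propagation idea concrete, here is a minimal, self-contained sketch. It assumes the iris dataset and a randomly chosen 70% of hidden labels, neither of which comes from the original example; the names `X_demo`, `y_partial`, and `ls_demo` are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading

# Load a small labeled dataset (iris is an assumption made for illustration)
iris = datasets.load_iris()
X_demo, y_true = iris.data, iris.target

# Hide roughly 70% of the labels; scikit-learn marks unlabeled points with -1
rng = np.random.RandomState(0)
y_partial = y_true.copy()
unlabeled_mask = rng.rand(len(y_true)) < 0.7
y_partial[unlabeled_mask] = -1

# Build the similarity graph and spread the known labels along its edges
ls_demo = LabelSpreading(kernel='rbf', gamma=20, alpha=0.2)
ls_demo.fit(X_demo, y_partial)

# transduction_ holds the label inferred for every training point
inferred = ls_demo.transduction_
accuracy_on_hidden = (inferred[unlabeled_mask] == y_true[unlabeled_mask]).mean()
print("Accuracy on points whose labels were hidden:", accuracy_on_hidden)
```

Scoring only the points whose labels were hidden gives a fairer picture than scoring all points, since the labeled points were seen during fitting.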

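2. Perceptron classifier

2.1 Getting ready

The perceptron is the basic building block of neural networks, which matter a great deal in machine learning and in computer vision in particular. Using a perceptron classifier involves the following steps:
1. Load the UCI diabetes classification dataset.
2. Split the dataset into training and test sets.
3. Import the perceptron.
4. Instantiate the perceptron.
5. Train the perceptron.
6. Evaluate the perceptron on the test set or compute a cross-validation score.

The following code loads the dataset: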

import numpy as np
import pandas as pd
data_web_address = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
column_names = ['pregnancy_x', 'plasma_con', 'blood_pressure', 'skin_mm', 'insulin', 'bmi', 'pedigree_func', 'age', 'target']
feature_names = column_names[:-1]
all_data = pd.read_csv(data_web_address, names=column_names)
X = all_data[feature_names]
y = all_data['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

2.2 How to do it

Feature scaling: fit the scaler on the training set only, then apply it to both the training and test sets.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Instantiate and train the perceptron:

from sklearn.linear_model import Perceptron
pr = Perceptron()
pr.fit(X_train_scaled, y_train)
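# Model parameters echoed by the fit call (from an older scikit-learn release;
# newer releases replace n_iter with max_iter and tol):
# Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True, n_iter=5, n_jobs=1, penalty=None, random_state=0, shuffle=True, verbose=0, warm_start=False)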

Measure the cross-validation score: use roc_auc as the scoring metric with stratified k-fold cross-validation.

from sklearn.model_selection import cross_val_score, StratifiedKFold
skf = StratifiedKFold(n_splits=3)
cross_val_score(pr, X_train_scaled, y_train, cv=skf, scoring='roc_auc').mean()

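Measure test-set performance: import accuracy_score and roc_auc_score for the evaluation.

from sklearn.metrics import accuracy_score, roc_auc_score
print("Classification accuracy : ", accuracy_score(y_test, pr.predict(X_test_scaled)))
# ROC AUC ranks continuous scores, so pass decision_function output rather than hard labels
print("ROC - AUC Score : ", roc_auc_score(y_test, pr.decision_function(X_test_scaled)))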

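On the test set the perceptron performs acceptably, though slightly worse than logistic regression.

The perceptron is a simplified model of a neuron in the brain. It takes inputs, combines them with learned weights and a bias term into a linear function, and passes the result through an activation function to produce a class. At each iteration the weights are readjusted to minimize a loss function.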
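
The description above can be sketched in a few lines of NumPy. This is a minimal illustration of the classic perceptron update rule, not scikit-learn's implementation; the function names `train_perceptron` and `predict_perceptron` and the toy AND dataset are assumptions made for this example.

```python
import numpy as np

def train_perceptron(X, y, eta0=1.0, n_epochs=10):
    """Train a binary perceptron; y must contain the labels 0 and 1."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            # Linear function of the inputs followed by a step activation
            prediction = 1 if np.dot(weights, x_i) + bias > 0 else 0
            # error is 0 for a correct prediction, so weights change only on mistakes
            error = y_i - prediction
            weights += eta0 * error * x_i
            bias += eta0 * error
    return weights, bias

def predict_perceptron(X, weights, bias):
    """Apply the learned linear function and step activation to each row of X."""
    return (X @ weights + bias > 0).astype(int)

# Toy, linearly separable data: the logical AND of two inputs
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1])
w, b = train_perceptron(X_toy, y_toy)
print(predict_perceptron(X_toy, w, b))  # [0 0 0 1]
```

The update only fires on misclassified points, which is why the perceptron settles on separable data such as this one but can keep oscillating when the classes overlap.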

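As computing power has grown, neural networks and perceptrons have been able to tackle increasingly complex problems, and training times keep falling.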

2.3 Hyperparameter tuning

The perceptron's hyperparameters, such as alpha, penalty, class_weight, and max_iter, can be tuned with a grid search. The following code runs the grid search:

from sklearn.model_selection import GridSearchCV
param_dist = {'alpha': [0.1, 0.01, 0.001, 0.0001],
              'penalty': [None, 'l2', 'l1', 'elasticnet'],
              'random_state': [7],
              'class_weight': ['balanced', None],
              'eta0': [0.25, 0.5, 0.75, 1.0],
              'warm_start': [True, False],
              'max_iter': [50],
              'tol': [1e-3]}
gs_perceptron = GridSearchCV(pr, param_dist, scoring='roc_auc', cv=skf).fit(X_train_scaled, y_train)

View the best parameters and the best score:

gs_perceptron.best_params_
gs_perceptron.best_score_

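Tuning hyperparameters with cross-validation can improve the results. An ensemble of perceptrons, for example bagging, is another option.

The following code applies bagging: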

best_perceptron = gs_perceptron.best_estimator_
from sklearn.ensemble import BaggingClassifier
param_dist = {
    'max_samples': [0.5, 1.0],
    'max_features': [0.5, 1.0],
    'oob_score': [True, False],
    'n_estimators': [100],
    'n_jobs': [-1],
    # scikit-learn >= 1.2 names this parameter `estimator` (older releases: `base_estimator`)
    'estimator__alpha': [0.001, 0.002],
    'estimator__penalty': [None, 'l2', 'l1', 'elasticnet']
}
ensemble_estimator = BaggingClassifier(estimator=best_perceptron)
bag_perceptrons = GridSearchCV(ensemble_estimator, param_dist, scoring='roc_auc', cv=skf, n_jobs=-1).fit(X_train_scaled, y_train)

View the new cross-validation score and the best parameters:

bag_perceptrons.best_score_
bag_perceptrons.best_params_

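On this dataset, the bagged ensemble of perceptrons performs better than a single perceptron.

3. Multilayer perceptron neural network

3.1 Getting ready

Using a multilayer perceptron neural network involves the following steps:
1. Load the data.
2. Scale the data with a standard scaler.
3. Run a hyperparameter search, tuning the alpha parameter first.

The following code loads the California housing dataset: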

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
cali_housing = fetch_california_housing()
X = cali_housing.data
y = cali_housing.target
bins = np.arange(6)
binned_y = np.digitize(y, bins)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=binned_y)
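
The `stratify` argument of `train_test_split` expects discrete classes, which is why the continuous housing target is binned with `np.digitize` before the split. The small sketch below shows what that binning does; the values in `y_demo` are made up for illustration.

```python
import numpy as np

# Hypothetical target values standing in for the housing target
y_demo = np.array([0.5, 1.5, 2.7, 4.9, 5.2])
bins_demo = np.arange(6)  # bin edges 0, 1, 2, 3, 4, 5

# Each value is replaced by the index of the bin it falls into;
# values past the last edge land in an extra, final bin
binned_demo = np.digitize(y_demo, bins_demo)
print(binned_demo)  # [1 2 3 5 6]
```

Stratifying on these bin indices keeps the distribution of low, medium, and high target values roughly the same in the training and test sets.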

3.2 How to do it

Feature scaling: fit the scaler on the training data only, then apply it to both the training and test sets.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Randomized hyperparameter search:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPRegressor
param_grid = {'alpha': [10, 1, 0.1, 0.01],
              'hidden_layer_sizes': [(50, 50, 50), (50, 50, 50, 50, 50)],
              'activation': ['relu', 'logistic'],
              'solver': ['adam']}
pre_gs_inst = RandomizedSearchCV(MLPRegressor(random_state=7),
                                 param_distributions=param_grid,
                                 cv=3,
                                 n_iter=15,
                                 random_state=7)
pre_gs_inst.fit(X_train_scaled, y_train)

View the best score and the best parameters:

pre_gs_inst.best_score_
pre_gs_inst.best_params_

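3.3 How it works

In a neural network, the output of a single perceptron is a function of the dot product of its weights and inputs, that is, a weighted sum of the inputs. The activation function can be, for example, a sigmoid curve. In scikit-learn, the available activation functions are identity, logistic, tanh, and relu.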
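
As a rough illustration of the four activation options, the sketch below computes the output of one neuron by hand. The helper `neuron_output`, the dictionary `ACTIVATIONS`, and the input values are assumptions made for this example; scikit-learn applies the same functions internally but does not expose this exact API.

```python
import numpy as np

# The four activation functions named above, written out explicitly
ACTIVATIONS = {
    'identity': lambda z: z,
    'logistic': lambda z: 1.0 / (1.0 + np.exp(-z)),
    'tanh': np.tanh,
    'relu': lambda z: np.maximum(0.0, z),
}

def neuron_output(x, weights, bias, activation='relu'):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = np.dot(weights, x) + bias
    return ACTIVATIONS[activation](z)

# Made-up inputs and weights; the weighted sum works out to -0.2
x_in = np.array([1.0, -2.0, 0.5])
w_in = np.array([0.4, 0.3, -0.2])
b_in = 0.1

for name in ACTIVATIONS:
    print(name, neuron_output(x_in, w_in, b_in, activation=name))
```

With a negative weighted sum, relu outputs exactly zero while logistic and tanh still pass a small signal through, which is one reason the choice of activation changes how a network trains.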

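The neural network here was applied to the California housing dataset, which seems better suited to nonlinear algorithms such as trees and tree ensembles. In the end the neural network performed acceptably, but not as well as gradient boosting machines, and at a higher computational cost.

4. Philosophical thoughts on neural networks

Mathematically, neural networks are universal function approximators: given enough hidden units, they can approximate essentially any continuous function. Hidden layers are usually interpreted as intermediate steps that the network learns on its own, without being programmed by a human. The same view can be applied to other estimators, such as random forests. Whether neural networks are truly intelligent is debated, but this way of thinking is helpful as the field develops and machines become more capable. At the same time, stay focused on practical results.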

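5. Stacking with a neural network

Stacking is a meta-learning method. It is used less often than bagging and boosting, but it can combine models of different types. The process is as follows:
1. Split the dataset into training and test sets.
2. Split the training set into two subsets.
3. Train the base learners on the first training subset.
4. Use the base learners to make predictions on the second training subset.
5. Store these prediction vectors.
6. Train a higher-level learner on the second training subset, with the stored prediction vectors as inputs and the target variable as output.
7. Evaluate the whole process on the test set.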
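
The seven steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the data is synthetic (`make_regression`), the base learners are deliberately small stand-ins rather than the models used in this article, and names such as `X_first`, `X_second`, and `base_predictions` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data stands in for a real dataset
X_all, y_all = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=7)

# Step 1: split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=7)
# Step 2: split the training set into two subsets
X_first, X_second, y_first, y_second = train_test_split(
    X_tr, y_tr, test_size=0.33, random_state=7)

# Step 3: train the base learners on the first subset
base_learners = [
    Ridge(alpha=1.0),
    KNeighborsRegressor(n_neighbors=5),
    GradientBoostingRegressor(n_estimators=50, random_state=7),
]
for learner in base_learners:
    learner.fit(X_first, y_first)

def base_predictions(learners, X):
    """Return one column of predictions per base learner."""
    return np.column_stack([learner.predict(X) for learner in learners])

# Steps 4 and 5: predict on the second subset and store the prediction vectors
stack_features = base_predictions(base_learners, X_second)

# Step 6: train the higher-level learner on the stored predictions
meta_learner = LinearRegression().fit(stack_features, y_second)

# Step 7: evaluate the whole process on the untouched test set
test_score = meta_learner.score(base_predictions(base_learners, X_te), y_te)
print("Stacked R^2 on the test set:", test_score)
```

Training the higher-level learner on a subset the base learners never saw keeps it from simply rewarding whichever base model overfit the most.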

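The following mermaid flowchart shows the stacking process:

graph LR
    A[Dataset] --> B(Training and test sets)
    B --> C(Split training set into two subsets)
    C --> D(Train base learners)
    D --> E(Base learner predictions)
    E --> F(Store prediction vectors)
    F --> G(Train higher-level learner)
    G --> H(Evaluate on test set)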

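For stacking, we will use three base regressors: a neural network, a single gradient boosting machine, and a bagged ensemble of gradient boosting machines. The concrete steps are described in the sections that follow.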

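6. Stacking in practice

6.1 First base model - neural network

Run a cross-validated grid search on the first training subset X_1 and its target y_1 to find the best neural network parameters for this dataset. Only the alpha parameter is tuned here. Remember to scale the inputs; otherwise the network will perform poorly.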

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mlp_pipe = Pipeline(steps=[('scale', StandardScaler()), ('neural_net', MLPRegressor())])
param_grid = {
    'neural_net__alpha': [0.02, 0.01, 0.005],
    'neural_net__hidden_layer_sizes': [(50, 50, 50)],
    'neural_net__activation': ['relu'],
    'neural_net__solver': ['adam']
}
neural_net_gs = GridSearchCV(mlp_pipe, param_grid=param_grid, cv=3, n_jobs=-1)
neural_net_gs.fit(X_1, y_1)

View the best parameters and the best score from the grid search:

neural_net_gs.best_params_
neural_net_gs.best_score_

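Serialize the best-performing neural network from the grid search so it does not have to be retrained:

nn_best = neural_net_gs.best_estimator_
import pickle
# The with statement closes the file even if pickling raises an error
with open('nn_best.save', 'wb') as f:
    pickle.dump(nn_best, f, protocol=pickle.HIGHEST_PROTOCOL)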

6.2 Second base model - gradient boosting ensemble

Run a randomized search over gradient boosting trees:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    'learning_rate': [0.1, 0.05, 0.03, 0.01],
    'loss': ['huber'],
    'max_depth': [5, 7, 10],
    'max_features': [0.4, 0.6, 0.8, 1.0],
    'min_samples_leaf': [2, 3, 5],
    'n_estimators': [100],
    'warm_start': [True],
    'random_state': [7]
}
boost_gs = RandomizedSearchCV(GradientBoostingRegressor(), param_distributions=param_grid, cv=3, n_jobs=-1, n_iter=25)
boost_gs.fit(X_1, y_1)

View the best score and parameters:

boost_gs.best_score_
boost_gs.best_params_

Increase the number of estimators and train:

gbt_inst = GradientBoostingRegressor(**{
    'learning_rate': 0.1,
    'loss': 'huber',
    'max_depth': 10,
    'max_features': 0.4,
    'min_samples_leaf': 5,
    'n_estimators': 4000,
    'warm_start': True,
    'random_state': 7
}).fit(X_1, y_1)

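Serialize the estimator, wrapping the pickling code in a function for convenience:

def pickle_func(filename, saved_object):
    import pickle
    # The with statement closes the file even if pickling raises an error
    with open(filename, 'wb') as f:
        pickle.dump(saved_object, f, protocol=pickle.HIGHEST_PROTOCOL)
    return None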

pickle_func('grad_boost.save', gbt_inst)
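
A saved estimator is only useful if it can be read back. The sketch below adds a matching loader; `unpickle_func` is a hypothetical helper name that does not appear in the original, and a plain dictionary stands in for a fitted model so the example runs on its own.

```python
import os
import pickle
import tempfile

def unpickle_func(filename):
    """Load and return an object previously saved with pickle."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Round-trip check in a temporary directory, using a dictionary as a stand-in model
demo_object = {'learning_rate': 0.1, 'n_estimators': 4000}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.save')
    with open(demo_path, 'wb') as f:
        pickle.dump(demo_object, f, protocol=pickle.HIGHEST_PROTOCOL)
    restored = unpickle_func(demo_path)

print(restored == demo_object)  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. Pickled scikit-learn models are also best loaded with the same library version that saved them.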

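6.3 Third base model - bagging regressor of gradient boosting ensembles

Run a small randomized search over a bagged ensemble of gradient boosting trees: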

from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'max_samples': [0.5, 1.0],
    'max_features': [0.5, 1.0],
    'oob_score': [True, False],
    # scikit-learn >= 1.2 names this parameter `estimator` (older releases: `base_estimator`)
    'estimator__min_samples_leaf': [4, 5],
    'n_estimators': [20]
}
single_estimator = GradientBoostingRegressor(**{
    'learning_rate': 0.1,
    'loss': 'huber',
    'max_depth': 10,
    'max_features': 0.4,
    'n_estimators': 20,
    'warm_start': True,
    'random_state': 7
})
ensemble_estimator = BaggingRegressor(estimator=single_estimator)
pre_gs_inst_bag = RandomizedSearchCV(ensemble_estimator, param_distributions=param_dist, cv=3, n_iter=5, n_jobs=-1)
pre_gs_inst_bag.fit(X_1, y_1)

View the best score and parameters:

pre_gs_inst_bag.best_score_
pre_gs_inst_bag.best_params_

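7. Summary

This article covered several machine learning topics: LabelSpreading, the perceptron classifier, multilayer perceptron neural networks, some philosophical thoughts on neural networks, and stacking with neural networks. The table below compares the methods:

| Method | Advantages | Disadvantages | Typical use |
| ---- | ---- | ---- | ---- |
| LabelSpreading | Fairly robust | Results can still be noisy | Semi-supervised learning |
| Perceptron classifier | Simple structure, relatively fast to train | May underperform more complex models | Basic classification tasks |
| Multilayer perceptron neural network | Can model complex nonlinear relationships | High computational cost, long training time | Complex regression and classification tasks |
| Stacking with neural networks | Can combine models of different types | Relatively complex to implement | Cases that need the combined strengths of several models |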

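By studying and practicing these methods, we can choose one that fits the dataset and task at hand, and improve model performance by tuning hyperparameters and using ensembles. In practice, focus on how the model actually performs, and keep experimenting and refining to get better results.

The following flowchart sketches a simple decision process for choosing a model:

graph LR
    A[Dataset characteristics] --> B{Small dataset?}
    B -- Yes --> C(Perceptron classifier)
    B -- No --> D{Highly nonlinear data?}
    D -- Yes --> E(Multilayer perceptron neural network)
    D -- No --> F{Need to combine several models?}
    F -- Yes --> G(Stacking with neural networks)
    F -- No --> H(Other linear models)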