The Ultimate Summary of a Machine Learning Model: Gradient Boosting Regression!

奋进小青

Published 2025-02-22 10:00:55

Tags: machine learning, regression, artificial intelligence

Article link: https://blog.youkuaiyun.com/2201_75910862/article/details/145790687

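Algorithm Details

First, gradient boosting regression is an iterative ensemble learning method. Its goal is to combine multiple weak learners (usually decision trees), each one correcting the shortcomings of the model built so far, into a strong predictive model.

It follows the additive-model idea: every step adds a new weak learner to the existing model, and that learner is fitted along the negative gradient of the loss function, so the loss is reduced step by step.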
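To make the additive update concrete, here is a minimal from-scratch sketch, assuming a squared-error loss, for which the negative gradient is simply the residual y - F(x). Each round applies F_m(x) = F_{m-1}(x) + learning_rate * h_m(x). The helper names `fit_gradient_boosting` and `predict_gradient_boosting` and the demo data are hypothetical and only for illustration; this is not how scikit-learn implements the estimator internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_estimators=50, learning_rate=0.1, max_depth=3):
    """Illustrative squared-error gradient boosting (hypothetical helper)."""
    # Start from a constant model: the mean minimizes the squared error.
    init = float(np.mean(y))
    pred = np.full(len(y), init)
    trees, losses = [], []
    for _ in range(n_estimators):
        # For L = 0.5 * (y - F)^2 the negative gradient w.r.t. F is y - F.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Additive update: F_m = F_{m-1} + learning_rate * h_m
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
        losses.append(float(np.mean((y - pred) ** 2)))
    return init, trees, losses


def predict_gradient_boosting(X, init, trees, learning_rate=0.1):
    """Sum the constant start value and the shrunken tree outputs."""
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred


# Small synthetic demo (assumed data, similar in spirit to a noisy sine curve).
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=200)

init, trees, losses = fit_gradient_boosting(X_demo, y_demo)
print(f"training MSE after 1 tree: {losses[0]:.4f}, after {len(trees)} trees: {losses[-1]:.4f}")
```

The training loss should never increase from one round to the next in this sketch, because every tree is fitted to the current residuals and only a fraction of its output (the learning rate) is added.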

Model Training

Next, let's walk through a synthetic data example and implement the training process of gradient boosting regression in Python.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# 1. Data generation
np.random.seed(42)
n_samples = 500
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
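# Target function: a nonlinear sine function plus Gaussian noise with mean 0 and standard deviation 0.3
y = np.sin(X).ravel() + np.random.normal(scale=0.3, size=n_samples)

# Split the dataset into training and test sets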
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Model training: set the number of weak learners, the learning rate, and the maximum tree depth
gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_reg.fit(X_train, y_train)

# 3. Model prediction
y_train_pred = gb_reg.predict(X_train)
y_test_pred = gb_reg.predict(X_test)

# 4. Plot the training loss curve
# gb_reg.train_score_ holds the training-set loss after each boosting iteration
train_loss = gb_reg.train_score_

plt.figure(figsize=(10, 6))
# Use a 1-based x-axis so that the i-th point corresponds to i estimators
plt.plot(np.arange(1, len(train_loss) + 1), train_loss, color='red', lw=2, marker='o', label='Training Loss')
plt.xlabel('Number of Estimators', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Training Loss Curve', fontsize=14)
plt.legend()
plt.grid(True)
plt.show()

# 5. Plot predicted values against true values
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', alpha=0.6, label='True Values', s=50)
plt.scatter(X_test, y_test_pred, color='green', alpha=0.6, label='Predicted Values', s=50)
plt.xlabel('Input Feature', fontsize=12)
plt.ylabel('Target Value', fontsize=12)
plt.title('Predicted vs True Values', fontsize=14)
plt.legend()
plt.grid(True)
plt.show()

# 6. Plot the residuals (true value minus predicted value)
residuals = y_test - y_test_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_test_pred, residuals, color='magenta', alpha=0.6, s=50)
plt.axhline(y=0, color='black', linestyle='--', lw=2)
plt.xlabel('Predicted Values', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.title('Residual Plot', fontsize=14)
plt.grid(True)
plt.show()

# 7. Plot the residual histogram
plt.figure(figsize=(10, 6))
plt.hist(residuals, bins=30, color='orange', edgecolor='black', alpha=0.8)
plt.xlabel('Residual', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Residual Histogram', fontsize=14)
plt.grid(True)
plt.show()

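Data generation and splitting: a sine function is used to build a nonlinear relationship, noise is added, and the data is then split into a training set and a test set so that the model can be trained on one and its generalization ability evaluated on the other.
Model training: the model is built with GradientBoostingRegressor, using 100 weak learners, a learning rate of 0.1, and a maximum tree depth of 3. During training, the loss of the current model on the training set is computed at every iteration and stored in train_score_ (with the default subsample=1.0; with subsample < 1 it is the loss on the in-bag samples).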
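Since train_score_ only tracks the training set, it can be useful to look at the test-set error per iteration as well. The following is a minimal sketch, assuming the same synthetic data and hyperparameters as above. It uses `staged_predict`, which yields the predictions after each boosting stage, and computes the mean squared error explicitly, because the scale of `train_score_` depends on the library's internal loss definition. The variable names `train_mse`, `test_mse` and `best_n` are introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data and settings as in the example above.
np.random.seed(42)
n_samples = 500
X = np.linspace(0, 10, n_samples).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(scale=0.3, size=n_samples)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_reg.fit(X_train, y_train)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees.
train_mse = [mean_squared_error(y_train, p) for p in gb_reg.staged_predict(X_train)]
test_mse = [mean_squared_error(y_test, p) for p in gb_reg.staged_predict(X_test)]

# Number of trees with the lowest test error (1-based).
best_n = int(np.argmin(test_mse)) + 1
print(f"final train MSE: {train_mse[-1]:.4f}, final test MSE: {test_mse[-1]:.4f}")
print(f"lowest test MSE is reached with {best_n} trees")
```

If the test error bottoms out well before the last iteration while the training error keeps falling, that is a hint that fewer trees or a smaller learning rate may generalize better.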

For the data analysis, we have put together four kinds of plots to help with understanding: