16. Tree Algorithms and Ensemble Learning: An In-Depth Exploration from Decision Trees to Random Forests

a1b2c3d

Licensed under CC 4.0 BY-SA

Category: Mastering scikit-learn in Practice. Tags: decision tree, random forest, ensemble learning

Article link: https://blog.youkuaiyun.com/a1b2c3d/article/details/154556592


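Tree algorithms and ensemble learning are core techniques in machine learning, widely used for both classification and regression. This article covers tuning a decision tree, using decision trees for regression, random forest regression, and bagging regression with nearest neighbors, with concrete steps and code for each.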

Optimizing decision tree performance

To tune a decision tree, we can search over its hyperparameters with GridSearchCV. The steps are:
1. Instantiate the decision tree:

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
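The steps that follow refer to X, y, X_train, X_test, y_train, y_test and iris, none of which are defined in the text. Here is a minimal setup sketch, assuming the iris dataset restricted to its first two features (which the later use of iris.feature_names[:2] suggests). The 75/25 split, the stratification and the random_state value are assumptions for illustration, not settings taken from the article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :2]   # assumption: keep only sepal length and sepal width
y = iris.target

# assumption: stratified 75/25 split with a fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7)
print(X_train.shape, X_test.shape)
```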

2. Instantiate and fit GridSearchCV:

from sklearn.model_selection import GridSearchCV, cross_val_score
param_grid = {'criterion':['gini','entropy'], 'max_depth' : [3,5,7,20]}
gs_inst = GridSearchCV(dtc,param_grid=param_grid,cv=5)
gs_inst.fit(X_train, y_train)

The parameter grid param_grid varies the split criterion (gini or entropy) and the maximum depth of the tree.
3. Evaluate accuracy on the test set:

from sklearn.metrics import accuracy_score
y_pred_gs = gs_inst.predict(X_test)
accuracy_score(y_test, y_pred_gs)

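Running the code above gives the test-set accuracy of the tuned tree.
4. Look at the scores of all decision trees tried in the grid search:

list(zip(gs_inst.cv_results_['mean_test_score'], gs_inst.cv_results_['params']))

Older scikit-learn releases exposed this information as gs_inst.grid_scores_, but that attribute was removed in version 0.20, so the scores are read from cv_results_ instead. In the resulting list, shallower trees generally score better than deep ones, because deep trees tend to overfit.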
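The per-candidate scores can be easier to scan as a table. The sketch below loads cv_results_ into a pandas DataFrame and sorts it by rank. It is self-contained, so it refits a grid search on the first two iris features; the variable names gs, results and summary, and the random_state value, are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # assumption: first two features only

param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 7, 20]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  param_grid=param_grid, cv=5)
gs.fit(X, y)

# cv_results_ is a dict of arrays, one entry per parameter combination
results = pd.DataFrame(gs.cv_results_)
cols = ['param_criterion', 'param_max_depth',
        'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[cols].sort_values('rank_test_score')
print(summary)
```

Sorting by rank_test_score puts the best combination first, which makes it easy to see whether the shallow or the deep candidates are ahead.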
5. Select the best-performing tree:

gs_inst.best_estimator_

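6. Visualize the tree with graphviz:

import numpy as np
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn import tree
import pydot
from IPython.display import Image

# Write the fitted tree in DOT format to an in-memory buffer
dot_iris = StringIO()
tree.export_graphviz(gs_inst.best_estimator_, out_file = dot_iris, feature_names = iris.feature_names[:2])
# Recent pydot versions return a list of graphs, so take the first one
graph = pydot.graph_from_dot_data(dot_iris.getvalue())[0]
Image(graph.create_png())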
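If graphviz or pydot is not installed, the tree can still be inspected as plain text. The following sketch uses sklearn.tree.export_text on a small tree. The tree is fitted directly here with max_depth=3 rather than taken from the grid search, which is an assumption made to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # assumption: first two features only

# assumption: a shallow tree fitted directly, standing in for best_estimator_
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each indented line is one split; leaves show the predicted class
rules = export_text(small_tree, feature_names=list(iris.feature_names[:2]))
print(rules)
```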

To understand the decision tree better, we can add a further visualization:
1. Create a NumPy grid:

grid_interval = 0.02
# Range of each feature (the 0th and 100th percentiles are the min and max)
xmin, xmax = np.percentile(X[:, 0], [0, 100])
ymin, ymax = np.percentile(X[:, 1], [0, 100])
# Pad the plotting area by 0.5 on every side
xmin_plot, xmax_plot = xmin - .5, xmax + .5
ymin_plot, ymax_plot = ymin - .5, ymax + .5
xx, yy = np.meshgrid(np.arange(xmin_plot, xmax_plot, grid_interval), np.arange(ymin_plot, ymax_plot, grid_interval))

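2. Predict on the NumPy grid:

# np.c_ stacks the flattened grids into an (n_points, 2) array;
# np.array(zip(...)) only produced that on Python 2
test_preds = gs_inst.best_estimator_.predict(np.c_[xx.ravel(), yy.ravel()])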

3. Visualize the results:

import matplotlib.pyplot as plt
%matplotlib inline
X_0 = X[y == 0]
X_1 = X[y == 1]
X_2 = X[y == 2]
plt.figure(figsize=(15,8))
plt.scatter(X_0[:,0],X_0[:,1], color = 'red')
plt.scatter(X_1[:,0],X_1[:,1], color = 'blue')
plt.scatter(X_2[:,0],X_2[:,1], color = 'green')
colors = np.array(['r', 'b','g'])
plt.scatter(xx.ravel(), yy.ravel(), color=colors[test_preds], alpha=0.15)
plt.scatter(X[:, 0], X[:, 1], color=colors[y])
plt.title("Decision Tree Visualization")
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])

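This visualization shows how the decision tree classifies the iris species by carving the feature space into rectangles. Each split creates a line perpendicular to one feature axis. For example, the following code draws the first three of these lines (the thresholds come from one fitted tree and may differ on another train/test split):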

plt.axvline(x = 5.45, color='black')
plt.axvline(x = 6.2, color='black')
plt.plot((xmin_plot, 5.45), (2.8, 2.8), color='black')

Finally, we can plot how the maximum depth affects the cross-validation score:

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
from sklearn.model_selection import GridSearchCV, cross_val_score
max_depths = range(2,51)
param_grid = {'max_depth' : max_depths}
gs_inst = GridSearchCV(dtc, param_grid=param_grid,cv=5)
gs_inst.fit(X_train, y_train)
plt.plot(max_depths,gs_inst.cv_results_['mean_test_score'])
plt.xlabel('Max Depth')
plt.ylabel("Cross-validation Score")

The plot shows that larger maximum depths tend to lower the cross-validation score.
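Putting the training score next to the cross-validation score makes the overfitting visible: the training score keeps rising with depth while the cross-validation score does not. The sketch below assumes the first two iris features and uses GridSearchCV's return_train_score=True option. The depth range and the names gs_depth, train_scores and cv_scores are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # assumption: first two features only

depths = list(range(2, 21))
gs_depth = GridSearchCV(DecisionTreeClassifier(random_state=0),
                        param_grid={'max_depth': depths},
                        cv=5, return_train_score=True)
gs_depth.fit(X, y)

# Mean accuracy over the 5 folds, on the training part and the held-out part
train_scores = gs_depth.cv_results_['mean_train_score']
cv_scores = gs_depth.cv_results_['mean_test_score']
for d, tr, cv in zip(depths, train_scores, cv_scores):
    print(f"max_depth={d:2d}  train={tr:.3f}  cv={cv:.3f}")
```

A widening gap between the two columns as max_depth grows is the usual sign that the extra depth is fitting noise.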

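Decision trees for regression

Using a decision tree for regression is very similar to using one for classification. It takes four main steps:
1. Load the dataset, here scikit-learn's diabetes dataset:

# Use within a Jupyter notebook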
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
X_feature_names = ['age', 'gender', 'body mass index', 'average blood pressure','bl_0','bl_1','bl_2','bl_3','bl_4','bl_5']

2. Split the dataset into training and test sets, stratifying on a binned copy of the target:

pd.Series(y).hist(bins=50)
bins = 50*np.arange(8)
binned_y = np.digitize(y, bins)
pd.Series(binned_y).hist(bins=50)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=binned_y)

3. Instantiate the decision tree regressor and train it:

from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)

4. Evaluate the model on the test set:

y_pred = dtr.predict(X_test)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
(np.abs(y_test - y_pred)/(y_test)).mean()

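With these steps we have built a regression model for the diabetes dataset and obtained its error metrics: the mean absolute error (MAE) and the mean absolute percentage error (MAPE).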
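The two metrics are easy to check by hand on a tiny example. The sketch below defines a hypothetical helper, mean_absolute_percentage_error_manual, that mirrors the expression used above (it takes the absolute value of the denominator, which makes no difference for positive targets such as those in the diabetes data) and applies it to made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def mean_absolute_percentage_error_manual(y_true, y_pred):
    """Mean of |y_true - y_pred| / |y_true|; assumes y_true contains no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Made-up values, chosen so the arithmetic is easy to follow
y_true_demo = np.array([100.0, 200.0, 50.0])
y_pred_demo = np.array([110.0, 180.0, 50.0])

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)   # (10 + 20 + 0) / 3
mape_demo = mean_absolute_percentage_error_manual(
    y_true_demo, y_pred_demo)                              # (0.1 + 0.1 + 0) / 3
print(mae_demo, mape_demo)
```

MAE is in the units of the target, while MAPE is relative, so a fixed absolute error counts for more on small targets.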

To look further at how the errors are distributed, we can plot them with pandas:

pd.Series((y_test - y_pred)).hist(bins=50)
pd.Series((y_test - y_pred)/(y_test)).hist(bins=50)

Finally, we could inspect the tree itself, but note that the maximum depth was not tuned here, so the tree is likely to overfit.
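One way to address the untuned depth is to reuse the grid-search pattern from the classification section. The sketch below searches max_depth for a DecisionTreeRegressor, scored by negative MAE. The candidate depths, the unstratified seeded split and the name gs_dtr are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
# assumption: plain seeded split, without the binned stratification used above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

depth_grid = [2, 3, 4, 5, 7, 10, None]  # None lets the tree grow fully
gs_dtr = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={'max_depth': depth_grid},
                      scoring='neg_mean_absolute_error', cv=5)
gs_dtr.fit(X_train, y_train)

print("best max_depth:", gs_dtr.best_params_['max_depth'])
print("test MAE:", mean_absolute_error(y_test, gs_dtr.predict(X_test)))
```

scikit-learn maximizes scores, which is why the MAE is negated in the scoring name; best_score_ is therefore a negative number.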

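Random forest regression

A random forest is an ensemble algorithm that combines many decision trees to improve predictive performance. The steps to fit a random forest regressor are:
1. Load the dataset and split it into training and test sets: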

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
X_feature_names = ['age', 'gender', 'body mass index', 'average blood pressure','bl_0','bl_1','bl_2','bl_3','bl_4','bl_5']
bins = 50*np.arange(8)
binned_y = np.digitize(y, bins)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=binned_y)

2. Import and instantiate the random forest regressor, then train it:

from sklearn.ensemble import RandomForestRegressor
rft = RandomForestRegressor()
rft.fit(X_train, y_train)

3. Evaluate the prediction error:

y_pred = rft.predict(X_test)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
(np.abs(y_test - y_pred)/(y_test)).mean()

Compared with the single decision tree, the random forest has a lower error.
4. Access the individual trees in the random forest:

rft.estimators_
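Each entry of estimators_ is a fitted DecisionTreeRegressor, and the forest's prediction is the average of the individual trees' predictions. The sketch below checks this on a few rows. The forest size, the random_state and the use of the full dataset for fitting are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# One row of predictions per tree, for the first five samples
per_tree = np.stack([t.predict(X[:5]) for t in forest.estimators_])
print(per_tree.shape)

# Averaging over the trees reproduces the forest's own prediction
print(np.allclose(per_tree.mean(axis=0), forest.predict(X[:5])))
```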

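5. Visualize a single tree from the random forest:

import numpy as np
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn import tree
import pydot
from IPython.display import Image

# Write the first tree of the forest in DOT format to an in-memory buffer
dot_diabetes = StringIO()
tree.export_graphviz(rft.estimators_[0], out_file = dot_diabetes, feature_names = X_feature_names)
# Recent pydot versions return a list of graphs, so take the first one
graph = pydot.graph_from_dot_data(dot_diabetes.getvalue())[0]
Image(graph.create_png())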

6. Determine the feature importances:

rft.feature_importances_

7. Visualize the feature importances:

fig, ax = plt.subplots(figsize=(10,5))
bar_rects = ax.bar(np.arange(10), rft.feature_importances_,color='r',align='center')
ax.xaxis.set_ticks(np.arange(10))
ax.set_xticklabels(X_feature_names, rotation='vertical')

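The plot shows that body mass index (BMI), bl_4 (the fifth of the six blood serum measurements, counting from bl_0) and average blood pressure are the most influential features.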
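A sorted table is a useful companion to the bar chart when the bars are close in height. The sketch below labels the importances with the feature names used above and sorts them. The forest is refitted on the full dataset with a fixed random_state, which is an assumption for repeatability, so the values will not match an unseeded run exactly.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
feature_names = ['age', 'gender', 'body mass index', 'average blood pressure',
                 'bl_0', 'bl_1', 'bl_2', 'bl_3', 'bl_4', 'bl_5']

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
ranked = importances.sort_values(ascending=False)
print(ranked)
```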

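Bagging regression with nearest neighbors

Bagging is another ensemble method. It works by fitting many instances of a base estimator, each on a random subset of the training set. In this section the base estimator is k-nearest neighbors (KNN). The steps are:
1. Load the dataset and split it into training and test sets: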

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
X_feature_names = ['age', 'gender', 'body mass index', 'average blood pressure','bl_0','bl_1','bl_2','bl_3','bl_4','bl_5']
bins = 50*np.arange(8)
binned_y = np.digitize(y, bins)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=binned_y)

2. Import the required classes:

from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV

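3. Set up the parameter distributions:

# Parameters of the wrapped KNN use the 'estimator__' prefix
# (scikit-learn >= 1.2; older releases used 'base_estimator__')
param_dist = {
    'max_samples': [0.5,1.0],
    'max_features' : [0.5,1.0],
    'oob_score' : [True, False],
    'estimator__n_neighbors': [3,5],
    'n_estimators': [100]
}

4. Instantiate KNeighborsRegressor and pass it to BaggingRegressor as the base estimator:

single_estimator = KNeighborsRegressor()
# 'estimator' replaced the 'base_estimator' argument in scikit-learn 1.2
ensemble_estimator = BaggingRegressor(estimator = single_estimator)

5. Instantiate and run the randomized search:

pre_gs_inst_bag = RandomizedSearchCV(ensemble_estimator, param_distributions = param_dist, cv=3, n_iter = 5, n_jobs=-1)
pre_gs_inst_bag.fit(X_train, y_train)

6. Inspect the best parameters found by the randomized search:

pre_gs_inst_bag.best_params_

7. Train a BaggingRegressor with the best parameters, raising the number of estimators from the 100 used in the search to 1000:

rs_bag = BaggingRegressor(estimator=KNeighborsRegressor(n_neighbors=5), n_estimators=1000, max_samples=0.5, max_features=1.0, oob_score=True)
rs_bag.fit(X_train, y_train)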

8. Evaluate performance on the test set:

y_pred = rs_bag.predict(X_test)
from sklearn.metrics import r2_score, mean_absolute_error
print("R-squared", r2_score(y_test, y_pred))
print("MAE : ", mean_absolute_error(y_test, y_pred))
print("MAPE : ", (np.abs(y_test - y_pred)/y_test).mean())

This algorithm may not perform as well as others, but it can still be useful as one component of a stacking aggregator.
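To make the bagging idea from this section concrete, the sketch below hand-rolls it: draw random subsets of the training set with replacement, fit one KNN per subset, and average the predictions. This is a simplified illustration, not how BaggingRegressor is implemented internally. The number of models, the half-size subsets and the seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_models = 50                 # assumption: size of the hand-rolled ensemble
n_draw = len(X_train) // 2    # assumption: each subset is half the training set

member_preds = []
for _ in range(n_models):
    # Sample row indices with replacement, then fit one KNN on that subset
    idx = rng.choice(len(X_train), size=n_draw, replace=True)
    knn = KNeighborsRegressor(n_neighbors=5).fit(X_train[idx], y_train[idx])
    member_preds.append(knn.predict(X_test))

# The ensemble prediction is the mean over all members
bagged_pred = np.mean(member_preds, axis=0)
print("hand-rolled bagging MAE:", np.mean(np.abs(y_test - bagged_pred)))
```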

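In summary, tree algorithms and ensemble learning play an important role in machine learning. By choosing and tuning the algorithm sensibly, we can improve both the performance and the generalization of a model. We hope this article helps you understand and apply these techniques.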


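Performance comparison of the algorithms

To compare the algorithms on the diabetes dataset, the table below collects the error metrics from the decision tree regression, random forest regression and nearest-neighbor bagging regression sections:
| Algorithm | Mean absolute error (MAE) | Mean absolute percentage error (MAPE) | Coefficient of determination (R²) |
| --- | --- | --- | --- |
| Decision tree regression | 58.49 | 0.467 | - |
| Random forest regression | 48.54 | 0.428 | - |
| Bagging regression with nearest neighbors | 44.36 | 0.419 | 0.498 |

In this table, bagging with nearest neighbors does comparatively well on both MAE and MAPE, and the random forest has smaller errors than the single decision tree. This suggests that ensemble methods can improve model performance to some extent. Because the train/test splits above are not seeded, the exact figures will vary from run to run.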

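Choosing and tuning an algorithm

In practice, the right algorithm depends on the problem and the dataset. Some suggestions:
- Decision tree:
  - When to use: a good choice when the dataset is small, has few features, and the model needs to be easy to interpret.
  - Tuning advice: tune the tree's parameters with GridSearchCV or a similar method, for example the split criterion (gini or entropy) and the maximum depth, to avoid overfitting.
- Random forest:
  - When to use: random forests usually perform well on most datasets, especially with high-dimensional data and complex problems.
  - Tuning advice: try adjusting parameters such as the number of trees and the maximum number of features to improve performance further.
- Bagging regression with nearest neighbors:
  - When to use: when the base estimator (such as KNN) has high variance, bagging can reduce that variance and make the model more stable.
  - Tuning advice: find a good parameter combination with a randomized search or similar method, and increase the number of estimators as appropriate.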
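As an illustration of the random forest tuning advice, the sketch below runs a small randomized search over the number of trees, the maximum number of features and the maximum depth. The candidate values, the three-fold cross-validation and the name rs_rf are assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# assumption: a deliberately small search space
param_dist_rf = {
    'n_estimators': [50, 100, 200],
    'max_features': [0.3, 0.5, 1.0],
    'max_depth': [3, 5, None],
}
rs_rf = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                           param_distributions=param_dist_rf,
                           n_iter=5, cv=3,
                           scoring='neg_mean_absolute_error',
                           random_state=0)
rs_rf.fit(X, y)

# best_score_ is the negated MAE, so flip the sign for reporting
print(rs_rf.best_params_, -rs_rf.best_score_)
```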

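Workflow summary

The workflow shared by the different algorithms can be summarized with a mermaid flowchart.

graph LR
    classDef startend fill:#F5EBFF,stroke:#BE8FED,stroke-width:2px
    classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px
    classDef decision fill:#FFF6CC,stroke:#FFBC52,stroke-width:2px

    A([Start]):::startend --> B(Load the dataset):::process
    B --> C(Split into training and test sets):::process
    C --> D{Choose an algorithm}:::decision
    D -->|Decision tree regression| E(Instantiate the decision tree regressor):::process
    D -->|Random forest regression| F(Instantiate the random forest regressor):::process
    D -->|Bagging regression with nearest neighbors| G(Instantiate KNN and the bagging regressor):::process
    E --> H(Train the model):::process
    F --> H
    G --> H
    H --> I(Evaluate the model on the test set):::process
    I --> J(Output the error metrics):::process
    J --> K([End]):::startend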

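Code reuse and extension

In practice we can wrap the code above in functions so that it is easier to reuse and extend. The following simple example wraps the decision tree regression code: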

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def decision_tree_regression():
    # Load the dataset
    diabetes = load_diabetes()
    X = diabetes.data
    y = diabetes.target

    # Split the dataset, stratifying on a binned copy of the target
    bins = 50*np.arange(8)
    binned_y = np.digitize(y, bins)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=binned_y)

    # Instantiate the decision tree regressor and train it
    dtr = DecisionTreeRegressor()
    dtr.fit(X_train, y_train)

    # Evaluate the model
    y_pred = dtr.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mape = (np.abs(y_test - y_pred)/(y_test)).mean()

    return mae, mape

# Call the function
mae, mape = decision_tree_regression()
print(f"Mean absolute error: {mae}")
print(f"Mean absolute percentage error: {mape}")

Wrapping the code this way makes the function easy to reuse across projects and to extend as needed, for example by adding hyperparameter tuning.
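One possible extension is to pass the estimator in as an argument, so the same function can evaluate any regressor. The sketch below uses a hypothetical helper, evaluate_regressor. Its name, its parameters and the fixed random_state are assumptions for illustration; the binned stratification follows the code above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def evaluate_regressor(estimator, test_size=0.2, random_state=0):
    """Fit a copy of `estimator` on the diabetes data and return (MAE, MAPE)."""
    X, y = load_diabetes(return_X_y=True)
    binned_y = np.digitize(y, 50 * np.arange(8))
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=binned_y,
        random_state=random_state)
    # clone() leaves the caller's estimator unfitted and reusable
    model = clone(estimator).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mape = np.mean(np.abs(y_test - y_pred) / y_test)
    return mae, mape

candidates = [
    ("decision tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("random forest", RandomForestRegressor(n_estimators=100, random_state=0)),
]
for name, est in candidates:
    mae, mape = evaluate_regressor(est)
    print(f"{name}: MAE={mae:.2f}, MAPE={mape:.3f}")
```

Fixing random_state means every candidate is scored on the same split, which makes the comparison between models fairer than re-splitting for each one.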

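Summary

This article covered tree algorithms and ensemble learning for classification and regression: tuning a decision tree, decision tree regression, random forest regression, and bagging regression with nearest neighbors. Through concrete steps and code examples, it showed how to apply these algorithms to real problems and compared their performance. In practice, choose the algorithm that fits the problem and the dataset, and improve it through hyperparameter tuning. We hope this article helps you understand and apply tree algorithms and ensemble learning.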