Machine Learning in Practice: Decision Tree Classification Explained (Based on the Nyandwi Project)



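Nyandwi/machine_learning_complete is a Python code repository covering a range of machine learning algorithms. It is aimed at developers interested in machine learning and Python who want to put those algorithms to use. Project address: https://gitcode.com/gh_mirrors/ma/machine_learning_complete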

Overview of Decision Tree Classification

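Decision trees are a powerful supervised learning algorithm that can be used for both classification and regression. Unlike "black box" models such as neural networks, decision trees are highly interpretable and produce clear decision rules. Nyandwi's machine learning project covers decision trees as one of its foundational algorithms, with both explanation and hands-on practice.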

Seven Key Characteristics of Decision Trees

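No feature scaling required: decision trees are insensitive to the scale of numeric features, so raw values can be used directly
Native support for categorical features: in principle, a tree can split directly on text-valued categories (although the Scikit-Learn implementation requires them to be encoded as numbers)
Highly interpretable: the decision process can be displayed directly, unlike deep learning models, which are hard to inspect
Tolerant of missing values: some decision tree implementations can handle missing values directly, though this depends on the library and version
Handles class imbalance: adjusting class weights helps with imbalanced datasets
Provides feature importances: the contribution of each feature to the model can be estimated
Foundation for ensemble learning: decision trees are the building blocks of stronger algorithms such as random forests and gradient-boosted trees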
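
The first point in the list can be checked empirically. The following is a minimal sketch, not part of the original project: it uses the Iris dataset bundled with Scikit-Learn (an assumption made so the snippet runs offline) and hypothetical per-feature scale factors, then compares the predictions of a tree trained on raw values with one trained on rescaled values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Hypothetical per-feature scale factors (powers of two keep the rescaling exact).
scale = np.array([1.0, 1024.0, 0.125, 64.0])
X_scaled = X * scale

# Same hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)

# The split thresholds move with the scale, but the resulting partitions do not.
same_predictions = bool(np.array_equal(pred_raw, pred_scaled))
print("Identical predictions:", same_predictions)
```

Because a split only asks whether a value lies above or below a threshold, any positive rescaling of a feature shifts the threshold without changing which samples fall on each side.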

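A decision tree works like a series of if/else questions. When deciding which car to buy, for example, we evaluate attributes such as safety, number of seats, and number of doors, and these checks form a chain of decision logic.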
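
To make the if/else analogy concrete, here is a small hand-written sketch. The function name `car_acceptable` and its rules are hypothetical and chosen only for illustration; a trained tree would learn its own questions and thresholds from data.

```python
def car_acceptable(safety: str, persons: str, buying: str) -> str:
    """Hand-written, hypothetical rules that mimic the shape of a decision tree."""
    if safety == "low":        # first question: is the car unsafe?
        return "N"
    if persons == "2":         # second question: does it seat only two people?
        return "N"
    if buying == "vhigh":      # third question: is the purchase price very high?
        return "N"
    return "P"                 # every check passed


print(car_acceptable("high", "4", "med"))  # prints "P"
print(car_acceptable("low", "4", "low"))   # prints "N"
```

Each `if` corresponds to an internal node of a tree and each `return` to a leaf.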

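Hands-On Project: Predicting Car Acceptability

Dataset Overview

This example uses the car evaluation dataset available on OpenML to predict whether a car is acceptable. The dataset's columns are: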

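buying: purchase price (vhigh, high, med, low)
maint: maintenance cost (vhigh, high, med, low)
doors: number of doors (2, 3, 4, 5more)
persons: passenger capacity (2, 4, more)
lug_boot: luggage boot size (small, med, big)
safety: safety rating (low, med, high)
binaryClass: target variable, car acceptability (P/N)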

Data Preparation

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load the car evaluation dataset from OpenML as a pandas DataFrame
car_data = fetch_openml(name='car', version=2, as_frame=True)
car_df = car_data.frame

# Hold out 10% of the rows as a test set
train_data, test_data = train_test_split(car_df, test_size=0.1, random_state=20)

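Exploratory Analysis

Check the basic properties of the data:

# Summary of each column (for categorical columns: count, unique, top, freq)
print(train_data.describe())

# Count missing values per column
print(train_data.isnull().sum())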
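
Since class imbalance is one of the concerns listed earlier, it can also help to look at the class distribution of the target. The sketch below uses a small hypothetical DataFrame named `toy` in place of `train_data`, so that it runs without downloading anything; the same two calls should apply to the real training data.

```python
import pandas as pd

# Toy stand-in for train_data (hypothetical rows), so the snippet runs offline.
toy = pd.DataFrame({
    "safety": ["low", "high", "med", "high", "low", "med"],
    "binaryClass": ["N", "P", "N", "P", "N", "N"],
})

# Share of each class in the target column.
class_share = toy["binaryClass"].value_counts(normalize=True)
print(class_share)

# Cross-tabulate one feature against the target to see how they relate.
safety_vs_target = pd.crosstab(toy["safety"], toy["binaryClass"])
print(safety_vs_target)
```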

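Data Preprocessing

Decision trees can in principle split on categorical features directly, but the Scikit-Learn implementation requires text features to be converted to numbers: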

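from sklearn.preprocessing import OrdinalEncoder

# Define the feature columns and the label column
features = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']
target = 'binaryClass'

# Encode categorical features as numbers.
# By default, OrdinalEncoder assigns codes in sorted order of the category
# values, which is not necessarily their semantic order (low < med < high).
encoder = OrdinalEncoder()
X_train = encoder.fit_transform(train_data[features])
y_train = train_data[target].values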
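
If the numeric codes should follow the natural order of the categories, the order can be passed to `OrdinalEncoder` explicitly. The following is a sketch on a hypothetical two-column DataFrame named `toy`; the category lists mirror the values described in the dataset overview, and `unknown_value=-1` is an arbitrary choice for categories not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows using two of the dataset's columns.
toy = pd.DataFrame({
    "buying": ["low", "vhigh", "med", "high"],
    "safety": ["high", "low", "med", "low"],
})

# One list of categories per column, in the order the codes should follow.
ordered_encoder = OrdinalEncoder(
    categories=[
        ["low", "med", "high", "vhigh"],  # buying
        ["low", "med", "high"],           # safety
    ],
    handle_unknown="use_encoded_value",   # map unseen categories to unknown_value
    unknown_value=-1,
)

encoded = ordered_encoder.fit_transform(toy)
print(encoded)
```

With this configuration, "low" maps to 0 in both columns and "vhigh" maps to 3 in `buying`, so a threshold split such as `buying <= 1.5` separates the cheaper cars from the more expensive ones.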

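Model Training

Use Scikit-Learn's decision tree classifier:

from sklearn.tree import DecisionTreeClassifier

# Initialize a shallow tree; random_state makes the result reproducible
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train the model
tree_clf.fit(X_train, y_train)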

Model Evaluation

Evaluate the model's performance:

from sklearn.metrics import accuracy_score, classification_report

# Preprocess the test set with the encoder fitted on the training data
X_test = encoder.transform(test_data[features])
y_test = test_data[target].values

# Predict
y_pred = tree_clf.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
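
Accuracy and the per-class figures in the classification report are all derived from the confusion matrix. As a minimal illustration, the sketch below uses hypothetical label lists `y_true_demo` and `y_pred_demo` in place of `y_test` and `y_pred`, so the numbers are not results from the car dataset.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, standing in for y_test and y_pred.
y_true_demo = ["N", "N", "N", "P", "P", "N", "P", "N"]
y_pred_demo = ["N", "N", "P", "P", "N", "N", "P", "N"]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["N", "P"])
print(cm)
# [[4 1]
#  [1 2]]

# Accuracy is the share of samples on the diagonal.
accuracy = cm.trace() / cm.sum()
print("Accuracy:", accuracy)  # 0.75
```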

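Visualizing the Decision Tree

Inspect the model's decision process:

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
# class_names must follow the order of tree_clf.classes_ (sorted labels: 'N', 'P')
plot_tree(tree_clf, feature_names=features,
          class_names=['N', 'P'], filled=True)
plt.show()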
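
When a plot is inconvenient, for example in a terminal or a log file, the same rules can be printed as text with `export_text` from `sklearn.tree`. The sketch below trains a small tree on the bundled Iris dataset (an assumption, so the snippet runs offline); with the car model, the call would presumably take `tree_clf` and `features` instead.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small demo tree on a dataset that ships with scikit-learn.
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Render the learned splits as indented if/else-style rules.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```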

Feature Importance Analysis

importances = tree_clf.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first

print("Feature importance ranking:")
for idx in indices:
    print(f"{features[idx]}: {importances[idx]:.4f}")
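
In Scikit-Learn, `feature_importances_` is based on how much each feature's splits reduce the impurity criterion, which is Gini impurity by default. The helper functions `gini` and `impurity_decrease` below are hypothetical names written for this illustration; they show the calculation for a single split on made-up labels and are not the library's internal code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Made-up labels: a perfectly mixed node split into two pure children.
parent = ["N", "N", "N", "P", "P", "P"]
left, right = ["N", "N", "N"], ["P", "P", "P"]

print(gini(parent))                            # 0.5
print(impurity_decrease(parent, left, right))  # 0.5
```

A feature whose splits produce large decreases like this one, over many samples, ends up with a high importance score.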

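Model Optimization

Common ways to prevent overfitting:

Limit tree depth: controlled by the max_depth parameter
Set a minimum number of samples per leaf: min_samples_leaf
Pruning: the ccp_alpha parameter
Cross-validation: search for the best parameters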

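from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter
params = {
    'max_depth': [3, 5, 7, None],
    'min_samples_leaf': [1, 2, 3]
}

# 5-fold cross-validated search; a fixed random_state keeps runs comparable
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           params,
                           cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)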
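
The grid above does not cover ccp_alpha, the pruning parameter from the list of methods. One possible way to choose it, sketched here on the bundled Iris dataset rather than the car data (an assumption, so the snippet runs offline), is to take candidate values from `cost_complexity_pruning_path` and cross-validate over them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Effective alphas at which the fully grown tree would lose a subtree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_demo, y_demo)

# Clip at zero in case floating-point error yields a tiny negative alpha.
alphas = sorted(set(max(float(a), 0.0) for a in path.ccp_alphas))

# Cross-validate one pruned tree per candidate alpha.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"ccp_alpha": alphas},
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best ccp_alpha:", search.best_params_["ccp_alpha"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
```

Larger values of ccp_alpha prune more aggressively, so the largest candidate reduces the tree to a single node; cross-validation picks a value between the two extremes.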