Decision Trees in Practice: From a Toy Dataset to the UCI Adult Dataset
mlcourse.ai Open Machine Learning Course project page: https://gitcode.com/gh_mirrors/ml/mlcourse.ai
Decision trees are among the most fundamental yet powerful algorithms in machine learning. Based on Assignment #3 of the mlcourse.ai project, this article explains how decision trees work and demonstrates how to apply them to classification problems through two hands-on examples.
Part 1: The Toy Dataset "Will They? Won't They?"
Problem statement
We have a playful binary classification problem: predict whether person A will go on a second date with person B. The prediction is based on four features:
- Looks: handsome or repulsive
- Alcoholic_beverage: yes or no
- Eloquence: high, average, or low
- Money_spent: lots or little
Data preparation
First, let's create the training and test sets:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Training data (built as a DataFrame so we can call .drop() on it later)
df_train = pd.DataFrame({
    "Looks": ["handsome", "handsome", "handsome", "repulsive", "repulsive", "repulsive", "handsome"],
    "Alcoholic_beverage": ["yes", "yes", "no", "no", "yes", "yes", "yes"],
    "Eloquence": ["high", "low", "average", "average", "low", "high", "average"],
    "Money_spent": ["lots", "little", "lots", "little", "lots", "lots", "lots"],
    "Will_go": LabelEncoder().fit_transform(["+", "-", "+", "-", "-", "+", "+"]),
})
# Test data
df_test = pd.DataFrame({
    "Looks": ["handsome", "handsome", "repulsive"],
    "Alcoholic_beverage": ["no", "yes", "yes"],
    "Eloquence": ["average", "high", "average"],
    "Money_spent": ["lots", "little", "lots"],
})
Decision tree fundamentals
Before building a tree, we need to understand a couple of key concepts:
- Entropy: a measure of a system's uncertainty. For a binary classification problem it is $$S = -p_0 \log_2 p_0 - p_1 \log_2 p_1$$ where $p_0$ and $p_1$ are the proportions of the two classes.
- Information gain: how much splitting on a given feature improves the classification: $$IG = S_{\text{parent}} - \sum_{i} \frac{N_i}{N} S_{\text{child}_i}$$
A worked example by hand
Let's compute the entropy of the initial system manually.
The initial system has 7 samples: 4 positive ("+") and 3 negative ("-"). (LabelEncoder sorts its labels, so "+" actually encodes to 0 and "-" to 1; entropy is symmetric in the two classes, so the numbers below are unaffected.)
Initial entropy: $S_0 = -\frac{4}{7} \log_2 \frac{4}{7} - \frac{3}{7} \log_2 \frac{3}{7} \approx 0.985$
If we split on the "Looks_handsome" feature:
- Left group (handsome): 4 samples (3 positive, 1 negative), $S_1 \approx 0.811$
- Right group (not handsome): 3 samples (1 positive, 2 negative), $S_2 \approx 0.918$
Information gain: $IG = S_0 - (\frac{4}{7}S_1 + \frac{3}{7}S_2) \approx 0.985 - 0.857 = 0.128$
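As a quick sanity check, here is a minimal sketch that reproduces these numbers with math.log2 (the helper binary_entropy is my own, not part of the assignment):
from math import log2

def binary_entropy(p):
    # Entropy of a two-class system with class proportions p and 1 - p
    return -p * log2(p) - (1 - p) * log2(1 - p)

s0 = binary_entropy(4 / 7)             # ≈ 0.985, the initial system
s1 = binary_entropy(3 / 4)             # ≈ 0.811, the handsome group
s2 = binary_entropy(1 / 3)             # ≈ 0.918, the non-handsome group
print(s0 - (4 / 7 * s1 + 3 / 7 * s2))  # ≈ 0.128, the information gain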
Building the tree with sklearn
from sklearn.tree import DecisionTreeClassifier

# sklearn trees cannot split on raw string categories,
# so one-hot encode the features first
X_train = pd.get_dummies(df_train.drop("Will_go", axis=1))
y_train = df_train["Will_go"]

# Create a decision tree with maximum depth 3
tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(X_train, y_train)
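To score the three test candidates, the test set must be encoded with the same dummy columns as the training set. A minimal sketch, using reindex to align the columns (one common approach, not the only one):
# One-hot encode the test set and align it with the training columns;
# any category missing from the test data becomes an all-zero column
X_test_toy = pd.get_dummies(df_test).reindex(columns=X_train.columns, fill_value=0)
print(tree.predict(X_test_toy))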
Part 2: Implementing entropy and information gain
To better understand how decision trees work internally, let's implement the entropy and information-gain computations ourselves:
import numpy as np

def entropy(a_list):
    """Compute the entropy of a list of integer class labels."""
    counts = np.bincount(a_list)
    probs = counts / len(a_list)
    return -np.sum([p * np.log2(p) for p in probs if p > 0])

def information_gain(root, left, right):
    """Compute the information gain of splitting root into left and right."""
    H_root = entropy(root)
    H_left = entropy(left)
    H_right = entropy(right)
    gain = H_root - (len(left) / len(root) * H_left + len(right) / len(root) * H_right)
    return gain
A quick test:
balls = [1 for i in range(9)] + [0 for i in range(11)]       # 9 blue, 11 yellow
balls_left = [1 for i in range(8)] + [0 for i in range(5)]   # 8 blue, 5 yellow
balls_right = [1 for i in range(1)] + [0 for i in range(6)]  # 1 blue, 6 yellow
print(entropy(balls))        # 0.9927744539878084
print(entropy(balls_left))   # 0.961236604722876
print(entropy(balls_right))  # 0.5916727785823275
print(information_gain(balls, balls_left, balls_right))  # ≈ 0.1609
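With these two functions we can mechanically pick the best feature to split on, which is exactly what a decision tree does at every node. A minimal sketch, assuming a DataFrame of binary 0/1 (one-hot) feature columns; the function name best_split is my own:
def best_split(X, y):
    """Among the binary feature columns of X, find the split with the
    highest information gain on the label Series y."""
    best_gain, best_feature = 0.0, None
    for feature in X.columns:
        mask = X[feature] == 1
        left, right = y[mask].tolist(), y[~mask].tolist()
        if not left or not right:
            continue  # a split must produce two non-empty groups
        gain = information_gain(y.tolist(), left, right)
        if gain > best_gain:
            best_gain, best_feature = gain, feature
    return best_feature, best_gain

# On the one-hot encoded toy training set from Part 1:
print(best_split(X_train, y_train))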
Part 3: Hands-on with the UCI Adult dataset
Dataset overview
The UCI Adult dataset is a classic classification benchmark: the goal is to predict from census data whether a person's annual income exceeds $50K. Its features include:
- Continuous features: age, years of education, capital gain, capital loss, hours worked per week, etc.
- Categorical features: work class, education level, marital status, occupation, race, sex, etc.
Data preprocessing
- Read the data and handle missing values
- One-hot encode the categorical features
- Make sure the training and test sets have identical feature columns (see the alignment sketch after the code below)
# Read the data
data_train = pd.read_csv("adult_train.csv", sep=";")
data_test = pd.read_csv("adult_test.csv", sep=";")

# Derive the column lists by dtype
# (exclude the target column first if it is stored as a string)
categorical_columns = [c for c in data_train.columns if data_train[c].dtype.name == "object"]
numerical_columns = [c for c in data_train.columns if data_train[c].dtype.name != "object"]

# Fill missing values: mode for categorical columns, median for numerical ones;
# the fill statistics come from the training set only, to avoid leakage
for c in categorical_columns:
    data_train[c] = data_train[c].fillna(data_train[c].mode()[0])
    data_test[c] = data_test[c].fillna(data_train[c].mode()[0])
for c in numerical_columns:
    data_train[c] = data_train[c].fillna(data_train[c].median())
    data_test[c] = data_test[c].fillna(data_train[c].median())

# One-hot encode the categorical features
data_train = pd.concat([data_train[numerical_columns],
                        pd.get_dummies(data_train[categorical_columns])], axis=1)
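The test set must be encoded the same way, and any category present in only one of the two splits has to be reconciled; in this dataset, the native-country value Holand-Netherlands appears only in the training split. A minimal sketch using reindex, assuming the target column has already been separated out:
# Encode the test set and align it with the training columns,
# adding any missing dummy column as all zeros
data_test = pd.concat([data_test[numerical_columns],
                       pd.get_dummies(data_test[categorical_columns])], axis=1)
data_test = data_test.reindex(columns=data_train.columns, fill_value=0)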
Training and tuning the decision tree
- A baseline decision tree (no tuning):
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(max_depth=3, random_state=17)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.3f}")
- Tuning with grid search:
from sklearn.model_selection import GridSearchCV
tree_params = {'max_depth': range(2, 11)}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=17),
tree_params, cv=5)
grid_search.fit(X_train, y_train)
best_tree = grid_search.best_estimator_
best_predictions = best_tree.predict(X_test)
print(f"Best params: {grid_search.best_params_}")
print(f"Best accuracy: {accuracy_score(y_test, best_predictions):.3f}")
Going further: random forests
A random forest builds many decision trees and aggregates their predictions, which usually yields better performance:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=17)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_predictions):.3f}")
Summary
This article walked through decision trees in three parts:
- Understanding the basic concepts and mechanics on a toy dataset
- Implementing the entropy and information-gain computations by hand
- Applying a decision tree to the real UCI Adult dataset and tuning its parameters
Decision trees are simple, but with proper tuning and ensemble methods such as random forests they can achieve quite respectable classification performance. Understanding how these foundational algorithms work is essential for mastering more complex machine-learning models.