Hi everyone! In the previous post we introduced what a decision tree is (👇 click to review). Today we'll look at the three classic algorithms of the decision tree family: CART, ID3, and C4.5. Like three martial arts masters 🥷, each has its own signature moves, and all three hold an important place in the machine learning world. Let's lift the veil on them one by one!
📚 Quick Comparison of the Three Decision Tree Algorithms (Cheat Sheet)
| Property | ID3 | C4.5 | CART |
| --- | --- | --- | --- |
| Task type | Classification | Classification | Classification + regression |
| Split criterion | Information gain | Gain ratio | Gini index / mean squared error |
| Tree structure | Multiway tree | Multiway tree | Binary tree |
| Continuous values | ❌ | ✅ | ✅ |
| Missing values | ❌ | ✅ | ✅ |
| Mainstream tooling | Rare | Rare | sklearn |
| Pruning | Not supported | Supported | Supported |
| Feature types | Discrete only | Discrete + continuous | Discrete + continuous |
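A practical note on the last rows: scikit-learn's tree module implements CART, but its split criterion is configurable, so you can get entropy-based splits (in the spirit of ID3/C4.5) without leaving sklearn. A minimal illustration using the standard scikit-learn API (parameter names as of recent versions):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# CART-style classification tree with the Gini index (the default criterion)
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3)
# Same binary-tree structure, but entropy-based splits, closer in spirit to ID3/C4.5
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3)
# Regression tree: minimizes the squared error within each leaf
reg = DecisionTreeRegressor(criterion='squared_error', max_depth=3)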
📊 The Three Decision Tree Algorithms in Detail
⚙️ 1. ID3: The Information Gain Pioneer
Definition: ID3 is one of the earliest decision tree algorithms. Its core idea is to split on the feature with the largest information gain. ID3 can only build decision trees over datasets with discrete attributes.
Principle: ID3 measures the impurity of a dataset $D$ with its entropy and picks the feature whose split reduces entropy the most:

$$H(D) = -\sum_{k} p_k \log_2 p_k, \qquad \mathrm{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$

where $H(D)$ is the entropy (degree of disorder) of dataset $D$, and $D_v$ is the subset of samples whose feature $A$ takes value $v$.
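As a quick worked example with the 14-sample weather dataset used in the code below (9 "Yes" and 5 "No" labels):

$$H(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$

Splitting on Outlook yields Sunny (2 Yes / 3 No), Overcast (4 Yes / 0 No) and Rain (3 Yes / 2 No), so

$$\mathrm{Gain}(D, \text{Outlook}) \approx 0.940 - \left(\tfrac{5}{14}\cdot 0.971 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971\right) \approx 0.247$$

which is the largest gain among the four features, so Outlook becomes the root split.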
Characteristics:
- 🚫 Only handles discrete features
- 🚫 Biased toward features with many distinct values
- 🚫 No pruning, so it overfits easily
Example scenarios:
- Weather forecasting (sunny / rain / overcast)
- Fruit classification (apple / banana / orange)
💻 A Simple ID3 Implementation:
import numpy as np
from collections import Counter
class ID3DecisionTree:
    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.tree = None

    def _entropy(self, y):
        """Entropy (degree of disorder) of a label vector."""
        counts = np.bincount(y)
        probs = counts / len(y)
        return -np.sum([p * np.log2(p) for p in probs if p > 0])

    def _best_split(self, X, y, feature_types):
        """Pick the discrete feature with the largest information gain."""
        best_gain = -1
        best_feature = None
        current_entropy = self._entropy(y)
        for feature in range(X.shape[1]):
            if feature_types[feature] == 'continuous':  # ID3 cannot handle continuous features
                continue
            values = np.unique(X[:, feature])
            if len(values) < 2:
                continue
            # Multiway split: weighted entropy over every value of this feature
            weighted_entropy = 0.0
            for value in values:
                subset = y[X[:, feature] == value]
                weighted_entropy += (len(subset) / len(y)) * self._entropy(subset)
            gain = current_entropy - weighted_entropy
            if gain > best_gain:
                best_gain = gain
                best_feature = feature
        return best_feature

    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively grow the decision tree."""
        n_samples, _ = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        # Stop on the depth limit, too few samples, or a pure node
        if depth >= self.max_depth or n_samples < 2 or n_classes == 1:
            return {'leaf': True, 'value': majority}
        feature = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        subtrees = {}
        for value in np.unique(X[:, feature]):
            indices = X[:, feature] == value
            subtrees[value] = self._build_tree(X[indices], y[indices], feature_types, depth + 1)
        return {
            'leaf': False,
            'feature': feature,
            'subtrees': subtrees,
            'default': majority,  # fallback class for feature values unseen at this node
        }

    def fit(self, X, y, feature_types):
        """Train the decision tree."""
        self.tree = self._build_tree(X, y, feature_types)

    def _predict_sample(self, x, tree):
        """Predict a single sample."""
        if tree['leaf']:
            return tree['value']
        feature_value = x[tree['feature']]
        if feature_value not in tree['subtrees']:
            # Unseen feature value: fall back to the majority class at this node
            return tree['default']
        return self._predict_sample(x, tree['subtrees'][feature_value])

    def predict(self, X):
        """Predict a batch of samples."""
        return np.array([self._predict_sample(x, self.tree) for x in X])
# Example usage: the classic "play tennis" weather dataset
data = np.array([
['Sunny', 'Hot', 'High', 'Weak', 'No'],
['Sunny', 'Hot', 'High', 'Strong', 'No'],
['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Strong', 'No'],
['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
['Sunny', 'Mild', 'High', 'Weak', 'No'],
['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Strong', 'No']
])
X = data[:, :-1]
y = np.array([1 if label == 'Yes' else 0 for label in data[:, -1]])
feature_types = ['discrete'] * 4  # all four features are discrete
tree = ID3DecisionTree(max_depth=3)
tree.fit(X, y, feature_types)
# Try a prediction
test_sample = np.array(['Sunny', 'Cool', 'High', 'Strong'])
prediction = tree.predict([test_sample])
print(f"预测结果: {'Yes' if prediction[0] == 1 else 'No'}") # 输出: No
📏 2. C4.5: The Gain-Ratio Upgrade to ID3
Definition: C4.5 is an improved version of ID3 that uses the information gain ratio to correct the bias toward many-valued features. It addresses several ID3 shortcomings:
- Supports continuous features (via binary splits on a threshold)
- Uses the gain ratio instead of raw information gain (avoids favoring features with many values)
- Supports pruning (to reduce overfitting)
Principle: C4.5 normalizes the information gain by the feature's intrinsic value:

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{IV(A)}, \qquad IV(A) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$$

where $IV(A)$ is the intrinsic value of feature $A$; it grows with the number of distinct values, so many-valued features are penalized.
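Continuing the weather example from the ID3 section: Outlook takes the values Sunny, Overcast and Rain on 5, 4 and 5 of the 14 samples, so roughly

$$IV(\text{Outlook}) = -\tfrac{5}{14}\log_2\tfrac{5}{14} - \tfrac{4}{14}\log_2\tfrac{4}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 1.577, \qquad \mathrm{GainRatio}(D, \text{Outlook}) \approx \frac{0.247}{1.577} \approx 0.16$$

A feature with one distinct value per sample (such as an ID column) would have a huge $IV$, so its gain ratio collapses even though its raw information gain is maximal.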
Characteristics:
- ✅ Supports continuous features
- ✅ Uses the information gain ratio
- ✅ Supports pruning
- ✅ Produces a multiway tree
Example scenarios:
- Medical diagnosis (symptoms plus numeric measurements)
- Credit scoring (categorical features plus continuous ones such as income)
💻 C4.5: Key Improvements in Code:
class C45DecisionTree(ID3DecisionTree):
    def __init__(self, max_depth=3):
        super().__init__(max_depth)

    def _information_value(self, X, feature):
        """Intrinsic value IV(a) of a discrete feature."""
        _, counts = np.unique(X[:, feature], return_counts=True)
        probs = counts / len(X)
        return -np.sum(probs * np.log2(probs))

    def _best_split(self, X, y, feature_types):
        """Pick the feature (and threshold, for continuous ones) with the largest gain ratio."""
        best_gain_ratio = -1
        best_feature = None
        best_threshold = None
        current_entropy = self._entropy(y)
        for feature in range(X.shape[1]):
            if feature_types[feature] == 'continuous':
                # Continuous feature: candidate thresholds are midpoints between adjacent sorted values
                column = X[:, feature].astype(float)
                unique_values = np.unique(column)
                thresholds = (unique_values[:-1] + unique_values[1:]) / 2
                for threshold in thresholds:
                    left = column <= threshold
                    right = ~left
                    if left.sum() == 0 or right.sum() == 0:
                        continue
                    p_left, p_right = left.mean(), right.mean()
                    gain = current_entropy \
                        - p_left * self._entropy(y[left]) \
                        - p_right * self._entropy(y[right])
                    # Intrinsic value of the induced binary partition
                    iv = -(p_left * np.log2(p_left) + p_right * np.log2(p_right))
                    gain_ratio = gain / iv if iv > 0 else 0
                    if gain_ratio > best_gain_ratio:
                        best_gain_ratio = gain_ratio
                        best_feature = feature
                        best_threshold = threshold
            else:
                # Discrete feature: multiway split over all of its values
                values = np.unique(X[:, feature])
                if len(values) < 2:
                    continue
                weighted_entropy = 0.0
                for value in values:
                    subset = y[X[:, feature] == value]
                    weighted_entropy += (len(subset) / len(y)) * self._entropy(subset)
                gain = current_entropy - weighted_entropy
                iv = self._information_value(X, feature)
                gain_ratio = gain / iv if iv > 0 else 0
                if gain_ratio > best_gain_ratio:
                    best_gain_ratio = gain_ratio
                    best_feature = feature
                    best_threshold = None
        return best_feature, best_threshold

    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively grow the tree (continuous features get binary splits)."""
        n_samples, _ = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        if depth >= self.max_depth or n_samples < 2 or n_classes == 1:
            return {'leaf': True, 'value': majority}
        feature, threshold = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        if feature_types[feature] == 'continuous':
            # Continuous feature: binary split at the chosen threshold
            left = X[:, feature].astype(float) <= threshold
            right = ~left
            return {
                'leaf': False,
                'feature': feature,
                'threshold': threshold,
                'left': self._build_tree(X[left], y[left], feature_types, depth + 1),
                'right': self._build_tree(X[right], y[right], feature_types, depth + 1),
            }
        else:
            # Discrete feature: one branch per value (multiway split)
            subtrees = {}
            for value in np.unique(X[:, feature]):
                indices = X[:, feature] == value
                subtrees[value] = self._build_tree(X[indices], y[indices], feature_types, depth + 1)
            return {
                'leaf': False,
                'feature': feature,
                'subtrees': subtrees,
                'default': majority,  # fallback class for feature values unseen at this node
            }

    def _predict_sample(self, x, tree):
        """Predict a single sample, handling both threshold and multiway nodes."""
        if tree['leaf']:
            return tree['value']
        if 'threshold' in tree:  # continuous-feature node
            branch = 'left' if float(x[tree['feature']]) <= tree['threshold'] else 'right'
            return self._predict_sample(x, tree[branch])
        value = x[tree['feature']]
        if value not in tree['subtrees']:
            return tree['default']
        return self._predict_sample(x, tree['subtrees'][value])
# Usage example: the data format needs to change so that it mixes discrete and continuous features
# (a small illustrative sketch follows below)
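As an illustration, here is a minimal, hypothetical usage sketch for the class above, with one discrete column and one continuous column (the feature names and data are made up purely for demonstration):

# Hypothetical toy data: loan approval from employment type (discrete)
# and income in thousands (continuous)
X_mixed = np.array([
    ['Salaried', 30.0],
    ['Salaried', 55.0],
    ['SelfEmployed', 22.0],
    ['SelfEmployed', 80.0],
    ['Salaried', 18.0],
    ['SelfEmployed', 40.0],
], dtype=object)
y_mixed = np.array([1, 1, 0, 1, 0, 0])
feature_types_mixed = ['discrete', 'continuous']

c45 = C45DecisionTree(max_depth=3)
c45.fit(X_mixed, y_mixed, feature_types_mixed)
print(c45.predict(np.array([['Salaried', 60.0]], dtype=object)))  # predicted class for a new applicant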
🎯 3. CART: The Versatile Binary Tree
Definition: CART (Classification And Regression Trees) uses the Gini index to grow binary trees. It is a powerful decision tree algorithm whose main traits are:
- It always produces a binary tree
- Classification uses the Gini index; regression uses the squared error
- It supports pruning
Principle: for classification, CART chooses the binary split that minimizes the weighted Gini impurity of the two children; for regression, it minimizes the weighted squared error (each leaf predicts the mean of its samples).
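The standard definitions (the same ones used in the code below) are

$$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2, \qquad \mathrm{Gini}(D, A) = \frac{|D_L|}{|D|}\,\mathrm{Gini}(D_L) + \frac{|D_R|}{|D|}\,\mathrm{Gini}(D_R)$$

where $D_L$ and $D_R$ are the two halves produced by a candidate binary split. For the 9-Yes / 5-No weather data above, for example, $\mathrm{Gini}(D) = 1 - (9/14)^2 - (5/14)^2 \approx 0.459$.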
Characteristics:
- ✅ Binary tree structure
- ✅ Supports both classification and regression
- ✅ Numerically stable
- ✅ Widely used in ensemble learning (e.g. random forests)
Example scenarios:
- House price prediction (regression)
- Customer segmentation (classification)
💻 A CART Classification Tree Implementation:
class CARTClassifier:
    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def _gini(self, y):
        """Gini impurity of a label vector."""
        counts = np.bincount(y)
        probs = counts / len(y)
        return 1 - np.sum(probs ** 2)

    def _best_split(self, X, y, feature_types):
        """Find the binary split (feature plus threshold or value) with the lowest weighted Gini."""
        best_gini_index = float('inf')
        best_feature = None
        best_threshold = None    # used for continuous features
        best_split_value = None  # used for discrete features ("value vs. rest" split)
        for feature in range(X.shape[1]):
            if feature_types[feature] == 'continuous':
                # Continuous feature: candidate thresholds are midpoints between adjacent sorted values
                column = X[:, feature].astype(float)
                unique_values = np.unique(column)
                thresholds = (unique_values[:-1] + unique_values[1:]) / 2
                for threshold in thresholds:
                    left = column <= threshold
                    right = ~left
                    if left.sum() == 0 or right.sum() == 0:
                        continue
                    gini_index = left.mean() * self._gini(y[left]) + \
                                 right.mean() * self._gini(y[right])
                    if gini_index < best_gini_index:
                        best_gini_index = gini_index
                        best_feature = feature
                        best_threshold = threshold
                        best_split_value = None
            else:
                # Discrete feature: CART still splits into exactly two branches; here we use a
                # simple "one value vs. the rest" scheme (full CART searches subsets of values)
                for value in np.unique(X[:, feature]):
                    left = X[:, feature] == value
                    right = ~left
                    if left.sum() == 0 or right.sum() == 0:
                        continue
                    gini_index = left.mean() * self._gini(y[left]) + \
                                 right.mean() * self._gini(y[right])
                    if gini_index < best_gini_index:
                        best_gini_index = gini_index
                        best_feature = feature
                        best_threshold = None
                        best_split_value = value
        return best_feature, best_threshold, best_split_value

    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively grow a binary tree."""
        n_samples, _ = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        if (depth >= self.max_depth or
                n_samples < self.min_samples_split or
                n_classes == 1):
            return {'leaf': True, 'value': majority}
        feature, threshold, split_value = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        if feature_types[feature] == 'continuous':
            left = X[:, feature].astype(float) <= threshold
        else:
            left = X[:, feature] == split_value
        right = ~left
        return {
            'leaf': False,
            'feature': feature,
            'threshold': threshold,      # set for continuous splits, None otherwise
            'split_value': split_value,  # set for discrete splits, None otherwise
            'left': self._build_tree(X[left], y[left], feature_types, depth + 1),
            'right': self._build_tree(X[right], y[right], feature_types, depth + 1),
        }

    def fit(self, X, y, feature_types):
        """Train the decision tree."""
        self.tree = self._build_tree(X, y, feature_types)

    def _predict_sample(self, x, tree):
        """Predict a single sample by walking down the binary tree."""
        if tree['leaf']:
            return tree['value']
        feature = tree['feature']
        if tree['threshold'] is not None:  # continuous split
            branch = 'left' if float(x[feature]) <= tree['threshold'] else 'right'
        else:                              # discrete "value vs. rest" split
            branch = 'left' if x[feature] == tree['split_value'] else 'right'
        return self._predict_sample(x, tree[branch])

    def predict(self, X):
        """Predict a batch of samples."""
        return np.array([self._predict_sample(x, self.tree) for x in X])
# Example usage
# Iris dataset (only the first two features, to keep things simple)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data[:, :2]  # use only the first two features
y = iris.target
# Turn it into a binary problem (simplified example)
y = (y == 0).astype(int)  # setosa vs. everything else
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
feature_types = ['continuous'] * 2  # two continuous features
cart = CARTClassifier(max_depth=3)
cart.fit(X_train, y_train, feature_types)
# Try a prediction
test_sample = X_test[0]
prediction = cart.predict([test_sample])
print(f"Prediction: {'Setosa' if prediction[0] == 1 else 'Not Setosa'}")
📊 Visualizing the Decision Tree (with scikit-learn)
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Use scikit-learn's CART implementation
clf = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
clf.fit(X_train, y_train)
plt.figure(figsize=(12, 8))
plot_tree(clf,
feature_names=iris.feature_names[:2],
class_names=['Not Setosa', 'Setosa'],
filled=True,
rounded=True)
plt.title("CART Decision Tree Visualization 🌲")
plt.show()
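To sanity-check the fitted tree, you can also score it on the held-out split (standard scikit-learn API; the exact number depends on the train/test split):

from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")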
🤔 How to Choose the Right Decision Tree Algorithm?
- If the data has only discrete features 🔢
  - Simple tasks: ID3
  - More robust results: C4.5
- If the data includes continuous features 📈
  - Choose C4.5 or CART
- If you need a binary tree structure 🌐
  - Choose CART (widely used in ensemble learning)
- If the task is regression 📉
  - CART is the only option of the three (regression trees)
🚀 Summary
Today we covered:
- ID3: the information gain pioneer; simple but limited
- C4.5: an upgraded ID3 that supports continuous features and pruning
- CART: a powerful binary tree algorithm for both classification and regression