Hi everyone! In the previous post we introduced what a decision tree is (👇 click to review). Today we'll look at the three classic algorithms of the decision tree family: CART, ID3, and C4.5. Like three martial arts masters 🥷, each has its own signature moves, and all three hold an important place in the machine learning world. Let's lift the veil on them one by one!
📚 Quick Comparison of the Three Decision Tree Algorithms (Cheat Sheet)
| Property | ID3 | C4.5 | CART |
| --- | --- | --- | --- |
| Task type | Classification | Classification | Classification + regression |
| Split criterion | Information gain | Gain ratio | Gini index / mean squared error |
| Tree structure | Multiway tree | Multiway tree | Binary tree |
| Continuous values | ❌ | ✅ | ✅ |
| Missing values | ❌ | ✅ | ✅ |
| Mainstream tooling | Rare | Rare | sklearn |
| Pruning | Not supported | Supported | Supported |
| Feature types | Discrete only | Discrete + continuous | Discrete + continuous |
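A practical note on the last rows: scikit-learn's tree module implements CART, but its split criterion is configurable, so you can get entropy-based splits (in the spirit of ID3/C4.5) without leaving sklearn. A minimal illustration using the standard scikit-learn API (parameter names as of recent versions):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# CART-style classification tree with the Gini index (the default criterion)
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3)
# Same binary-tree structure, but entropy-based splits, closer in spirit to ID3/C4.5
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3)
# Regression tree: minimizes the squared error within each leaf
reg = DecisionTreeRegressor(criterion='squared_error', max_depth=3)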
📊 The Three Decision Tree Algorithms in Detail
⚙️ 1. ID3: The Information Gain Pioneer
Definition: ID3 is one of the earliest decision tree algorithms. Its core idea is to split on the feature with the largest information gain. ID3 can only build decision trees over datasets with discrete attributes.
Principle: ID3 measures the impurity of a dataset $D$ with its entropy and picks the feature whose split reduces entropy the most:

$$H(D) = -\sum_{k} p_k \log_2 p_k, \qquad \mathrm{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$

where $H(D)$ is the entropy (degree of disorder) of dataset $D$, and $D_v$ is the subset of samples whose feature $A$ takes value $v$.
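As a quick worked example with the 14-sample weather dataset used in the code below (9 "Yes" and 5 "No" labels):

$$H(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$

Splitting on Outlook yields Sunny (2 Yes / 3 No), Overcast (4 Yes / 0 No) and Rain (3 Yes / 2 No), so

$$\mathrm{Gain}(D, \text{Outlook}) \approx 0.940 - \left(\tfrac{5}{14}\cdot 0.971 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971\right) \approx 0.247$$

which is the largest gain among the four features, so Outlook becomes the root split.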
Characteristics:
- 🚫 Only handles discrete features
- 🚫 Biased toward features with many distinct values
- 🚫 No pruning, so it overfits easily
Example scenarios:
- Weather forecasting (sunny / rain / overcast)
- Fruit classification (apple / banana / orange)
💻 A Simple ID3 Implementation:
import numpy as np
from collections import Counter
class ID3DecisionTree:
    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.tree = None

    def _entropy(self, y):
        """Entropy (degree of disorder) of a label vector."""
        counts = np.bincount(y)
        probs = counts / len(y)
        return -np.sum([p * np.log2(p) for p in probs if p > 0])

    def _best_split(self, X, y, feature_types):
        """Pick the discrete feature with the largest information gain."""
        best_gain = -1
        best_feature = None
        current_entropy = self._entropy(y)
        for feature in range(X.shape[1]):
            if feature_types[feature] == 'continuous':  # ID3 cannot handle continuous features
                continue
            values = np.unique(X[:, feature])
            if len(values) < 2:
                continue
            # Multiway split: weighted entropy over every value of this feature
            weighted_entropy = 0.0
            for value in values:
                subset = y[X[:, feature] == value]
                weighted_entropy += (len(subset) / len(y)) * self._entropy(subset)
            gain = current_entropy - weighted_entropy
            if gain > best_gain:
                best_gain = gain
                best_feature = feature
        return best_feature

    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively grow the decision tree."""
        n_samples, _ = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        # Stop on the depth limit, too few samples, or a pure node
        if depth >= self.max_depth or n_samples < 2 or n_classes == 1:
            return {'leaf': True, 'value': majority}
        feature = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        subtrees = {}
        for value in np.unique(X[:, feature]):
            indices = X[:, feature] == value
            subtrees[value] = self._build_tree(X[indices], y[indices], feature_types, depth + 1)
        return {
            'leaf': False,
            'feature': feature,
            'subtrees': subtrees,
            'default': majority,  # fallback class for feature values unseen at this node
        }

    def fit(self, X, y, feature_types):
        """Train the decision tree."""
        self.tree = self._build_tree(X, y, feature_types)

    def _predict_sample(self, x, tree):
        """Predict a single sample."""
        if tree['leaf']:
            return tree['value']
        feature_value = x[tree['feature']]
        if feature_value not in tree['subtrees']:
            # Unseen feature value: fall back to the majority class at this node
            return tree['default']
        return self._predict_sample(x, tree['subtrees'][feature_value])

    def predict(self, X):
        """Predict a batch of samples."""
        return np.array([self._predict_sample(x, self.tree) for x in X])
# Example usage: the classic "play tennis" weather dataset
data = np.array([
['Sunny', 'Hot', 'High', 'Weak', 'No'],
['Sunny', 'Hot', 'High', 'Strong', 'No'],
['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Strong', 'No'],
['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
['Sunny', 'Mild', 'High', 'Weak', 'No'],
['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Strong', 'No']
])
X = data[:, :-1]
y = np.array([1 if label == 'Yes' else 0 for label in data[:, -1]])
feature_types = ['discrete'] * 4  # all four features are discrete
tree = ID3DecisionTree(max_depth=3)
tree.fit(X, y, feature_types)
# Try a prediction
test_sample = np.array(['Sunny', 'Cool', 'High', 'Strong'])
prediction = tree.predict([test_sample])
print(f"预测结果: {'Yes' if prediction[0] == 1 else 'No'}") # 输出: No
📏 2. C4.5: The Gain-Ratio Upgrade to ID3
Definition: C4.5 is an improved version of ID3 that uses the information gain ratio to correct the bias toward many-valued features. It addresses several ID3 shortcomings:
- Supports continuous features (via binary splits on a threshold)
- Uses the gain ratio instead of raw information gain (avoids favoring features with many values)
- Supports pruning (to reduce overfitting)
Principle: C4.5 normalizes the information gain by the feature's intrinsic value:

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{IV(A)}, \qquad IV(A) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$$

where $IV(A)$ is the intrinsic value of feature $A$; it grows with the number of distinct values, so many-valued features are penalized.
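Continuing the weather example from the ID3 section: Outlook takes the values Sunny, Overcast and Rain on 5, 4 and 5 of the 14 samples, so roughly

$$IV(\text{Outlook}) = -\tfrac{5}{14}\log_2\tfrac{5}{14} - \tfrac{4}{14}\log_2\tfrac{4}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 1.577, \qquad \mathrm{GainRatio}(D, \text{Outlook}) \approx \frac{0.247}{1.577} \approx 0.16$$

A feature with one distinct value per sample (such as an ID column) would have a huge $IV$, so its gain ratio collapses even though its raw information gain is maximal.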
Characteristics:
- ✅ Supports continuous features
- ✅ Uses the information gain ratio
- ✅ Supports pruning
- ✅ Produces a multiway tree
Example scenarios:
- Medical diagnosis (symptoms plus numeric measurements)
- Credit scoring (categorical features plus continuous ones such as income)
💻 C4.5: Key Improvements in Code:
class C45DecisionTree(ID3DecisionTree):
    def __init__(self, max_depth=3):
        super().__init__(max_depth)

    def _information_value(self, X, feature):
        """Intrinsic value IV(a) of a discrete feature."""
        _, counts = np.unique(X[:, feature], return_counts=True)
        probs = counts / len(X)
        return -np.sum(probs * np.log2(probs))

    def _best_split(self, X, y, feature_types):
        """Pick the feature (and threshold, for continuous ones) with the largest gain ratio."""
        best_gain_ratio = -1
        best_feature = None
        best_threshold = None
        current_entropy = self._entropy(y)
        for feature in range(X.shape[1]):
            if feature_types[feature] == 'continuous':
                # Continuous feature: candidate thresholds are midpoints between adjacent sorted values
                column = X[:, feature].astype(float)
                unique_values = np.unique(column)
                thresholds = (unique_values[:-1] + unique_values[1:]) / 2
                for threshold in thresholds:
                    left = column <= threshold
                    right = ~left
                    if left.sum() == 0 or right.sum() == 0:
                        continue
                    p_left, p_right = left.mean(), right.mean()
                    gain = current_entropy \
                        - p_left * self._entropy(y[left]) \
                        - p_right * self._entropy(y[right])
                    # Intrinsic value of the induced binary partition
                    iv = -(p_left * np.log2(p_left) + p_right * np.log2(p_right))
                    gain_ratio = gain / iv if iv > 0 else 0
                    if gain_ratio > best_gain_ratio:
                        best_gain_ratio = gain_ratio
                        best_feature = feature
                        best_threshold = threshold
            else:
                # Discrete feature: multiway split over all of its values
                values = np.unique(X[:, feature])
                if len(values) < 2:
                    continue
                weighted_entropy = 0.0
                for value in values:
                    subset = y[X[:, feature] == value]
                    weighted_entropy += (len(subset) / len(y)) * self._entropy(subset)
                gain = current_entropy - weighted_entropy
                iv = self._information_value(X, feature)
                gain_ratio = gain / iv if iv > 0 else 0
                if gain_ratio > best_gain_ratio:
                    best_gain_ratio = gain_ratio
                    best_feature = feature
                    best_threshold = None
        return best_feature, best_threshold

    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively grow the tree (continuous features get binary splits)."""
        n_samples, _ = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        if depth >= self.max_depth or n_samples < 2 or n_classes == 1:
            return {'leaf': True, 'value': majority}
        feature, threshold = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        if feature_types[feature] == 'continuous':
            # Continuous feature: binary split at the chosen threshold
            left = X[:, feature].astype(float) <= threshold
            right = ~left
            return {
                'leaf': False,
                'feature': feature,
                'threshold': threshold,
                'left': self._build_tree(X[left], y[left], feature_types, depth + 1),
                'right': self._build_tree(X[right], y[right], feature_types, depth + 1),
            }
        else:
            # Discrete feature: one branch per value (multiway split)
            subtrees = {}
            for value in np.unique(X[:, feature]):
                indices = X[:, feature] == value
                subtrees[value] = self._build_tree(X[indices], y[indices], feature_types, depth + 1)
            return {
                'leaf': False,
                'feature': feature,
                'subtrees': subtrees,
                'default': majority,  # fallback class for feature values unseen at this node
            }

    def _predict_sample(self, x, tree):
        """Predict a single sample, handling both threshold and multiway nodes."""
        if tree['leaf']:
            return tree['value']
        if 'threshold' in tree:  # continuous-feature node
            branch = 'left' if float(x[tree['feature']]) <= tree['threshold'] else 'right'
            return self._predict_sample(x, tree[branch])
        value = x[tree['feature']]
        if value not in tree['subtrees']:
            return tree['default']
        return self._predict_sample(x, tree['subtrees'][value])
# Usage example: the data format needs to change so that it mixes discrete and continuous features
# (a small illustrative sketch follows below)
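As an illustration, here is a minimal, hypothetical usage sketch for the class above, with one discrete column and one continuous column (the feature names and data are made up purely for demonstration):

# Hypothetical toy data: loan approval from employment type (discrete)
# and income in thousands (continuous)
X_mixed = np.array([
    ['Salaried', 30.0],
    ['Salaried', 55.0],
    ['SelfEmployed', 22.0],
    ['SelfEmployed', 80.0],
    ['Salaried', 18.0],
    ['SelfEmployed', 40.0],
], dtype=object)
y_mixed = np.array([1, 1, 0, 1, 0, 0])
feature_types_mixed = ['discrete', 'continuous']

c45 = C45DecisionTree(max_depth=3)
c45.fit(X_mixed, y_mixed, feature_types_mixed)
print(c45.predict(np.array([['Salaried', 60.0]], dtype=object)))  # predicted class for a new applicant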
🎯 3. CART: The Versatile Binary Tree
Definition: CART (Classification And Regression Trees) uses the Gini index to grow binary trees. It is a powerful decision tree algorithm whose main traits are:
- It always produces a binary tree
- Classification uses the Gini index; regression uses the squared error
- It supports pruning
Principle: for classification, CART chooses the binary split that minimizes the weighted Gini impurity of the two children; for regression, it minimizes the weighted squared error (each leaf predicts the mean of its samples).
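The standard definitions (the same ones used in the code below) are

$$\mathrm{Gini}(D) = 1 - \sum_{k} p_k^2, \qquad \mathrm{Gini}(D, A) = \frac{|D_L|}{|D|}\,\mathrm{Gini}(D_L) + \frac{|D_R|}{|D|}\,\mathrm{Gini}(D_R)$$

where $D_L$ and $D_R$ are the two halves produced by a candidate binary split. For the 9-Yes / 5-No weather data above, for example, $\mathrm{Gini}(D) = 1 - (9/14)^2 - (5/14)^2 \approx 0.459$.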
Characteristics:
- ✅ Binary tree structure
- ✅ Supports both classification and regression
- ✅ Numerically stable
- ✅ Widely used in ensemble learning (e.g. random forests)
Example scenarios:
- House price prediction (regression)
- Customer segmentation (classification)
💻 A CART Classification Tree Implementation:
class CARTClassifier:
    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def _gini(self, y):
        """Gini impurity of a label vector."""
        counts = np.bincount(y)
        probs = counts / len(y)
        return 1 - np.sum(probs ** 2)

    def _best_split(self, X, y, feature_types):
        """Find the binary split (feature plus threshold or value) with the lowest weighted Gini."""
        best_gini_index = float('inf')
        best_feature = None
        best_threshold = None    # used for continuous features
        best_split_value = None  # used for discrete features ("value vs. rest" split)
        for feature in range(X.shape[1]):
            if feature_types[feature] == 'continuous':
                # Continuous feature: candidate thresholds are midpoints between adjacent sorted values
                column = X[:, feature].astype(float)
                unique_values = np.unique(column)
                thresholds = (unique_values[:-1] + unique_values[1:]) / 2
                for threshold in thresholds:
                    left = column <= threshold
                    right = ~left
                    if left.sum() == 0 or right.sum() == 0:
                        continue
                    gini_index = left.mean() * self._gini(y[left]) + \
                                 right.mean() * self._gini(y[right])
                    if gini_index < best_gini_index:
                        best_gini_index = gini_index
                        best_feature = feature
                        best_threshold = threshold
                        best_split_value = None
            else:
                # Discrete feature: CART still splits into exactly two branches; here we use a
                # simple "one value vs. the rest" scheme (full CART searches subsets of values)
                for value in np.unique(X[:, feature]):
                    left = X[:, feature] == value
                    right = ~left
                    if left.sum() == 0 or right.sum() == 0:
                        continue
                    gini_index = left.mean() * self._gini(y[left]) + \
                                 right.mean() * self._gini(y[right])
                    if gini_index < best_gini_index:
                        best_gini_index = gini_index
                        best_feature = feature
                        best_threshold = None
                        best_split_value = value
        return best_feature, best_threshold, best_split_value

    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively grow a binary tree."""
        n_samples, _ = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        if (depth >= self.max_depth or
                n_samples < self.min_samples_split or
                n_classes == 1):
            return {'leaf': True, 'value': majority}
        feature, threshold, split_value = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        if feature_types[feature] == 'continuous':
            left = X[:, feature].astype(float) <= threshold
        else:
            left = X[:, feature] == split_value
        right = ~left
        return {
            'leaf': False,
            'feature': feature,
            'threshold': threshold,      # set for continuous splits, None otherwise
            'split_value': split_value,  # set for discrete splits, None otherwise
            'left': self._build_tree(X[left], y[left], feature_types, depth + 1),
            'right': self._build_tree(X[right], y[right], feature_types, depth + 1),
        }

    def fit(self, X, y, feature_types):
        """Train the decision tree."""
        self.tree = self._build_tree(X, y, feature_types)

    def _predict_sample(self, x, tree):
        """Predict a single sample by walking down the binary tree."""
        if tree['leaf']:
            return tree['value']
        feature = tree['feature']
        if tree['threshold'] is not None:  # continuous split
            branch = 'left' if float(x[feature]) <= tree['threshold'] else 'right'
        else:                              # discrete "value vs. rest" split
            branch = 'left' if x[feature] == tree['split_value'] else 'right'
        return self._predict_sample(x, tree[branch])

    def predict(self, X):
        """Predict a batch of samples."""
        return np.array([self._predict_sample(x, self.tree) for x in X])
# Example usage
# Iris dataset (only the first two features, to keep things simple)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data[:, :2]  # use only the first two features
y = iris.target
# Turn it into a binary problem (simplified example)
y = (y == 0).astype(int)  # setosa vs. everything else
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
feature_types = ['continuous'] * 2  # two continuous features
cart = CARTClassifier(max_depth=3)
cart.fit(X_train, y_train, feature_types)
# Try a prediction
test_sample = X_test[0]
prediction = cart.predict([test_sample])
print(f"Prediction: {'Setosa' if prediction[0] == 1 else 'Not Setosa'}")
📊 Visualizing the Decision Tree (with scikit-learn)
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Use scikit-learn's CART implementation
clf = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
clf.fit(X_train, y_train)
plt.figure(figsize=(12, 8))
plot_tree(clf,
feature_names=iris.feature_names[:2],
class_names=['Not Setosa', 'Setosa'],
filled=True,
rounded=True)
plt.title("CART Decision Tree Visualization 🌲")
plt.show()
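To sanity-check the fitted tree, you can also score it on the held-out split (standard scikit-learn API; the exact number depends on the train/test split):

from sklearn.metrics import accuracy_score
y_pred = clf.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")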
🤔 How to Choose the Right Decision Tree Algorithm?
- If the data has only discrete features 🔢
  - Simple tasks: ID3
  - More robust results: C4.5
- If the data includes continuous features 📈
  - Choose C4.5 or CART
- If you need a binary tree structure 🌐
  - Choose CART (widely used in ensemble learning)
- If the task is regression 📉
  - CART is the only option of the three (regression trees)
🚀 Summary
Today we covered:
- ID3: the information gain pioneer; simple but limited
- C4.5: an upgraded ID3 that supports continuous features and pruning
- CART: a powerful binary tree algorithm for both classification and regression