The Decision Tree Trio: CART, ID3, and C4.5 Explained (with Code)

Hi everyone! In the last post we covered what a decision tree is (👇 click to review). Today we'll look at the three classic algorithms of the decision tree family: CART, ID3, and C4.5. Like three martial-arts masters 🥷, each has its own signature moves and an important place in the machine learning world. Let's pull back the curtain on all three!

📚 The Three Decision Tree Algorithms at a Glance (Cheat Sheet)

| Property            | ID3              | C4.5                   | CART                            |
|---------------------|------------------|------------------------|---------------------------------|
| Task type           | Classification   | Classification         | Classification + regression     |
| Split criterion     | Information gain | Gain ratio             | Gini index / mean squared error |
| Tree structure      | Multiway         | Multiway               | Binary                          |
| Continuous features | ❌               | ✅ (binary thresholds) | ✅                              |
| Missing values      | ❌               | ✅                     | ✅                              |
| Mainstream tooling  | —                | —                      | sklearn                         |
| Pruning             | ❌ Not supported | ✅ Supported           | ✅ Supported                    |
| Feature types       | Discrete only    | Discrete + continuous  | Discrete + continuous           |
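
One practical note on the tooling row: scikit-learn's DecisionTreeClassifier implements an optimized CART only. It always builds binary trees; the criterion parameter just swaps the impurity measure and does not turn the tree into ID3 or C4.5:

from sklearn.tree import DecisionTreeClassifier

# Both of these build binary CART trees; only the impurity measure differs
clf_gini = DecisionTreeClassifier(criterion='gini')        # classic CART criterion
clf_entropy = DecisionTreeClassifier(criterion='entropy')  # entropy-based splits, still binary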

📊 The Three Decision Tree Algorithms in Detail

⚙️ 1. ID3: The Information-Gain Pioneer

Definition: ID3 is one of the earliest decision tree algorithms. Its core idea is to split on the feature with the largest information gain. ID3 can only build trees over discrete attributes.

Principle: with entropy H(D) = −Σ_k p_k · log2(p_k) measuring the impurity of dataset D, ID3 chooses the feature a that maximizes

  Gain(D, a) = H(D) − Σ_v (|D_v| / |D|) · H(D_v)

  • H(D): entropy (impurity) of dataset D
  • D_v: the subset of D where feature a takes value v

Characteristics:

  • 🚫 Handles only discrete features
  • 🚫 Biased toward features with many distinct values
  • 🚫 No pruning, so it overfits easily
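
To make the formula concrete, here is a minimal sketch (a hand calculation, separate from the class below) that recomputes H(D) and Gain(D, Outlook) for the 14-sample play-tennis dataset used in the full implementation further down (9 "Yes" vs. 5 "No" overall):

import numpy as np

def entropy(probs):
    """Entropy of a class distribution given as probabilities."""
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))

H_D = entropy([9/14, 5/14])  # ≈ 0.940

# Outlook partitions D into Sunny (2 Yes / 3 No), Overcast (4 / 0), Rain (3 / 2)
weighted = 5/14 * entropy([2/5, 3/5]) + 4/14 * entropy([1.0]) + 5/14 * entropy([3/5, 2/5])
print(f"Gain(D, Outlook) = {H_D - weighted:.3f}")  # ≈ 0.247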

Example scenarios:

  • Weather prediction (sunny / rain / overcast)
  • Fruit classification (apple / banana / orange)

💻 A simple ID3 implementation:

import numpy as np
from collections import Counter

class ID3DecisionTree:
    def __init__(self, max_depth=3):
        self.max_depth = max_depth
        self.tree = None
    
    def _entropy(self, y):
        """计算熵"""
        counts = np.bincount(y)
        probs = counts / len(y)
        return -np.sum([p * np.log2(p) for p in probs if p > 0])
    
    def _best_split(self, X, y, feature_types):
        """Find the best feature to split on (discrete features only)."""
        best_gain = -1
        best_feature = None
        current_entropy = self._entropy(y)
        
        for feature in range(X.shape[1]):
            if feature_types[feature] == 'continuous':  # ID3 cannot handle continuous features
                continue
            
            values = np.unique(X[:, feature])
            if len(values) < 2:
                continue
            
            # Multiway information gain: H(D) - sum_v |D_v|/|D| * H(D_v)
            weighted_entropy = 0.0
            for value in values:
                subset = y[X[:, feature] == value]
                weighted_entropy += (len(subset) / len(y)) * self._entropy(subset)
            
            gain = current_entropy - weighted_entropy
            if gain > best_gain:
                best_gain = gain
                best_feature = feature
        
        return best_feature
    
    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively build the tree."""
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        
        if (depth >= self.max_depth or 
            n_samples < 2 or
            n_classes == 1):
            return {'leaf': True, 'value': majority}
        
        feature = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        
        values = np.unique(X[:, feature])
        subtrees = {}
        for value in values:
            indices = X[:, feature] == value
            subtrees[value] = self._build_tree(X[indices], y[indices], feature_types, depth+1)
        
        return {
            'leaf': False,
            'feature': feature,
            'subtrees': subtrees,
            'default': majority  # fallback class for feature values unseen at this node
        }
    
    def fit(self, X, y, feature_types):
        """Train the tree."""
        self.tree = self._build_tree(X, y, feature_types)
    
    def _predict_sample(self, x, tree):
        """Predict a single sample."""
        if tree['leaf']:
            return tree['value']
        
        feature_value = x[tree['feature']]
        if feature_value not in tree['subtrees']:
            # Unseen feature value: fall back to the majority class at this node
            return tree['default']
        
        return self._predict_sample(x, tree['subtrees'][feature_value])
    
    def predict(self, X):
        """Predict a batch of samples."""
        return np.array([self._predict_sample(x, self.tree) for x in X])

# Example usage: the classic play-tennis dataset
data = np.array([
    ['Sunny', 'Hot', 'High', 'Weak', 'No'],
    ['Sunny', 'Hot', 'High', 'Strong', 'No'],
    ['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Cool', 'Normal', 'Strong', 'No'],
    ['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
    ['Sunny', 'Mild', 'High', 'Weak', 'No'],
    ['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
    ['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
    ['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
    ['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
    ['Rain', 'Mild', 'High', 'Strong', 'No']
])

X = data[:, :-1]
y = np.array([1 if label == 'Yes' else 0 for label in data[:, -1]])
feature_types = ['discrete'] * 4  # all four features are discrete

tree = ID3DecisionTree(max_depth=3)
tree.fit(X, y, feature_types)

# Test a prediction
test_sample = np.array(['Sunny', 'Cool', 'High', 'Strong'])
prediction = tree.predict([test_sample])
print(f"Prediction: {'Yes' if prediction[0] == 1 else 'No'}")  # Output: No

📏 2. C4.5: The Gain-Ratio Refinement of ID3

Definition: C4.5 is an improved version of ID3 that uses the gain ratio to correct the bias toward many-valued features. It addresses three main shortcomings of ID3:

  1. Supports continuous features (via binary thresholding)
  2. Uses the gain ratio instead of raw information gain (avoids favoring many-valued features)
  3. Supports pruning (to limit overfitting)

Principle: the gain ratio normalizes the information gain by the feature's intrinsic value:

  GainRatio(D, a) = Gain(D, a) / IV(a),  where IV(a) = −Σ_v (|D_v| / |D|) · log2(|D_v| / |D|)

  • IV(a): the intrinsic value of feature a (penalizes many-valued features)

Characteristics:

  • ✅ Supports continuous features
  • ✅ Uses the gain ratio
  • ✅ Supports pruning
  • ✅ Builds multiway trees
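
Continuing the play-tennis numbers from the ID3 section: Outlook splits the 14 samples 5/4/5, so its intrinsic value and gain ratio work out as in this small hand-calculation sketch (independent of the class below):

import numpy as np

def entropy(probs):
    """Entropy of a probability distribution."""
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))

gain_outlook = 0.247                      # information gain from the ID3 example
iv_outlook = entropy([5/14, 4/14, 5/14])  # intrinsic value ≈ 1.577
print(f"GainRatio(D, Outlook) = {gain_outlook / iv_outlook:.3f}")  # ≈ 0.157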

Example scenarios:

  • Medical diagnosis (symptoms + numeric measurements)
  • Credit scoring (discrete features + continuous ones such as income)

💻 The key C4.5 improvements as a code snippet:

class C45DecisionTree(ID3DecisionTree):
    def __init__(self, max_depth=3):
        super().__init__(max_depth)
    
    def _information_value(self, X, feature):
        """Intrinsic value IV(a) of a discrete feature (penalizes many-valued features)."""
        _, counts = np.unique(X[:, feature], return_counts=True)
        probs = counts / len(X)
        return -np.sum(probs * np.log2(probs))
    
    def _split_iv(self, sizes, total):
        """Intrinsic value of a split, given the sizes of its subsets."""
        probs = np.array(sizes) / total
        probs = probs[probs > 0]
        return -np.sum(probs * np.log2(probs))
    
    def _best_split(self, X, y, feature_types):
        """Find the split with the highest gain ratio (supports continuous features)."""
        best_gain_ratio = -1
        best_feature = None
        best_threshold = None
        current_entropy = self._entropy(y)
        
        for feature in range(X.shape[1]):
            # Continuous feature: candidate thresholds are midpoints of consecutive sorted values
            if feature_types[feature] == 'continuous':
                column = X[:, feature].astype(float)
                unique_values = np.unique(column)
                thresholds = (unique_values[:-1] + unique_values[1:]) / 2
                
                for threshold in thresholds:
                    left, right = y[column <= threshold], y[column > threshold]
                    if len(left) == 0 or len(right) == 0:
                        continue
                    
                    gain = current_entropy - \
                           (len(left)/len(y)) * self._entropy(left) - \
                           (len(right)/len(y)) * self._entropy(right)
                    
                    # The IV of a binary threshold split comes from the two subset proportions
                    iv = self._split_iv([len(left), len(right)], len(y))
                    gain_ratio = gain / iv if iv > 0 else 0
                    
                    if gain_ratio > best_gain_ratio:
                        best_gain_ratio = gain_ratio
                        best_feature = feature
                        best_threshold = threshold
            
            # Discrete feature: multiway split, one branch per value
            else:
                values = np.unique(X[:, feature])
                if len(values) < 2:
                    continue
                
                weighted_entropy = 0.0
                for value in values:
                    subset = y[X[:, feature] == value]
                    weighted_entropy += (len(subset)/len(y)) * self._entropy(subset)
                
                gain = current_entropy - weighted_entropy
                iv = self._information_value(X, feature)
                gain_ratio = gain / iv if iv > 0 else 0
                
                if gain_ratio > best_gain_ratio:
                    best_gain_ratio = gain_ratio
                    best_feature = feature
                    best_threshold = None  # multiway splits carry no threshold
        
        return best_feature, best_threshold
    
    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively build the tree (supports continuous features)."""
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        majority = Counter(y).most_common(1)[0][0]
        
        if (depth >= self.max_depth or 
            n_samples < 2 or
            n_classes == 1):
            return {'leaf': True, 'value': majority}
        
        feature, threshold = self._best_split(X, y, feature_types)
        if feature is None:
            return {'leaf': True, 'value': majority}
        
        if feature_types[feature] == 'continuous':
            # Continuous feature: binary split at the chosen threshold
            column = X[:, feature].astype(float)
            left_indices = column <= threshold
            right_indices = ~left_indices
            
            left_subtree = self._build_tree(X[left_indices], y[left_indices], feature_types, depth+1)
            right_subtree = self._build_tree(X[right_indices], y[right_indices], feature_types, depth+1)
            
            return {
                'leaf': False,
                'feature': feature,
                'threshold': threshold,
                'left': left_subtree,
                'right': right_subtree
            }
        else:
            # Discrete feature: one subtree per observed value
            values = np.unique(X[:, feature])
            subtrees = {}
            for value in values:
                indices = X[:, feature] == value
                subtrees[value] = self._build_tree(X[indices], y[indices], feature_types, depth+1)
            
            return {
                'leaf': False,
                'feature': feature,
                'subtrees': subtrees,
                'default': majority  # fallback class for unseen feature values
            }
    
    def _predict_sample(self, x, tree):
        """Predict a single sample (handles both threshold and multiway nodes)."""
        if tree['leaf']:
            return tree['value']
        
        if 'threshold' in tree:  # continuous split
            branch = tree['left'] if float(x[tree['feature']]) <= tree['threshold'] else tree['right']
            return self._predict_sample(x, branch)
        
        feature_value = x[tree['feature']]
        if feature_value not in tree['subtrees']:
            return tree['default']
        return self._predict_sample(x, tree['subtrees'][feature_value])

# A minimal usage example follows below.
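
To exercise both the discrete and the continuous code paths end to end, here is a minimal usage sketch on a tiny made-up dataset (the feature names and values are illustrative only):

# Toy mixed data: one discrete feature (Outlook) and one continuous feature (Temperature)
X_mixed = np.array([
    ['Sunny', 30.0], ['Sunny', 27.0], ['Overcast', 28.0], ['Rain', 21.0],
    ['Rain', 18.0], ['Overcast', 17.0], ['Sunny', 22.0], ['Rain', 24.0],
], dtype=object)
y_mixed = np.array([0, 0, 1, 1, 1, 1, 1, 0])

c45 = C45DecisionTree(max_depth=2)
c45.fit(X_mixed, y_mixed, feature_types=['discrete', 'continuous'])
print(c45.predict(np.array([['Sunny', 19.0]], dtype=object)))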

🎯 3. CART: The All-Purpose Binary Tree

Definition: CART (Classification And Regression Trees) grows binary trees using the Gini index. It is a powerful algorithm whose hallmarks are:

  1. Always produces binary trees
  2. Uses the Gini index for classification and squared error for regression
  3. Supports pruning

Principle: for classification, CART picks the split that minimizes the weighted Gini impurity of the children:

  Gini(D) = 1 − Σ_k p_k²,   GiniIndex(D, a) = Σ_v (|D_v| / |D|) · Gini(D_v)

For regression, the squared error of the children takes the place of the Gini impurity.

Characteristics:

  • ✅ Binary tree structure
  • ✅ Supports both classification and regression
  • ✅ Numerically stable
  • ✅ Widely used in ensemble learning (e.g. random forests)
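
A two-line check of the Gini formula (pure nodes score 0; a 40/60 mixture scores 0.48):

import numpy as np

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - np.sum(np.asarray(probs) ** 2)

print(gini([0.4, 0.6]))  # 0.48 (mixed node)
print(gini([1.0]))       # 0.0  (pure node)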

Example scenarios:

  • House price prediction (regression)
  • Customer segmentation (classification)

💻 A CART classification tree implementation:

class CARTClassifier:
    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None
    
    def _gini(self, y):
        """计算基尼值"""
        counts = np.bincount(y)
        probs = counts / len(y)
        return 1 - np.sum(probs ** 2)
    
    def _best_split(self, X, y, feature_types):
        """Find the best binary split (continuous and discrete features)."""
        best_gini_index = float('inf')
        best_feature = None
        best_threshold = None    # used for continuous splits
        best_split_value = None  # used for discrete "value vs. rest" splits
        
        for feature in range(X.shape[1]):
            # Continuous feature: candidate thresholds are midpoints of consecutive sorted values
            if feature_types[feature] == 'continuous':
                unique_values = np.unique(X[:, feature])
                thresholds = (unique_values[:-1] + unique_values[1:]) / 2
                
                for threshold in thresholds:
                    left_indices = X[:, feature] <= threshold
                    right_indices = ~left_indices
                    
                    if left_indices.sum() == 0 or right_indices.sum() == 0:
                        continue
                    
                    gini_index = (left_indices.sum()/len(y)) * self._gini(y[left_indices]) + \
                                 (right_indices.sum()/len(y)) * self._gini(y[right_indices])
                    
                    if gini_index < best_gini_index:
                        best_gini_index = gini_index
                        best_feature = feature
                        best_threshold = threshold
                        best_split_value = None
            
            # Discrete feature: CART binarizes it too; here we try each
            # "one value vs. the rest" split (full CART considers all value subsets)
            else:
                for value in np.unique(X[:, feature]):
                    left_indices = X[:, feature] == value
                    right_indices = ~left_indices
                    
                    if left_indices.sum() == 0 or right_indices.sum() == 0:
                        continue
                    
                    gini_index = (left_indices.sum()/len(y)) * self._gini(y[left_indices]) + \
                                 (right_indices.sum()/len(y)) * self._gini(y[right_indices])
                    
                    if gini_index < best_gini_index:
                        best_gini_index = gini_index
                        best_feature = feature
                        best_threshold = None
                        best_split_value = value
        
        return best_feature, best_threshold, best_split_value
    
    def _build_tree(self, X, y, feature_types, depth=0):
        """Recursively build the tree."""
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        
        if (depth >= self.max_depth or 
            n_samples < self.min_samples_split or
            n_classes == 1):
            leaf_value = Counter(y).most_common(1)[0][0]
            return {'leaf': True, 'value': leaf_value}
        
        feature, threshold, split_value = self._best_split(X, y, feature_types)
        if feature is None:
            leaf_value = Counter(y).most_common(1)[0][0]
            return {'leaf': True, 'value': leaf_value}
        
        if feature_types[feature] == 'continuous':
            # Continuous feature: binary split at the threshold
            left_indices = X[:, feature] <= threshold
        else:
            # Discrete feature: "split_value vs. the rest" binary split
            left_indices = X[:, feature] == split_value
        right_indices = ~left_indices
        
        left_subtree = self._build_tree(X[left_indices], y[left_indices], feature_types, depth+1)
        right_subtree = self._build_tree(X[right_indices], y[right_indices], feature_types, depth+1)
        
        node = {
            'leaf': False,
            'feature': feature,
            'left': left_subtree,
            'right': right_subtree
        }
        if feature_types[feature] == 'continuous':
            node['threshold'] = threshold
        else:
            node['split_value'] = split_value
        return node
    
    def fit(self, X, y, feature_types):
        """Train the tree."""
        self.tree = self._build_tree(X, y, feature_types)
    
    def _predict_sample(self, x, tree):
        """Predict a single sample."""
        if tree['leaf']:
            return tree['value']
        
        feature = tree['feature']
        if 'threshold' in tree:  # continuous split
            if x[feature] <= tree['threshold']:
                return self._predict_sample(x, tree['left'])
            else:
                return self._predict_sample(x, tree['right'])
        else:  # discrete "value vs. rest" split
            if x[feature] == tree['split_value']:
                return self._predict_sample(x, tree['left'])
            else:
                return self._predict_sample(x, tree['right'])
    
    def predict(self, X):
        """Predict a batch of samples."""
        return np.array([self._predict_sample(x, self.tree) for x in X])

# Example usage: the iris dataset (first two features only, to keep things simple)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :2]  # use only the first two features
y = iris.target

# Reduce to a binary problem (simplified example)
y = (y == 0).astype(int)  # setosa vs. everything else

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
feature_types = ['continuous'] * 2  # both features are continuous

cart = CARTClassifier(max_depth=3)
cart.fit(X_train, y_train, feature_types)

# Test a prediction
test_sample = X_test[0]
prediction = cart.predict([test_sample])
print(f"Prediction: {'Setosa' if prediction[0] == 1 else 'Not Setosa'}")

📊 Decision Tree Visualization (with scikit-learn)

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Use scikit-learn's built-in CART implementation
clf = DecisionTreeClassifier(max_depth=3, criterion='gini', random_state=42)
clf.fit(X_train, y_train)

plt.figure(figsize=(12, 8))
plot_tree(clf, 
          feature_names=iris.feature_names[:2], 
          class_names=['Not Setosa', 'Setosa'],
          filled=True, 
          rounded=True)
plt.title("CART Decision Tree Visualization 🌲")
plt.show()

🤔 How to Choose the Right Decision Tree Algorithm?

  1. If the data has only discrete features 🔢
    • Simple tasks: ID3
    • More stable results: C4.5
  2. If the data includes continuous features 📈
    • Choose C4.5 or CART
  3. If you need a binary tree structure 🌐
    • Choose CART (widely used in ensemble learning)
  4. If the task is regression 📉
    • You must use CART (regression trees); see the sketch after this list
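
For the regression case, here is a minimal sketch with scikit-learn's DecisionTreeRegressor, which is likewise a CART implementation (the synthetic sine data is illustrative only):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only)
rng = np.random.RandomState(42)
X_reg = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y_reg = np.sin(X_reg).ravel() + rng.normal(0, 0.1, 80)

reg = DecisionTreeRegressor(max_depth=3)  # splits minimize squared error by default
reg.fit(X_reg, y_reg)
print(reg.predict([[2.5]]))  # piecewise-constant estimate near sin(2.5) ≈ 0.60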

🚀 Summary

Today we covered:

  1. ID3: the information-gain pioneer, simple but limited
  2. C4.5: ID3's upgrade, adding continuous features and pruning
  3. CART: a powerful binary-tree algorithm for both classification and regression

Further reading:

1. Machine Learning vs. Neural Networks: a programmer's "hand-brewed coffee vs. fully automatic coffee machine" showdown

2. One Algorithm to Tame Classification! Decision Tree Principles and Practice Explained (with Python Code)
