Machine Learning in Action: Chap03. Decision Tree

Latest recommended article published on 2025-06-01 18:19:02

Ivan-Zhang



Category: MachineLearning. Tags: Python, ML

Article link: https://blog.youkuaiyun.com/weixin_41728431/article/details/103103724


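This article walks through decision trees in practice: the principles of the algorithm, how a tree is built, and a Python implementation. Worked examples show how to classify with a decision tree, and the article discusses its strengths, weaknesses, and suitable use cases.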


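"""
The decision tree is stored as a nested dictionary:
{'有自己的房子':{0:{'有工作':{0:'no', 1:'yes'}}, 1:'yes'}}
年龄 (age): 0 = young, 1 = middle-aged, 2 = elderly
有工作 (has a job): 0 = no, 1 = yes
有自己的房子 (owns a house): 0 = no, 1 = yes
信贷情况 (credit rating): 0 = fair, 1 = good, 2 = very good
Class (whether the loan is granted): 'no' or 'yes'
The pickle module can save a built tree so that it can be loaded directly next time
"""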
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
from math import log
import operator
import pickle

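"""
Desc:
    Create the test data set

Parameters:
    None

Returns:
    dataSet - the data set
    labels - the feature names
"""
def createDataSet():
    # Data set: each row is [age, has a job, owns a house, credit rating, class]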
    dataSet = [[0, 0, 0, 0, 'no'],
               [0, 0, 0, 1, 'no'],
               [0, 1, 0, 1, 'yes'],
               [0, 1, 1, 0, 'yes'],
               [0, 0, 0, 0, 'no'],
               [1, 0, 0, 0, 'no'],
               [1, 0, 0, 1, 'no'],
               [1, 1, 1, 1, 'yes'],
               [1, 0, 1, 2, 'yes'],
               [1, 0, 1, 2, 'yes'],
               [2, 0, 1, 2, 'yes'],
               [2, 0, 1, 1, 'yes'],
               [2, 1, 0, 1, 'yes'],
               [2, 1, 0, 2, 'yes'],
               [2, 0, 0, 0, 'no']]
    # Feature names: age, has a job, owns a house, credit rating
    labels = ['年龄', '有工作', '有自己的房子', '信贷情况']
    # Return the data set and the feature names
    return dataSet, labels
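
The header comment stores a finished tree as a nested dictionary keyed by feature name and then by feature value. As a minimal sketch of how such a dictionary could be used, the block below walks it to classify one sample and round-trips it through `pickle`. The helper name `classify` and the sample rows are assumptions for illustration, not code from the original post.

```python
import pickle

# Hypothetical helper (not from the original post): walk the nested-dict tree.
def classify(tree, feat_labels, sample):
    # An inner node is a dict {feature_name: {feature_value: subtree_or_leaf}};
    # a leaf is any non-dict value (here the class label 'yes' or 'no').
    while isinstance(tree, dict):
        feature = next(iter(tree))                  # the single feature name at this node
        value = sample[feat_labels.index(feature)]  # the sample's value for that feature
        tree = tree[feature][value]                 # KeyError if the value was never seen
    return tree

tree = {'有自己的房子': {0: {'有工作': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
labels = ['年龄', '有工作', '有自己的房子', '信贷情况']

result_a = classify(tree, labels, [0, 1, 0, 1])  # no house, has a job
result_b = classify(tree, labels, [2, 0, 0, 0])  # no house, no job
result_c = classify(tree, labels, [1, 0, 1, 2])  # owns a house

# pickle can serialize the dictionary; bytes are used here instead of a file.
restored = pickle.loads(pickle.dumps(tree))
```

With this tree, only the 有自己的房子 and 有工作 columns influence the result; the other two features never appear as keys.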


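"""
Desc:
    Compute the empirical (Shannon) entropy of a data set
        Ent(D) = -SUM_k(p_k * log2(p_k))
    where p_k is the fraction of rows whose class label is k

Parameters:
    dataSet - the data set; the last element of each row is the class label

Returns:
    shannonEnt - the empirical (Shannon) entropy
"""
def calcShannonEnt(dataSet):
    # Number of rows in the data set
    numEntries = len(dataSet)
    # Dictionary mapping each class label to its number of occurrences
    labelCounts = {}
    # Count the label of every feature vector
    for featVec in dataSet:
        # The class label is the last element of the row
        currentLabel = featVec[-1]
        # First time this label is seen: start its count at 0
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        # Increment the count for this label
        labelCounts[currentLabel] += 1
    # Accumulate the empirical (Shannon) entropy
    shannonEnt = 0.0
    for key in labelCounts:
        # Relative frequency of this label
        prob = float(labelCounts[key]) / numEntries
        # Apply the entropy formula
        shannonEnt -= prob * log(prob, 2)
    # Return the empirical (Shannon) entropy
    return shannonEnt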
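
As a quick check of the entropy formula, the loan data set above contains 9 'yes' rows and 6 'no' rows. The sketch below recomputes the entropy for that label distribution with `collections.Counter`; the function name `entropy` and the variable names are assumptions for illustration, not part of the original listing.

```python
from collections import Counter
from math import log2

def entropy(class_labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(class_labels)
    counts = Counter(class_labels)
    # -sum over classes of p_k * log2(p_k), with p_k = count_k / total
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Label distribution of the 15-row loan data set: 9 'yes' and 6 'no'.
loan_labels = ['yes'] * 9 + ['no'] * 6
h = entropy(loan_labels)
print(round(h, 3))  # 0.971
```

A perfectly balanced two-class set gives 1 bit and a single-class set gives 0, so a value of about 0.971 indicates the labels are close to evenly mixed.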

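"""
Desc:
    Split the data set on a given feature

Parameters:
    dataSet - the data set to split
    axis - index of the feature to split on
    value - the feature value that selected rows must have

Returns:
    retDataSet - the rows whose feature `axis` equals `value`, with that feature column removed
"""
def splitDataSet(dataSet, axis, value):
    # List that will hold the selected rows
    retDataSet = []
    # Walk through every row of the data set
    for featVec in dataSet:
        if featVec[axis] == value:
            # Reconstructed body: keep the matching row, minus the split column
            reducedFeatVec = featVec[:axis] + featVec[axis+1:]
            retDataSet.append(reducedFeatVec)
    return retDataSet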
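
To make the split step concrete, here is a minimal usage sketch on a three-row toy data set. `split_rows` is a hypothetical stand-in written as a list comprehension, and it assumes the split drops the column it tested, as is usual when building the tree recursively.

```python
def split_rows(rows, axis, value):
    """Return the rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

# Toy data: two binary features followed by the class label.
rows = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]

left = split_rows(rows, 0, 1)   # rows where feature 0 == 1
right = split_rows(rows, 0, 0)  # rows where feature 0 == 0
print(left)   # [[1, 'yes'], [0, 'no']]
print(right)  # [[1, 'no']]
```

Because slicing builds new lists, the original rows are left unmodified, which matters when the same data set is split repeatedly on different features.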
