Decision Trees

License: CC 4.0 BY-SA

Tags: #decision-tree #machine-learning #DecisionTree


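This article explains how the decision tree algorithm works, including the concept of information gain and its role in building a tree. A worked example walks through the whole process: splitting the data set, building the tree recursively, plotting it, and using it for prediction.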


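How it works:
To split a data set, we first need to find which feature of the current data set is most decisive for separating the classes. To find it, every feature must be evaluated. Once the best feature has been chosen, the original data set is divided into several subsets.
These subsets are distributed over the branches of the first decision node. If all the data on a branch belongs to the same class, that branch is correctly classified and needs no further splitting. If the data in a subset belongs to more than one class, the splitting process is repeated on that subset.
A subset is split in exactly the same way as the original data set, and the process continues until every subset contains data of a single class.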

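Pros: low computational cost, results that are easy to interpret, tolerance of missing values, and the ability to handle irrelevant features
Cons: prone to overfitting
Works with: numeric and nominal values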

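General workflow for decision trees

Collect data: any method can be used
Prepare data: the tree-building algorithm here works only on nominal values, so numeric values must be discretized
Analyze data: any method can be used; once the tree is built, inspect the plot to check that it matches expectations
Train the algorithm: build the tree data structure
Test the algorithm: compute the error rate using the learned tree
Use the algorithm: this step applies to any supervised learning algorithm; a decision tree additionally helps you understand the underlying meaning of the data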
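
The "prepare data" step requires numeric values to be discretized. Below is a minimal sketch of one possible approach, equal-width binning. The function name `discretize`, the bin count, and the `heights` values are made up for illustration; other binning schemes would work equally well.

```python
def discretize(values, n_bins):
    """Map numeric values to integer bin indices 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]      # all values identical: a single bin
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]   # made-up numeric feature
print(discretize(heights, 2))                   # [0, 0, 1, 1, 1]
```

The resulting bin indices can then be used as nominal feature values.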

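Information gain

The guiding principle for splitting a data set: make disordered data more ordered.

The change in information before and after a split is called the information gain.

Compute the information gain obtained by splitting the data set on each feature; the feature with the highest information gain is the best choice.

Information gain, computed from entropy, therefore tells us which split is best.

Entropy is defined as the expected value of the information. If the items to be classified can fall into several classes, the information of symbol $x_{i}$ is defined as $l(x_{i})=-\log_{2}p(x_{i})$, where $p(x_{i})$ is the probability of that class.
To compute the entropy, take the expected value of the information over all possible classes: $H=E[l(x_{i})]=-\sum_{i=1}^{n}p(x_{i})\cdot \log_{2}p(x_{i})$, where $n$ is the number of classes.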
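
As a quick check of the formula, consider two illustrative label distributions: four samples split evenly between two classes, and four samples split 1/1/2 among three classes.

```latex
H_{2} = -\left(\tfrac{2}{4}\log_{2}\tfrac{2}{4} + \tfrac{2}{4}\log_{2}\tfrac{2}{4}\right) = 0.5 + 0.5 = 1
\qquad
H_{3} = -\left(\tfrac{1}{4}\log_{2}\tfrac{1}{4} + \tfrac{1}{4}\log_{2}\tfrac{1}{4} + \tfrac{2}{4}\log_{2}\tfrac{2}{4}\right) = 0.5 + 0.5 + 0.5 = 1.5
```

Adding a third class to the mix raises the entropy from 1 to 1.5 bits.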

import numpy as np
import matplotlib.pyplot as plt
import operator
# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline

# Compute the Shannon entropy of a data set
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)  # size of the data set
    labelCounts = {}   # number of samples for each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0   # initialize the Shannon entropy
    for key in labelCounts:
        prob = labelCounts[key] / numEntries  # relative frequency of the class
        shannonEnt -= prob * log(prob,2)  # accumulate according to the entropy formula
    return shannonEnt

def createDataSet():
    dataSet = [[1,1,'yes'],
              [1,1,'yes'],
              [1,0,'no'],
              [0,1,'no']]
    labels= ['no surfacing','flippers']
    return dataSet,labels

myDat, labels = createDataSet()
print(myDat)
len(myDat[0])

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]

3

calcShannonEnt(myDat)

1.0

# The higher the entropy, the more mixed the data
myDat[0][-1]='maybe'
print(myDat)
calcShannonEnt(myDat)  # adding a third class, 'maybe', increases the entropy

[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]

1.5
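
To see how entropy grows as the classes become more mixed, here is a small sketch that works directly on probabilities rather than on a data set. The helper `entropy` is a hypothetical stand-alone function, not part of the article's code. For a uniform distribution over k classes the entropy equals log2(k), and a skewed distribution scores lower than a uniform one with the same number of classes.

```python
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform distributions: entropy equals log2(k).
for k in (2, 4, 8):
    print(k, entropy([1 / k] * k), log2(k))

# A skewed two-class distribution is less mixed, so its entropy is below 1 bit.
print(entropy([0.9, 0.1]))
```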

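Splitting the data set

How should the data set be split?
How is information gain measured?

Following the principle of maximum information gain, choose the best way to split the data set.

The information gain $g(D, A)$ of feature $A$ on training set $D$ is defined as the difference between the empirical entropy $H(D)$ of $D$ and the empirical conditional entropy $H(D \mid A)$ of $D$ given feature $A$:
$g(D, A) = H(D) - H(D \mid A)$

Let $D$ be the training data set and $|D|$ its sample size, i.e. the number of samples. Suppose there are $K$ classes $C_k$, $k = 1, 2, \dots, K$, where $|C_k|$ is the number of samples in class $C_k$ and $\sum_{k=1}^K|C_k|=|D|$.
Suppose feature $A$ takes $n$ distinct values $\{a_1,a_2,\dots,a_n\}$. Splitting $D$ by the value of $A$ gives $n$ subsets $D_1,D_2,\dots,D_n$, where $|D_i|$ is the number of samples in $D_i$ and $\sum_{i=1}^n|D_i|=|D|$.
Let $D_{ik}$ be the set of samples in $D_i$ that belong to class $C_k$, i.e. $D_{ik}=D_i\cap C_k$, and let $|D_{ik}|$ be the number of samples in $D_{ik}$. The information gain is then computed as follows:
Input: training data set $D$ and feature $A$;
Output: the information gain $g(D, A)$ of feature $A$ on training data set $D$.
(1) Compute the empirical entropy $H(D)$ of data set $D$:
$H(D)=-\sum_{k=1}^K\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|}$
(2) Compute the empirical conditional entropy $H(D \mid A)$ of data set $D$ given feature $A$:
$H(D \mid A)=\sum_{i=1}^n\frac{|D_i|}{|D|}H(D_i)=-\sum_{i=1}^n\frac{|D_i|}{|D|}\sum_{k=1}^K\frac{|D_{ik}|}{|D_i|}\log_2\frac{|D_{ik}|}{|D_i|}$
(3) Compute the information gain:
$g(D, A) = H(D) - H(D \mid A)$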
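
The three steps can be sketched compactly as follows. This is an illustrative stand-alone version, with hypothetical helper names `entropy` and `info_gain`, applied to the four-row sample data set used in this article (the class label is the last column).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Step (1): empirical entropy H(D) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, feature_index):
    """Steps (2) and (3): g(D, A) = H(D) - H(D | A)."""
    base = entropy([row[-1] for row in rows])
    # Group the class labels by the value of the chosen feature (the subsets D_i).
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - conditional

rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]
for i in range(2):
    print(i, round(info_gain(rows, i), 4))
```

On this four-row set both features give the same gain (about 0.3113), so the choice between them comes down to how ties are broken.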

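# Split the data set on a given feature
def splitDataSet(dataSet,axis,value):
    """
    Split the data set on a given feature.
    Parameters:
        dataSet -- the data set to split
        axis -- column index of the feature to split on
        value -- the feature value to select
    Returns:
        retDataSet -- rows whose feature equals value, with that feature column removed
    """
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # keep the columns before the split feature
            reducedFeatVec = featVec[:axis]
            # append the columns after it, so the split feature itself is dropped
            reducedFeatVec.extend(featVec[axis+1:])
            # add the reduced row to the result
            retDataSet.append(reducedFeatVec)
    return retDataSet

myDat, labels = createDataSet()  # reset the data set after the 'maybe' experiment
print(splitDataSet(myDat,0,1))  # rows whose first feature is 1
print(splitDataSet(myDat,0,0))  # rows whose first feature is 0; together the two calls partition the data set by the first feature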

[[1, 'yes'], [1, 'yes'], [0, 'no']]
[[1, 'no']]

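# Choose the best feature to split on
def chooseBestFeatureToSplit(dataSet):
    """
    Choose the feature whose split gives the largest information gain.
    Parameters:
        dataSet -- the data set
    Returns:
        bestFeature -- column index of the best feature to split on
    """
    # number of features; -1 excludes the class label column
    numFeatures = len(dataSet[0]) - 1
    # entropy of the whole data set
    baseEntropy = calcShannonEnt(dataSet)
    # best information gain and best feature found so far
    bestInfoGain = 0.0
    bestFeature = -1
    # loop over every feature
    for i in range(numFeatures):
        # values of feature i across all samples
        featList = [example[i] for example in dataSet]
        # distinct values of feature i
        uniqueVals = set(featList)
        # entropy after splitting on feature i
        newEntropy = 0.0
        # loop over the distinct values
        for value in uniqueVals:
            # subset where feature i equals value
            subDataSet = splitDataSet(dataSet, i, value)
            # weight of the subset; np.float was removed in NumPy 1.24, so use the built-in float
            prob = len(subDataSet) / float(len(dataSet))
            # entropy after the split is the weighted sum of the subset entropies
            newEntropy += prob * calcShannonEnt(subDataSet)
        # information gain of feature i
        infoGain = baseEntropy - newEntropy
        # keep feature i if its gain is strictly larger than the best so far
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # column index of the best feature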

myDat, labels = createDataSet()
chooseBestFeatureToSplit(myDat)
myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]

Building the decision tree recursively

def majorityCnt(classList):
    """
    Return the class label that occurs most often.
    Parameters:
        classList -- list of class labels
    Returns:
        sortedClassCount[0][0] -- the most frequent class label
    """
    classCount={}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # dict.iteritems() exists only in Python 2; use items() in Python 3
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

# Build the tree
def createTree(dataSet,labels):
    """
    Recursively build a decision tree.
    Parameters:
        dataSet -- the data set
        labels -- list of feature names (modified in place: the chosen feature name is deleted)
    Returns:
        myTree -- the decision tree, as nested dicts
    """
    classList = [example[-1] for example in dataSet] # class label column
    print('classList:',classList)
    if classList.count(classList[0]) == len(classList):  # all samples share one class: stop and return it
        return classList[0]
    if len(dataSet[0]) == 1:  # no features left: stop and return the majority class
        return majorityCnt(classList)

    bestFeat = chooseBestFeatureToSplit(dataSet)   # column index of the best feature
    bestFeatLabel = labels[bestFeat]  # name of the best feature
#     print('bestFeatLabel:',bestFeatLabel)
    myTree = {bestFeatLabel:{}}  # the tree is a nested dict
    del(labels[bestFeat])  # remove the chosen feature name; note this changes the caller's list
#     print('del labels:',labels)
    featValues = [example[bestFeat] for example in dataSet]   # values of the best feature
#     print('featValues:',featValues)
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]   # copy the feature names so recursive calls do not interfere with each other
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)  # build the subtree for this feature value
    return myTree

myDat,labels=createDataSet()
myTree = createTree(myDat,labels)
myTree

classList: ['yes', 'yes', 'no', 'no']
classList: ['no']
classList: ['yes', 'yes', 'no']
classList: ['no']
classList: ['yes', 'yes']


{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
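
The nested-dict result can be hard to read at a glance. As a rough illustration (the helper `tree_to_rules` is hypothetical and not part of the article's code), each path from the root to a leaf can be flattened into one readable rule:

```python
def tree_to_rules(tree, path=()):
    """Flatten a nested-dict decision tree into a list of readable rules."""
    if not isinstance(tree, dict):
        # A leaf: the accumulated conditions lead to this class label.
        return [" and ".join(path) + " -> " + str(tree)]
    feature = next(iter(tree))          # the single key is the feature name
    rules = []
    for value, subtree in tree[feature].items():
        rules.extend(tree_to_rules(subtree, path + (f"{feature}={value}",)))
    return rules

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
for rule in tree_to_rules(tree):
    print(rule)
# no surfacing=0 -> no
# no surfacing=1 and flippers=0 -> no
# no surfacing=1 and flippers=1 -> yes
```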

Plotting the tree with Matplotlib annotations

Matplotlib annotations

import matplotlib.pyplot as plt

# Box and arrow styles used to draw tree nodes with text annotations
decisionNode = dict(boxstyle = "sawtooth", fc = "0.8")
leafNode = dict(boxstyle = "round4", fc = "0.8")
arrow_args = dict(arrowstyle = "<-")

def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    """
    Draw a node with an arrow coming from its parent.
    Parameters:
        nodeTxt -- text shown in the node
        centerPt -- position of the node
        parentPt -- position of the parent node
        nodeType -- box style of the node
    Returns:
        None
    """
    createPlot.ax1.annotate(nodeTxt, xy = parentPt, \
        xycoords = 'axes fraction', xytext = centerPt, \
        textcoords = 'axes fraction', va = "center", \
        ha = "center", bbox = nodeType, arrowprops = arrow_args)


def createPlot():
    # draw a demo figure with one decision node and one leaf node
    fig = plt.figure(1, facecolor = 'white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon = False)
    plotNode('decisionNode', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('leafNode', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

createPlot()

[Figure: output of createPlot(), showing a decision node and a leaf node drawn with annotation arrows]

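Constructing the annotated tree

How many leaf nodes are there, so that the width along the x axis can be determined?
How many levels does the tree have, so that the height along the y axis can be determined?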

# Get the number of leaf nodes and the depth of the tree
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = list(myTree.keys())[0]  # name of the root node
    secondDict = myTree[firstStr]  # branches under the root node
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ =='dict':
            # a subtree: add the leaves it contains
            numLeafs += getNumLeafs(secondDict[key])
        else:
            # a leaf node
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in list(secondDict.keys()):
        if type(secondDict[key]).__name__ == 'dict':
            # depth = 1 + depth of the subtree
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def retrieveTree(i):
    # predefined trees for testing the plotting code
    listOfTrees = [{'no surfacing':{0:'no',1:{'flippers':\
                                             {0:'no',1:'yes'}}}},
                  {'no surfacing':{0:'no',1:{'flippers':
                                            {0:{'head':{0:'no',1:'yes'}},1:'no'}}}}]
    return listOfTrees[i]

print("retrieveTree(1):",retrieveTree(1))
myTree = retrieveTree(0)
print("Number of leaves:",getNumLeafs(myTree))
print("Depth:",getTreeDepth(myTree))

retrieveTree(1): {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
Number of leaves: 3
Depth: 2

def plotMidText(cntrPt, parentPt, txtString):
    """
    Draw a text label halfway between a node and its parent.
    Parameters:
        cntrPt -- position of the node
        parentPt -- position of the parent node
        txtString -- text to draw
    Returns:
        None
    """
    # midpoint between the node and its parent
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    # draw the label
    createPlot.ax1.text(xMid, yMid, txtString)

def plotTree(myTree, parentPt, nodeTxt):
    """
    Draw the decision tree recursively.
    Parameters:
        myTree -- the decision tree
        parentPt -- position of the parent node
        nodeTxt -- text for the branch leading to this node
    Returns:
        None
    """
    # number of leaves and depth of this (sub)tree
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    # center this node above its leaves; np.float was removed in NumPy 1.24, so use the built-in float
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 \
             / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    # move down one level before drawing the children
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in list(secondDict.keys()):
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            # a leaf: advance x by one leaf slot and draw it
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff),\
                    cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    # move back up one level after drawing the children
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlotTree(inTree):
    """
    Plot a whole decision tree.
    Parameters:
        inTree -- the decision tree
    Returns:
        None
    """
    fig = plt.figure(1, facecolor = 'white')
    fig.clf()
    axprops = dict(xticks = [], yticks = [])
    createPlot.ax1 = plt.subplot(111, frameon = False, **axprops)
    # total width (number of leaves) and total depth control the layout
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()

myTree = retrieveTree(0)
createPlotTree(myTree)

[Figure: the tree from retrieveTree(0) drawn by createPlotTree()]

myTree['no surfacing'][3] = 'maybe'
createPlotTree(myTree)

[Figure: the same tree after adding the branch 3: 'maybe' under 'no surfacing']
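
The layout rule behind plotTree can be checked without drawing anything. The sketch below uses hypothetical helpers (`count_leaves`, `depth`, `leaf_positions`) to reproduce only the leaf coordinates: with W leaves and depth D, the i-th leaf from the left sits at x = (i + 0.5) / W, and each level lowers y by 1 / D, in axes-fraction coordinates.

```python
def count_leaves(tree):
    """Number of leaves in a nested-dict tree."""
    if not isinstance(tree, dict):
        return 1
    feature = next(iter(tree))
    return sum(count_leaves(sub) for sub in tree[feature].values())

def depth(tree):
    """Number of decision levels in a nested-dict tree."""
    if not isinstance(tree, dict):
        return 0
    feature = next(iter(tree))
    return 1 + max(depth(sub) for sub in tree[feature].values())

def leaf_positions(tree):
    """Return (label, x, y) for each leaf, in axes-fraction coordinates."""
    total_w, total_d = count_leaves(tree), depth(tree)
    out = []
    def walk(node, level):
        if not isinstance(node, dict):
            x = (len(out) + 0.5) / total_w   # leaves are spaced evenly, left to right
            y = 1.0 - level / total_d        # each level drops by 1 / depth
            out.append((node, x, y))
            return
        feature = next(iter(node))
        for sub in node[feature].values():
            walk(sub, level + 1)
    walk(tree, 0)
    return out

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
for label, x, y in leaf_positions(tree):
    print(label, round(x, 3), y)
```

For this tree the three leaves land at x = 1/6, 1/2 and 5/6, with the first leaf one level below the root and the other two at the bottom.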

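Testing and storing the classifier

Testing the algorithm: classifying with the decision tree

def classify(inputTree,featLabels,testVec):
    """
    Classify a sample with the decision tree (recursive).
    Parameters:
        inputTree -- the decision tree, as nested dicts
        featLabels -- list of feature names
        testVec -- the sample to classify
    Returns:
        classLabel -- the predicted class, or None if the sample has a feature value the tree has not seen
    """
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # index of the feature named firstStr
    classLabel = None  # stays None if no branch matches the sample
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                classLabel = classify(secondDict[key],featLabels,testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

myDat, labels = createDataSet()
labels
myTree = retrieveTree(0)
myTree
print('Label for test vector [1,0]:',classify(myTree,labels,[1,0]))

Label for test vector [1,0]: no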
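
The workflow's "test the algorithm" step asks for an error rate. A minimal sketch follows, with hypothetical helper names (`classify_row`, `error_rate`) and the four-row sample data set reused as test data. Measuring on the training rows themselves gives an optimistic figure; a held-out test set would be more informative.

```python
def classify_row(tree, feat_labels, row):
    """Walk a nested-dict tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        value = row[feat_labels.index(feature)]
        tree = tree[feature].get(value)   # None if the value was never seen
    return tree

def error_rate(tree, feat_labels, rows):
    """Fraction of rows whose predicted label differs from the last column."""
    wrong = sum(classify_row(tree, feat_labels, r) != r[-1] for r in rows)
    return wrong / len(rows)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feat_labels = ['no surfacing', 'flippers']
rows = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]
print(error_rate(tree, feat_labels, rows))   # 0.0
```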

Storing the decision tree

def storeTree(inputTree,filename):
    import pickle
    # pickle the dict itself (not str(inputTree)) so that grabTree returns a dict
    with open(filename,'wb') as fw:
        pickle.dump(inputTree,fw)

def grabTree(filename):
    import pickle
    with open(filename,'rb') as fr:
        return pickle.load(fr)

storeTree(myTree,'classifierStorage.txt')
grabTree('classifierStorage.txt')

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

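Using a decision tree to predict contact lens type

Steps:

Collect data: a data file is provided at http://archive.ics.uci.edu/ml/machine-learning-databases/lenses/
Prepare data: parse the tab-delimited lines
Analyze data: quickly inspect the data to make sure it was parsed correctly, and plot the final tree with createPlotTree()
Train the algorithm: use the createTree() function
Test the algorithm: write a test function to verify that the tree correctly classifies given instances
Use the algorithm: store the tree data structure so it does not have to be rebuilt next time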

# Read the tab-delimited data file; the with block closes it automatically
with open("lenses.txt") as fr:
    lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']

# Build the decision tree
lensesTree = createTree(lenses, lensesLabels)
print("lensesTree = {}".format(lensesTree))

classList: ['no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'hard', 'no lenses', 'soft', 'no lenses', 'no lenses']
classList: ['soft', 'hard', 'soft', 'hard', 'soft', 'hard', 'soft', 'no lenses', 'no lenses', 'hard', 'soft', 'no lenses']
classList: ['soft', 'soft', 'soft', 'soft', 'no lenses', 'soft']
classList: ['no lenses', 'soft']
classList: ['soft']
classList: ['no lenses']
classList: ['soft', 'soft']
classList: ['soft', 'soft']
classList: ['hard', 'hard', 'hard', 'no lenses', 'hard', 'no lenses']
classList: ['hard', 'no lenses', 'no lenses']
classList: ['no lenses']
classList: ['no lenses']
classList: ['hard']
classList: ['hard', 'hard', 'hard']
classList: ['no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses', 'no lenses']
lensesTree = {'tearRate': {'normal': {'astigmatic': {'no': {'age': {'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}}, 'pre': 'soft', 'young': 'soft'}}, 'yes': {'prescript': {'hyper': {'age': {'presbyopic': 'no lenses', 'pre': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}}}, 'reduced': 'no lenses'}}
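
To illustrate how the printed tree is used, the sketch below copies the lensesTree output above into a literal and follows its branches for two patient records. The `predict` helper and both records are hypothetical, made up for this example; they are not rows taken from lenses.txt.

```python
lenses_tree = {'tearRate': {'normal': {'astigmatic': {
    'no': {'age': {'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}},
                   'pre': 'soft', 'young': 'soft'}},
    'yes': {'prescript': {'hyper': {'age': {'presbyopic': 'no lenses',
                                            'pre': 'no lenses', 'young': 'hard'}},
                          'myope': 'hard'}}}},
    'reduced': 'no lenses'}}

def predict(tree, sample):
    """Follow the branches matching `sample` (a dict of feature name -> value)."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][sample[feature]]
    return tree

# Hypothetical patient records, made up for illustration.
patient_a = {'age': 'young', 'prescript': 'myope', 'astigmatic': 'no', 'tearRate': 'normal'}
patient_b = {'age': 'pre', 'prescript': 'hyper', 'astigmatic': 'yes', 'tearRate': 'reduced'}
print(predict(lenses_tree, patient_a))   # soft
print(predict(lenses_tree, patient_b))   # no lenses
```

The tear production rate is tested first: any record with a reduced rate is classified as 'no lenses' without looking at the other features.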