"Machine Learning in Action" Reading Notes, Part 2: Decision Trees (ID3)

Latest recommended article published on 2019-08-15 11:25:04

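This post shows how to use information gain to choose the best feature for splitting a dataset and how to build a decision tree from those splits. It also covers how to classify new samples with the resulting tree and how to persist the tree to disk.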

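1. When building a decision tree for a dataset, how should the data be split? This is where information gain comes in: it is the change in information (entropy) before and after a split. The feature whose split yields the largest information gain is the best feature to split on.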

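Shannon entropy is defined as the expected value of the information: H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i), where p(x_i) is the fraction of samples that belong to class x_i.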

The Python code for computing the entropy is as follows:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
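
As a quick sanity check, the entropy can be computed on a tiny hand-made dataset. The sketch below is a compact variant built on `collections.Counter`, not the function above; the name `shannon_entropy` and the five-row toy dataset are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(data_set):
    """Entropy (in bits) of the class labels stored in the last column."""
    total = len(data_set)
    label_counts = Counter(row[-1] for row in data_set)
    return sum(-(n / total) * log2(n / total) for n in label_counts.values())


# Hypothetical toy data: two binary features followed by a class label.
toy_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

mixed = shannon_entropy(toy_data)                  # 2 'yes' vs 3 'no': about 0.971 bits
pure = shannon_entropy([[1, 'yes'], [0, 'yes']])   # a single class: 0 bits
print(mixed, pure)
```

The more evenly the classes are mixed, the higher the entropy; a subset containing only one class has entropy 0.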

2. Splitting the dataset

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:  # keep only the rows that match the value
            reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

The parameter axis is the index of the feature to split on, and value is the feature value that a row must have to be kept.
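
To make the effect of axis and value concrete, here is a minimal sketch of the same idea written as a list comprehension. The name `split_rows` and the toy dataset are assumptions for illustration, not part of the original notes.

```python
def split_rows(data_set, axis, value):
    """Keep rows whose feature `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]


# Hypothetical toy data: two binary features followed by a class label.
toy_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

left = split_rows(toy_data, 0, 1)
right = split_rows(toy_data, 0, 0)
print(left)   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(right)  # [[1, 'no'], [1, 'no']]
```

Slicing builds new lists, so the rows of the input dataset are left unchanged.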

3. Choosing the best feature to split on

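Loop over every feature and compute the information gain of splitting on it; the feature with the largest information gain is the best one to split on.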
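
Written as a formula in the usual ID3 notation (the symbols D, A, and D_v are introduced here and do not appear in the original notes), the information gain of splitting dataset D on feature A is the entropy before the split minus the size-weighted entropy of the resulting subsets:

```latex
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \, H(D_v)
```

Here D_v is the subset of D in which feature A takes the value v. In chooseBestFeatureToSplit, H(D) corresponds to baseEntropy and the weighted sum to newEntropy.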

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column holds the class labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy of the subsets
        infoGain = baseEntropy - newEntropy  # information gain, i.e. the reduction in entropy
        if infoGain > bestInfoGain:  # keep the feature with the largest gain so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # index of the best feature, or -1 if no split reduces the entropy
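
A small worked example may help here. The sketch below computes the information gain of each feature on a toy dataset; the helper names `entropy` and `info_gain` and the data itself are assumptions for illustration, and the code is a simplified stand-in for the functions above rather than a copy of them.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Entropy (in bits) of the class labels in the last column."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def info_gain(rows, axis):
    """Entropy before the split minus the weighted entropy after it."""
    after = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after


# Hypothetical toy data: two binary features followed by a class label.
toy_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

gains = [info_gain(toy_data, axis) for axis in range(len(toy_data[0]) - 1)]
best = max(range(len(gains)), key=gains.__getitem__)
print(gains)  # roughly [0.420, 0.171]
print(best)   # 0
```

Splitting on feature 0 leaves one pure subset and one mostly pure subset, so it removes more uncertainty than splitting on feature 1.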

4. Building the decision tree

The tree is built recursively. The recursion stops when (1) all class labels in the current subset are the same, in which case that label is returned, or (2) there are no more features left to split on, in which case the majority class is returned.

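import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:  # count the votes for each class
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # items() works in Python 2 and 3; iteritems() exists only in Python 2
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]  # the most frequent class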
    
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]  # note: this mutates the list passed in by the caller
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy labels so recursive calls don't modify each other's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
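
The recursion can be seen end to end in the following self-contained sketch. It is a condensed ID3 variant under stated assumptions, not the code above: all function names, the toy dataset, and the feature names are hypothetical, the best feature is simply the one with the maximum gain (so it never returns -1), and the label list is sliced rather than modified in place.

```python
from collections import Counter
from math import log2


def entropy(rows):
    total = len(rows)
    counts = Counter(r[-1] for r in rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def split_rows(rows, axis, value):
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]


def best_feature(rows):
    def gain(axis):
        values = set(r[axis] for r in rows)
        subsets = [split_rows(rows, axis, v) for v in values]
        after = sum(len(s) / len(rows) * entropy(s) for s in subsets)
        return entropy(rows) - after
    return max(range(len(rows[0]) - 1), key=gain)


def build_tree(rows, labels):
    classes = [r[-1] for r in rows]
    if len(set(classes)) == 1:      # stopping condition 1: all labels are equal
        return classes[0]
    if len(rows[0]) == 1:           # stopping condition 2: no features left
        return Counter(classes).most_common(1)[0][0]
    axis = best_feature(rows)
    remaining = labels[:axis] + labels[axis + 1:]  # new list; caller's labels untouched
    branches = {}
    for value in set(r[axis] for r in rows):
        branches[value] = build_tree(split_rows(rows, axis, value), remaining)
    return {labels[axis]: branches}


# Hypothetical toy data and feature names, for illustration only.
toy_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
feature_names = ['no surfacing', 'flippers']

tree = build_tree(toy_data, feature_names)
print(tree)
```

The result is a nested dict: each internal node is a dict keyed by a feature name, each branch is keyed by a feature value, and each leaf is a class label.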

5. Classifying with the built decision tree

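def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]  # root feature; keys() is not indexable in Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of this feature in the test vector
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]  # raises KeyError for a value not seen during training
    if isinstance(valueOfFeat, dict):  # internal node: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:  # leaf: this is the class label
        classLabel = valueOfFeat
    return classLabel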
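
The lookup can also be written as a loop over the nested dict. The sketch below assumes a hand-written tree in the same nested-dict layout; the name `classify_sample`, the `default` parameter for unseen feature values, and the feature names are assumptions added for illustration.

```python
def classify_sample(tree, feature_names, sample, default=None):
    """Walk the nested-dict tree until a leaf (a non-dict value) is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))       # each internal node has exactly one key
        branches = node[feature]
        value = sample[feature_names.index(feature)]
        if value not in branches:        # feature value never seen during training
            return default
        node = branches[value]
    return node


# Hypothetical tree and feature names, for illustration only.
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feature_names = ['no surfacing', 'flippers']

print(classify_sample(tree, feature_names, [1, 0]))  # no
print(classify_sample(tree, feature_names, [1, 1]))  # yes
print(classify_sample(tree, feature_names, [2, 1]))  # None
```

Returning a default for unseen values is one possible design choice; raising an error, as a plain dict lookup does, is equally reasonable when unknown values should not pass silently.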

6. Saving the built decision tree

import pickle

def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:  # pickle needs binary mode in Python 3
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
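
A round trip shows that the tree survives being written to disk and read back. This is a minimal sketch: the tree literal and the file name tree.pkl are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import os
import pickle
import tempfile

# Hypothetical tree in the nested-dict layout, for illustration only.
tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tree.pkl')  # hypothetical file name
    with open(path, 'wb') as fw:              # binary mode for writing
        pickle.dump(tree, fw)
    with open(path, 'rb') as fr:              # binary mode for reading
        restored = pickle.load(fr)

print(restored == tree)  # True
```

Since unpickling can execute arbitrary code, only load pickle files that come from a trusted source.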