A Python Implementation of the ID3 Algorithm

This article walks through the implementation of a classic decision tree algorithm, covering how to choose the best splitting feature, how to build the tree recursively, and how the final decision is produced. A worked example demonstrates the algorithm in practice.


This article continues from http://blog.youkuaiyun.com/xueyunf/article/details/9214727, and some of the functions it relies on are defined in http://blog.youkuaiyun.com/xueyunf/article/details/9212827. Because there is quite a lot to understand in this algorithm, I split the material into three posts; it took me three days to fully grasp this classic algorithm myself. Of course, sharper readers may pick it up much faster, in which case this article probably isn't for you and you can skip it without missing much.

Below is the code, with a few pointers for beginners:

def majorityCnt(classList):
    # Count how many times each class label appears
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by count in descending order and return the most frequent label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
 

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # Stop condition 1: all samples in this subset share the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop condition 2: only the class column is left, so return the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]   # copy so the recursive call does not modify this level's labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
The first function, majorityCnt, returns the class name that occurs most often in the list.
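As a quick sanity check (a made-up call, not in the original post):

majorityCnt(['yes', 'no', 'no'])   # returns 'no', since 'no' appears twice and 'yes' only once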

The second function, createTree, builds the decision tree; this is the most important part of the code I want to discuss today. Notice that it is a recursive function, so let me start with the conditions for exiting the recursion: we stop either when all samples in the current subset belong to the same class, or when every feature has already been used up. The first if statement handles the first case, and the second if handles the second.
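To make the two stopping conditions concrete, here is a small illustration (hypothetical inputs of my own, not from the original post):

# Case 1: every sample has the same class, so that class is returned immediately
createTree([[1, 'yes'], [0, 'yes']], ['flippers'])    # returns 'yes'
# Case 2: only the class column remains (all features used up), so the majority class wins
createTree([['yes'], ['no'], ['no']], [])             # returns 'no'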

For every other case, we use the best-split selection from the previous post to choose a feature, insert that feature's label as a node in the tree, delete the label from the label list, and then recursively call the function on each split of the remaining data and labels to build the subtrees. In this way a complete decision tree is assembled.
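As a worked calculation for the sample data defined later in the full listing (my own numbers, not part of the original post), this is why the first split is made on feature 0, 'no surfacing':

base entropy of the 5 samples (2 'yes', 3 'no'): -(2/5)*log2(2/5) - (3/5)*log2(3/5) ≈ 0.971
split on 'no surfacing': value 0 gives 2 samples, all 'no' (entropy 0); value 1 gives 3 samples, 2 'yes' / 1 'no' (entropy ≈ 0.918); weighted entropy ≈ (3/5)*0.918 ≈ 0.551; information gain ≈ 0.420
split on 'flippers': value 0 gives 1 sample, 'no' (entropy 0); value 1 gives 4 samples, 2 'yes' / 2 'no' (entropy 1); weighted entropy ≈ 0.8; information gain ≈ 0.171

So chooseBestFeatureToSplit picks 'no surfacing' first, and the recursion continues on the two resulting subsets.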

Below is a screenshot of the program running (as they say, pics or it didn't happen; by the way, the Python IDE I use is Eric5, which I can also recommend):
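If you run the script yourself, the driver code at the bottom of the full listing should print roughly the following (in addition to the debugging prints of featList and uniqueVals inside chooseBestFeatureToSplit):

0.9709505944546686
[[1, 'yes'], [1, 'yes'], [0, 'no'], [0, 'no']]
0
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}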

Finally, here is the complete code from all three posts:

import math 
import operator

def calcShannonEnt(dataset):
    numEntries = len(dataset)
    labelCounts = {}
    for featVec in dataset:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] +=1
        
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob*math.log(prob, 2)
    return shannonEnt
    
def CreateDataSet():
    dataset = [[1, 1, 'yes' ], 
               [1, 1, 'yes' ], 
               [1, 0, 'no'], 
               [0, 1, 'no'], 
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataset, labels

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numberFeatures = len(dataSet[0])-1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numberFeatures):
        featList = [example[i] for example in dataSet]
        print(featList)
        uniqueVals = set(featList)
        print(uniqueVals)
        newEntropy =0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if(infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
 

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0])==len(classList):
        return classList[0]
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

        
        
myDat,labels = CreateDataSet()
print(calcShannonEnt(myDat))

print(splitDataSet(myDat, 1, 1))

print(chooseBestFeatureToSplit(myDat))

print(createTree(myDat, labels))
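
The listing above stops once the tree has been built. To actually use the tree for prediction you still have to walk the nested dictionaries; a minimal sketch of such a classifier (my own addition, following the same conventions as the code above, not part of the original series) might look like this:

def classify(inputTree, featLabels, testVec):
    # The single top-level key is the feature label this node tests
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    subTree = secondDict[testVec[featIndex]]
    if isinstance(subTree, dict):
        # Internal node: keep descending
        return classify(subTree, featLabels, testVec)
    # Leaf node: the stored value is the class label
    return subTree

# Example usage; note that createTree deletes entries from the labels list it is given,
# so pass it a copy and keep the original for classification:
myDat, labels = CreateDataSet()
myTree = createTree(myDat, labels[:])
print(classify(myTree, labels, [1, 0]))   # expected: 'no'
print(classify(myTree, labels, [1, 1]))   # expected: 'yes'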


