Decision Trees: From Principles to Implementation

1. Introduction

Decision trees appear in virtually every introductory machine learning book. Their decision process closely mirrors how we reason day to day, so they are easy to understand, and since a fair amount of information theory is baked in, they also make a nice first application of it. Decision trees handle both regression and classification, but we usually focus on classification.

Personally, I think of a decision tree as extracting features from the data and ordering them from most to least significant. A familiar application: asking a series of questions to guess what you are thinking of.

Why is the first question always "male or female"? Why? Read on and you'll see.

 

2. Code

from math import log
import operator
import pickle

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    # change to discrete values
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # info gain = reduction in entropy
        if infoGain > bestInfoGain:  # compare this to the best gain so far
            bestInfoGain = infoGain  # if better than current best, set to best
            bestFeature = i
    return bestFeature  # returns an integer

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy labels so recursion doesn't mess up the caller's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))  # the feature tested at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel

def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:  # pickle requires binary mode in Python 3
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
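As a quick sketch of how the persistence helpers might be used (the filename here is just illustrative):

myDat, labels = createDataSet()
myTree = createTree(myDat, labels)          # note: createTree consumes labels
storeTree(myTree, 'classifierStorage.txt')
print(grabTree('classifierStorage.txt'))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}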

3. Algorithm Walkthrough

❤ Information gain

Given a dataset, this computes its Shannon entropy, which is the quantity we use to measure information gain.

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

Once we can compute the entropy, we simply split the dataset in whichever way yields the largest information gain.

e.g. Run it on the dataset below:

[[1, 1, 'yes'],
 [1, 1, 'yes'],
 [1, 0, 'no'],
 [0, 1, 'no'],
 [0, 1, 'no']]

labelCounts is a map:

currentLabel    labelCounts[currentLabel]    prob
yes             2                            0.4
no              3                            0.6

Information theory then gives the entropy: -0.4*log2(0.4) - 0.6*log2(0.6) ≈ 0.971
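A quick sanity check of that number, using only the standard library:

from math import log

probs = [2/5, 3/5]                        # P(yes), P(no) from the table above
ent = -sum(p * log(p, 2) for p in probs)
print(round(ent, 3))                      # 0.971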

❤ Splitting the dataset

  ※ Splitting on a given feature

  Inputs: the dataset, the index axis (0-based) of a feature, and a value of that feature.

  Output: the sub-dataset of rows whose feature equals that value, with the feature column removed.

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
e.g. With myDat as

[[1, 1, 'yes'],
 [1, 1, 'yes'],
 [1, 0, 'no'],
 [0, 1, 'no'],
 [0, 1, 'no']]

calling splitDataSet(myDat, 0, 1) returns

[[1, 'yes'], [1, 'yes'], [0, 'no']]
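For the complementary split on the same feature, a quick check:

print(splitDataSet(myDat, 0, 0))   # [[1, 'no'], [1, 'no']]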

  ※ Choosing the best split

  Input: the dataset.

  Output: the feature whose split yields the largest reduction in entropy, i.e. the largest information gain.

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # info gain = reduction in entropy
        if infoGain > bestInfoGain:  # compare this to the best gain so far
            bestInfoGain = infoGain  # if better than current best, set to best
            bestFeature = i
    return bestFeature  # returns an integer
e.g. With myDat as

[[1, 1, 'yes'],
 [1, 1, 'yes'],
 [1, 0, 'no'],
 [0, 1, 'no'],
 [0, 1, 'no']]

calling chooseBestFeatureToSplit(myDat) works as follows:

The first pass splits on the first feature, once for value 1 and once for value 0, and computes the weighted entropy of the resulting subsets.
The second pass does the same for the second feature.
......
The feature whose split gives the largest information gain (the largest entropy reduction) is selected.
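To make this concrete, here is a minimal sketch that reproduces the two information gains with the functions defined above (the expected numbers are easy to check by hand):

myDat, labels = createDataSet()
base = calcShannonEnt(myDat)                     # 0.971
for i in range(2):                               # the two features
    values = set(row[i] for row in myDat)
    newEnt = sum(len(sub) / len(myDat) * calcShannonEnt(sub)
                 for sub in (splitDataSet(myDat, i, v) for v in values))
    print(i, round(base - newEnt, 3))
# 0 0.42
# 1 0.171

Feature 0 has the larger gain, so chooseBestFeatureToSplit(myDat) returns 0: 'no surfacing' is asked first.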
  

❤ Recursively building the tree

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

 Implementation: given a list of class labels, count how often each label occurs and return the most frequent one.
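For example, a minimal check:

print(majorityCnt(['yes', 'no', 'no']))   # 'no'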

 

O(∩_∩)O~ Time to build the tree!

Two input parameters: the dataset and the list of feature labels.

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when there are no more features
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy labels so recursion doesn't mess up the caller's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
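One caveat: createTree deletes entries from the labels list it is given (the del(labels[bestFeat]) line), so if you still need the original list afterwards, pass in a copy:

myTree = createTree(myDat, labels[:])   # the caller's labels list survives intact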

O(∩_∩)O~~ Now we can use the tree to classify!

def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))  # the feature tested at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel

 Test it as follows (with the code saved as trees.py):

>>> import trees
>>> myDat,labels=trees.createDataSet()
>>> myTree=trees.createTree(myDat,labels)
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
>>> lab=trees.classify(myTree,['no surfacing','flippers'],[0,1])
>>> lab
'no'

 

Reposted from: https://www.cnblogs.com/xiaoyingying/p/7509367.html
