1. Diagram: classifying data with a decision tree
2. Diagram: how the decision tree algorithm works

Interpretation: the ultimate goal is for the computer to find the decision boundaries automatically.
3. Decision tree overfitting
Interpretation: the decision boundary gets carved into long, narrow strips.
4. Decision tree classification code
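The code screenshot is not reproduced here. A minimal sketch of what such a classifier looks like with scikit-learn (features_train, labels_train, features_test, and labels_test are assumed to come from a train/test split that is not shown in these notes):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a decision tree with default parameters and score it on held-out data.
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))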

5. Classification results for different values of the minimum split size

Interpretation: setting the minimum split size too small can lead to overfitting.
6. Performance gains come from tuning the various parameters

Interpretation: tuning min_samples_split to values such as 50, 30, 20, and 10 yields classification accuracies of 0.912, 0.916, and 0.924. Clearly, the choice of parameters has a large effect on the classifier's performance.
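A sketch of how such a parameter sweep can be run (the data variables are the same assumed placeholders as above; the accuracies you get will depend on the dataset):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Compare classification accuracy for several min_samples_split values.
for split in (50, 30, 20, 10):
    clf = DecisionTreeClassifier(min_samples_split=split)
    clf.fit(features_train, labels_train)
    acc = accuracy_score(labels_test, clf.predict(features_test))
    print("min_samples_split=%d -> accuracy=%.3f" % (split, acc))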
7. Decision tree metrics: entropy and purity
Caption of the figure above: formula of entropy
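The formula itself, where p_i is the fraction of samples at the node belonging to class i and the sum runs over the classes present at that node:

Entropy = -\sum_i p_i \log_2 p_i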
Some sources use other bases for the logarithm (for example, log base 10 or the natural log, with a base of approx. 2.72); those details can change the maximal value of entropy you can get. In our case, where there are 2 classes, the log base 2 formula we use has a maximal value of 1.
In practice, when you use a decision tree, you will rarely have to deal with the details of the log base. The important takeaway is that lower entropy points toward more organized data, and that a decision tree uses that as a way to classify events.
Note: the higher the purity, the lower the entropy.
Each term in the entropy formula corresponds to one class present in the sample.
Algorithm workflow
1. Compute the entropy of the starting node
The entropy can be computed with scipy:
Suppose a node has three samples, two of class 1 and one of class 2. Its entropy can then be computed with scipy as follows.
If you're scipy-literate, you can also turn a calculation like this into a two-liner:
import scipy.stats
# class counts [2, 1] -> entropy of about 0.918 bits
print(scipy.stats.entropy([2, 1], base=2))
2. Compute the information gain

Interpretation: the goal of the decision tree algorithm is to maximize information gain.
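Information gain is the entropy of the parent node minus the weighted average entropy of the children created by a split; the tree picks the split that maximizes it. A minimal sketch using the same scipy call as above (the class counts here are illustrative, not from the original notes):

import scipy.stats

def node_entropy(class_counts):
    # Entropy (in bits) of a node, given the count of samples in each class.
    return scipy.stats.entropy(class_counts, base=2)

# Hypothetical split: a parent with class counts [2, 2] splits into [2, 1] and [0, 1].
parent = [2, 2]
children = [[2, 1], [0, 1]]
total = sum(parent)
gain = node_entropy(parent) - sum(sum(c) / total * node_entropy(c) for c in children)
print(gain)  # about 0.31 bits for this split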
8. What is the difference between bias and variance?
Answer:
1. A high-bias machine learning algorithm practically ignores the data. It has almost no capacity to learn from it, which is why it is called biased.
A high bias machine learning algorithm is one that practically ignores the data. It has almost no capacity to learn anything, and that is why it is called a bias. So a biased car would be one that I can train, and no matter which way I train it, it doesn't do anything differently. Now, that's generally a bad idea in machine learning.
2. A high-variance machine learning algorithm is extremely sensitive to the data and can only replicate what it has already seen. Its problem is that it performs poorly in situations it has never encountered before, because it lacks the right bias to generalize to new things.
You could go to the other extreme and make a car that's extremely perceptive to the data, and it can only replicate stuff it's seen before. That is an extremely high variance algorithm. And the problem with that is, it'll react very poorly in situations it hasn't seen before, because it doesn't have the right bias to generalize to new stuff. So, in reality, what you want is something in the middle.
3. What you really need is an algorithm that strikes a balance between the two: the bias-variance trade-off.
9. Summary of the decision tree algorithm
Pros: decision trees are easy to use, and the resulting model is easy to visualize. Cons: decision trees are prone to overfitting, especially when the data contains a large number of features; an overly complex tree will overfit the data.
Choose the tree's parameters carefully and tune them to prevent overfitting: keep an eye on how the tree grows, and stop its growth at the appropriate point.
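One way to do that in scikit-learn is to cap the tree's growth through its constructor parameters (a sketch; the specific values here are illustrative):

from sklearn.tree import DecisionTreeClassifier

# max_depth caps how deep the tree may grow; min_samples_split and
# min_samples_leaf stop it from splitting nodes that are already small.
clf = DecisionTreeClassifier(max_depth=5, min_samples_split=40, min_samples_leaf=10)
clf.fit(features_train, labels_train)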
10. Experimental comparison of accuracy

Interpretation: the more features there are, the more complex the algorithm becomes, and the accuracy changes accordingly.
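A sketch of one way to observe this, by keeping only a percentage of the features before training (SelectPercentile is a standard scikit-learn selector; the percentile values, parameter settings, and data variables are assumptions):

from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Keep the top 1%, 10%, or 50% of features and compare the resulting accuracy.
for pct in (1, 10, 50):
    selector = SelectPercentile(f_classif, percentile=pct)
    X_train = selector.fit_transform(features_train, labels_train)
    X_test = selector.transform(features_test)
    clf = DecisionTreeClassifier(min_samples_split=40)
    clf.fit(X_train, labels_train)
    print(pct, accuracy_score(labels_test, clf.predict(X_test)))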

Interpretation: different classifier parameter settings also change the accuracy.