作者:chen_h
微信号 & QQ:862251340
微信公众号:coderpai
只有当人们可以清楚地阅读和理解其决策时,算法才是透明的。 尽管深度学习是当今机器学习的超级明星,但它是一种不透明的算法,我们不知道决策的原因。 在此,决策树算法仍然保持其受欢迎程度,因为它们可以产生透明的决策。 ID3使用信息增益,而C4.5使用增益比进行分割。 这里,CART是一种替代决策树构建算法。 它可以处理分类和回归任务。 此算法使用名为gini index的新度量标准为分类任务创建决策点。 我们将从头开始逐步提及CART决策树示例。
我们将在ID3中处理相同的数据集。 根据 Outlook,Temperature,humidity 和 wind 因子,有14个高尔夫球场决定。
| Day | Outlook | Temp. | Humidity | Wind | Decision |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
基尼指数
基尼指数 (gini index) 是 CART 中分类任务的度量标准。它存储每个类别的平方概率的总和,我们可以得到如下:
Gini=1–Σ(Pi)2fori=1tonumberofclassesGini = 1 – Σ (Pi)^2 for i=1 to number of classesGini=1–Σ(Pi)2fori=1tonumberofclasses
Outlook 因子
Outlook 是一个标签因子。它可以是 sunny,overcast 或者 rain。我们来总结一下 Outlook 的特征属性和决策。
| Outlook | Yes | No | Number of instances |
|---|---|---|---|
| Sunny | 2 | 3 | 5 |
| Overcast | 4 | 0 | 4 |
| Rain | 3 | 2 | 5 |
Gini(Outlook=Sunny)=1–(2/5)2–(3/5)2=1–0.16–0.36=0.48Gini(Outlook=Sunny) = 1 – (2/5)^2 – (3/5)^2 = 1 – 0.16 – 0.36 = 0.48Gini(Outlook=Sunny)=1–(2/5)2–(3/5)2=1–0.16–0.36=0.48
Gini(Outlook=Overcast)=1–(4/4)2–(0/4)2=0Gini(Outlook=Overcast) = 1 – (4/4)^2 – (0/4)^2 = 0Gini(Outlook=Overcast)=1–(4/4)2–(0/4)2=0
Gini(Outlook=Rain)=1–(3/5)2–(2/5)2=1–0.36–0.16=0.48Gini(Outlook=Rain) = 1 – (3/5)^2 – (2/5)^2 = 1 – 0.36 – 0.16 = 0.48Gini(Outlook=Rain)=1–(3/5)2–(2/5)2=1–0.36–0.16=0.48
然后,我们将计算 Outlook 特征的基尼指数的加权和。
Gini(Outlook)=(5/14)x0.48+(4/14)x0+(5/14)x0.48=0.171+0+0.171=0.342Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342Gini(Outlook)=(5/14)x0.48+(4/14)x0+(5/14)x0.48=0.171+0+0.171=0.342
Temperature 因子
同样,Temperature 也是一个标签因子,它可能有3个不同的值:cool,hot 和 mild。让我们总结 Temperature 特征的决策。
| Temperature | Yes | No | Number of instances |
|---|---|---|---|
| Hot | 2 | 2 | 4 |
| Cool | 3 | 1 | 4 |
| Mild | 4 | 2 | 6 |
Gini(Temp=Hot)=1–(2/4)2–(2/4)2=0.5Gini(Temp=Hot) = 1 – (2/4)^2 – (2/4)^2 = 0.5Gini(Temp=Hot)=1–(2/4)2–(2/4)2=0.5
Gini(Temp=Cool)=1–(3/4)2–(1/4)2=1–0.5625–0.0625=0.375Gini(Temp=Cool) = 1 – (3/4)^2 – (1/4)^2 = 1 – 0.5625 – 0.0625 = 0.375Gini(Temp=Cool)=1–(3/4)2–(1/4)2=1–0.5625–0.0625=0.375
Gini(Temp=Mild)=1–(4/6)2–(2/6)2=1–0.444–0.111=0.445Gini(Temp=Mild) = 1 – (4/6)^2 – (2/6)^2 = 1 – 0.444 – 0.111 = 0.445Gini(Temp=Mild)=1–(4/6)2–(2/6)2=1–0.444–0.111=0.445
我们将计算 Temperature 特征的基尼指数的加权和。
Gini(Temp)=(4/14)x0.5+(4/14)x0.375+(6/14)x0.445=0.142+0.107+0.190=0.439Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439Gini(Temp)=(4/14)x0.5+(4/14)x0.375+(6/14)x0.445=0.142+0.107+0.190=0.439
Humidity 因子
humidity 因子是二元特征,它可以是 high 或者 normal。
| Humidity | Yes | No | Number of instances |
|---|---|---|---|
| High | 3 | 4 | 7 |
| Normal | 6 | 1 | 7 |
Gini(Humidity=High)=1–(3/7)2–(4/7)2=1–0.183–0.326=0.489Gini(Humidity=High) = 1 – (3/7)^2 – (4/7)^2 = 1 – 0.183 – 0.326 = 0.489Gini(Humidity=High)=1–(3/7)2–(4/7)2=1–0.183–0.326=0.489
Gini(Humidity=Normal)=1–(6/7)2–(1/7)2=1–0.734–0.02=0.244Gini(Humidity=Normal) = 1 – (6/7)^2 – (1/7)^2 = 1 – 0.734 – 0.02 = 0.244Gini(Humidity=Normal)=1–(6/7)2–(1/7)2=1–0.734–0.02=0.244
我们将计算 Humidity 特征的基尼指数的加权和。
Gini(Humidity)=(7/14)∗0.489+(7/14)∗0.244=0.367Gini(Humidity) = (7/14) * 0.489 + (7/14) * 0.244 = 0.367Gini(Humidity)=(7/14)∗0.489+(7/14)∗0.244=0.367
wind 因子
wind 也是一个二元标签,它可以是 weak 或者 strong。
| Wind | Yes | No | Number of instances |
|---|---|---|---|
| Weak | 6 | 2 | 8 |
| Strong | 3 | 3 | 6 |
Gini(Wind=Weak)=1–(6/8)2–(2/8)2=1–0.5625–0.062=0.375Gini(Wind=Weak) = 1 – (6/8)^2 – (2/8)^2 = 1 – 0.5625 – 0.062 = 0.375Gini(Wind=Weak)=1–(6/8)2–(2/8)2=1–0.5625–0.062=0.375
Gini(Wind=Strong)=1–(3/6)2–(3/6)2=1–0.25–0.25=0.5Gini(Wind=Strong) = 1 – (3/6)^2 – (3/6)^2 = 1 – 0.25 – 0.25 = 0.5Gini(Wind=Strong)=1–(3/6)2–(3/6)2=1–0.25–0.25=0.5
Gini(Wind)=(8/14)∗0.375+(6/14)∗0.5=0.428Gini(Wind) = (8/14) * 0.375 + (6/14) * 0.5 = 0.428Gini(Wind)=(8/14)∗0.375+(6/14)∗0.5=0.428
决策树构建
我们计算每个特征的基尼指数的值,根节点是 Outlook ,因为基尼指数最低。
| Feature | Gini index |
|---|---|
| Outlook | 0.342 |
| Temperature | 0.439 |
| Humidity | 0.367 |
| Wind | 0.428 |
我们可以得到决策树为:

你可能从图中也注意到了,如果是 overcast ,那么决策肯定是 yes,也就是说 overcast 叶子节点已经结束分裂了。

我们将在以下步骤中将上面相同的原则应用于这些子数据集。
我们来集中注意力来看看 Outlook = sunny 的情况,我们需要分别计算 Temperature,humidity 和 wind 因子的基尼指数得分。
| Day | Outlook | Temp. | Humidity | Wind | Decision |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
Outlook=sunny时,Temp 的基尼指数
| Temperature | Yes | No | Number of instances |
|---|---|---|---|
| Hot | 0 | 2 | 2 |
| Cool | 1 | 0 | 1 |
| Mild | 1 | 1 | 2 |
Gini(Outlook=SunnyandTemp.=Hot)=1–(0/2)2–(2/2)2=0Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)^2 – (2/2)^2 = 0Gini(Outlook=SunnyandTemp.=Hot)=1–(0/2)2–(2/2)2=0
Gini(Outlook=SunnyandTemp.=Cool)=1–(1/1)2–(0/1)2=0Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)^2 – (0/1)^2 = 0Gini(Outlook=SunnyandTemp.=Cool)=1–(1/1)2–(0/1)2=0
Gini(Outlook=SunnyandTemp.=Mild)=1–(1/2)2–(1/2)2=1–0.25–0.25=0.5Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)^2 – (1/2)^2 = 1 – 0.25 – 0.25 = 0.5Gini(Outlook=SunnyandTemp.=Mild)=1–(1/2)2–(1/2)2=1–0.25–0.25=0.5
Gini(Outlook=SunnyandTemp.)=(2/5)∗0+(1/5)∗0+(2/5)∗0.5=0.2Gini(Outlook=Sunny and Temp.) = (2/5)*0 + (1/5)*0 + (2/5)*0.5 = 0.2Gini(Outlook=SunnyandTemp.)=(2/5)∗0+(1/5)∗0+(2/5)∗0.5=0.2
outlook=sunny 时,humidity 的基尼指数
| Humidity | Yes | No | Number of instances |
|---|---|---|---|
| High | 0 | 3 | 3 |
| Normal | 2 | 0 | 2 |
Gini(Outlook=SunnyandHumidity=High)=1–(0/3)2–(3/3)2=0Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)^2 – (3/3)^2 = 0Gini(Outlook=SunnyandHumidity=High)=1–(0/3)2–(3/3)2=0
Gini(Outlook=SunnyandHumidity=Normal)=1–(2/2)2–(0/2)2=0Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)^2 – (0/2)^2 = 0Gini(Outlook=SunnyandHumidity=Normal)=1–(2/2)2–(0/2)2=0
Gini(Outlook=SunnyandHumidity)=(3/5)∗0+(2/5)∗0=0Gini(Outlook=Sunny and Humidity) = (3/5)*0 + (2/5)*0 = 0Gini(Outlook=SunnyandHumidity)=(3/5)∗0+(2/5)∗0=0
outlook = sunny 时,wind 的基尼指数
| Wind | Yes | No | Number of instances |
|---|---|---|---|
| Weak | 1 | 2 | 3 |
| Strong | 1 | 1 | 2 |
Gini(Outlook=SunnyandWind=Weak)=1–(1/3)2–(2/3)2=0.266Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)^2 – (2/3)^2 = 0.266Gini(Outlook=SunnyandWind=Weak)=1–(1/3)2–(2/3)2=0.266
Gini(Outlook=SunnyandWind=Strong)=1−(1/2)2–(1/2)2=0.2Gini(Outlook=Sunny and Wind=Strong) = 1- (1/2)^2 – (1/2)^2 = 0.2Gini(Outlook=SunnyandWind=Strong)=1−(1/2)2–(1/2)2=0.2
Gini(Outlook=SunnyandWind)=(3/5)∗0.266+(2/5)∗0.2=0.466Gini(Outlook=Sunny and Wind) = (3/5)*0.266 + (2/5)*0.2 = 0.466Gini(Outlook=SunnyandWind)=(3/5)∗0.266+(2/5)∗0.2=0.466
确定 outlook=sunny 时的分支
我们在上面计算了 outlook = sunny 时的各个特征的基尼指数,发现 humidity 的基尼指数是最低的,所以选择 humidity 作为节点。
| Feature | Gini index |
|---|---|
| Temperature | 0.2 |
| Humidity | 0 |
| Wind | 0.466 |
我们将 humidity 作为一个节点。

如图所示,对于 humidity = high 的情况,决策永远是 no。另一方面,对于 humidity=normal 的情况,决策永远是 yes,所以这个分支就结束了。

接下来,我们来看看 outlook=rain 的情况。
outlook=rain
| Day | Outlook | Temp. | Humidity | Wind | Decision |
|---|---|---|---|---|---|
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
当 outlook=rain 时,我们来计算 Temp, Humidity 和 wind 的基尼指数得分。
outlook=rain 时,temprature 的基尼指数
| Temperature | Yes | No | Number of instances |
|---|---|---|---|
| Cool | 1 | 1 | 2 |
| Mild | 2 | 1 | 3 |
Gini(Outlook=RainandTemp.=Cool)=1–(1/2)2–(1/2)2=0.5Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)^2 – (1/2)^2 = 0.5Gini(Outlook=RainandTemp.=Cool)=1–(1/2)2–(1/2)2=0.5
Gini(Outlook=RainandTemp.=Mild)=1–(2/3)2–(1/3)2=0.444Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)^2 – (1/3)^2 = 0.444Gini(Outlook=RainandTemp.=Mild)=1–(2/3)2–(1/3)2=0.444
Gini(Outlook=RainandTemp.)=(2/5)∗0.5+(3/5)∗0.444=0.466Gini(Outlook=Rain and Temp.) = (2/5)*0.5 + (3/5)*0.444 = 0.466Gini(Outlook=RainandTemp.)=(2/5)∗0.5+(3/5)∗0.444=0.466
outlook=rain 时,humidity 的基尼指数
| Humidity | Yes | No | Number of instances |
|---|---|---|---|
| High | 1 | 1 | 2 |
| Normal | 2 | 1 | 3 |
Gini(Outlook=RainandHumidity=High)=1–(1/2)2–(1/2)2=0.5Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)^2 – (1/2)^2 = 0.5Gini(Outlook=RainandHumidity=High)=1–(1/2)2–(1/2)2=0.5
Gini(Outlook=RainandHumidity=Normal)=1–(2/3)2–(1/3)2=0.444Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)^2 – (1/3)^2 = 0.444Gini(Outlook=RainandHumidity=Normal)=1–(2/3)2–(1/3)2=0.444
Gini(Outlook=RainandHumidity)=(2/5)∗0.5+(3/5)∗0.444=0.466Gini(Outlook=Rain and Humidity) = (2/5)*0.5 + (3/5)*0.444 = 0.466Gini(Outlook=RainandHumidity)=(2/5)∗0.5+(3/5)∗0.444=0.466
outlook=rain 时,wind 的基尼指数
| Wind | Yes | No | Number of instances |
|---|---|---|---|
| Weak | 3 | 0 | 3 |
| Strong | 0 | 2 | 2 |
Gini(Outlook=RainandWind=Weak)=1–(3/3)2–(0/3)2=0Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)^2 – (0/3)^2 = 0Gini(Outlook=RainandWind=Weak)=1–(3/3)2–(0/3)2=0
Gini(Outlook=RainandWind=Strong)=1–(0/2)2–(2/2)2=0Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)^2 – (2/2)^2 = 0Gini(Outlook=RainandWind=Strong)=1–(0/2)2–(2/2)2=0
Gini(Outlook=RainandWind)=(3/5)∗0+(2/5)∗0=0Gini(Outlook=Rain and Wind) = (3/5)*0 + (2/5)*0 = 0Gini(Outlook=RainandWind)=(3/5)∗0+(2/5)∗0=0
outlook=rain 时的决策过程
当 outlook=rain 时,我们发现 wind 特征的基尼指数是最小的,所以我们选择它作为我们的分节点。
| Feature | Gini index |
|---|---|
| Temperature | 0.466 |
| Humidity | 0.466 |
| Wind | 0 |
接下里,我们可以重新画我们的树结构了。

如图所示,当 wind=weak 时,决策一直是 yes。当 wind=strong 时,决策一直是 no。这就意味着决策到此结束了。

因此,决策树的构建已经结束了。我们手工制作了这颗决策树。顺便说一下,你可能会发现这个跟我们在 ID3 例子中创建了完全相同的树。这并不意味着 ID3 和 CART 算法始终生成的树是相同的。我们从这三篇教程中发现,CART 比 ID3 和 C4.5 都简单,不是吗???
10万+

被折叠的 条评论
为什么被折叠?



