Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai
Decision trees are still a hot topic in the data science community. ID3 is the most common traditional decision tree algorithm, but it has bottlenecks: attributes must be nominal, the dataset must not contain missing values, and the algorithm tends to overfit. Ross Quinlan, the inventor of ID3, made improvements for these bottlenecks and created a new algorithm named C4.5. The new algorithm can build more general models, handle continuous data, and cope with missing values. Some tools, such as Weka, call this algorithm J48; it is actually a reimplementation of C4.5 release 8.
We will build a decision tree for the following dataset, which records the factors behind each decision. If you have studied the ID3 algorithm, this dataset will look familiar. The difference is that the Temperature and Humidity columns contain continuous values instead of nominal ones.
Day | Outlook | Temp. | Humidity | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | 85 | Weak | No |
2 | Sunny | 80 | 90 | Strong | No |
3 | Overcast | 83 | 78 | Weak | Yes |
4 | Rain | 70 | 96 | Weak | Yes |
5 | Rain | 68 | 80 | Weak | Yes |
6 | Rain | 65 | 70 | Strong | No |
7 | Overcast | 64 | 65 | Strong | Yes |
8 | Sunny | 72 | 95 | Weak | No |
9 | Sunny | 69 | 70 | Weak | Yes |
10 | Rain | 75 | 80 | Weak | Yes |
11 | Sunny | 75 | 70 | Strong | Yes |
12 | Overcast | 72 | 90 | Strong | Yes |
13 | Overcast | 81 | 75 | Weak | Yes |
14 | Rain | 71 | 80 | Strong | No |
We will follow the same steps we took in the ID3 example. First, we need to compute the global entropy. The table above contains 14 instances: 9 of them have decision yes, and 5 have decision no.
$$Entropy(Decision) = \sum -p(x) \cdot \log_2 p(x) = -p(yes) \cdot \log_2 p(yes) - p(no) \cdot \log_2 p(no) = -\frac{9}{14} \cdot \log_2\left(\frac{9}{14}\right) - \frac{5}{14} \cdot \log_2\left(\frac{5}{14}\right) = 0.940$$
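To make the arithmetic concrete, here is a minimal Python sketch (not part of the original article) that reproduces this value:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 9 "yes" decisions and 5 "no" decisions, as in the table above
decision = ["Yes"] * 9 + ["No"] * 5
print(round(entropy(decision), 3))  # 0.94
```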
In the ID3 algorithm, we computed the information gain of each attribute. Here, we need to compute the gain ratio instead of the plain information gain.
$$GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}$$

$$SplitInfo(A) = -\sum \frac{|D_j|}{|D|} \cdot \log_2\left(\frac{|D_j|}{|D|}\right)$$
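These two formulas translate directly into code. Below is a small sketch, under the assumption that a split is represented as a list of label subsets, one per branch; `entropy` is the function from the previous snippet:

```python
import math  # already imported by the entropy() sketch above

def split_info(partition):
    """SplitInfo(A): entropy of the branch sizes themselves."""
    total = sum(len(subset) for subset in partition)
    return -sum(len(s) / total * math.log2(len(s) / total)
                for s in partition if s)

def gain(parent_labels, partition):
    """Information gain: parent entropy minus weighted child entropies."""
    total = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(s) / total * entropy(s) for s in partition)

def gain_ratio(parent_labels, partition):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain(parent_labels, partition) / split_info(partition)
```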
The Wind attribute
Wind is a nominal attribute with two possible values: weak and strong.
$$Gain(Decision, Wind) = Entropy(Decision) - \sum p(Decision|Wind) \cdot Entropy(Decision|Wind)$$

$$Gain(Decision, Wind) = Entropy(Decision) - p(Decision|Wind=Weak) \cdot Entropy(Decision|Wind=Weak) - p(Decision|Wind=Strong) \cdot Entropy(Decision|Wind=Strong)$$
There are 8 instances with weak wind; 2 of them have decision no and 6 have decision yes.
$$Entropy(Decision|Wind=Weak) = -p(No) \cdot \log_2 p(No) - p(Yes) \cdot \log_2 p(Yes) = -\frac{2}{8} \cdot \log_2\left(\frac{2}{8}\right) - \frac{6}{8} \cdot \log_2\left(\frac{6}{8}\right) = 0.811$$

$$Entropy(Decision|Wind=Strong) = -\frac{3}{6} \cdot \log_2\left(\frac{3}{6}\right) - \frac{3}{6} \cdot \log_2\left(\frac{3}{6}\right) = 1$$

$$Gain(Decision, Wind) = 0.940 - \frac{8}{14} \cdot 0.811 - \frac{6}{14} \cdot 1 = 0.940 - 0.463 - 0.428 = 0.049$$
The split puts 8 instances under wind = weak and 6 instances under wind = strong.
$$SplitInfo(Decision, Wind) = -\frac{8}{14} \cdot \log_2\left(\frac{8}{14}\right) - \frac{6}{14} \cdot \log_2\left(\frac{6}{14}\right) = 0.461 + 0.524 = 0.985$$

$$GainRatio(Decision, Wind) = \frac{Gain(Decision, Wind)}{SplitInfo(Decision, Wind)} = \frac{0.049}{0.985} = 0.049$$
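Using the helper sketches above, the Wind figures can be reproduced; the label lists are read straight off the table. Note that exact arithmetic gives a gain of about 0.048, which the rounded intermediate values above turn into 0.049:

```python
# Decisions grouped by Wind value
# (weak: days 1,3,4,5,8,9,10,13; strong: days 2,6,7,11,12,14)
weak   = ["No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]
strong = ["No", "No", "Yes", "Yes", "Yes", "No"]

all_days = weak + strong
print(round(gain(all_days, [weak, strong]), 3))        # 0.048 (article: 0.049)
print(round(split_info([weak, strong]), 3))            # 0.985
print(round(gain_ratio(all_days, [weak, strong]), 3))  # 0.049
```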
The Outlook attribute
Outlook is also a nominal attribute; its possible values are sunny, overcast, and rain.
$$Gain(Decision, Outlook) = Entropy(Decision) - \sum p(Decision|Outlook) \cdot Entropy(Decision|Outlook)$$

$$Gain(Decision, Outlook) = Entropy(Decision) - p(Decision|Outlook=Sunny) \cdot Entropy(Decision|Outlook=Sunny) - p(Decision|Outlook=Overcast) \cdot Entropy(Decision|Outlook=Overcast) - p(Decision|Outlook=Rain) \cdot Entropy(Decision|Outlook=Rain)$$
There are 5 sunny instances; 3 of them have decision no and 2 have decision yes.
$$Entropy(Decision|Outlook=Sunny) = -p(No) \cdot \log_2 p(No) - p(Yes) \cdot \log_2 p(Yes) = -\frac{3}{5} \cdot \log_2\left(\frac{3}{5}\right) - \frac{2}{5} \cdot \log_2\left(\frac{2}{5}\right) = 0.441 + 0.528 = 0.970$$

$$Entropy(Decision|Outlook=Overcast) = -\frac{0}{4} \cdot \log_2\left(\frac{0}{4}\right) - \frac{4}{4} \cdot \log_2\left(\frac{4}{4}\right) = 0$$

Here the term for $p(No) = 0$ is taken to be 0, by the usual convention.

$$Entropy(Decision|Outlook=Rain) = -\frac{2}{5} \cdot \log_2\left(\frac{2}{5}\right) - \frac{3}{5} \cdot \log_2\left(\frac{3}{5}\right) = 0.528 + 0.441 = 0.970$$

$$Gain(Decision, Outlook) = 0.940 - \frac{5}{14} \cdot 0.970 - \frac{4}{14} \cdot 0 - \frac{5}{14} \cdot 0.970 = 0.246$$
There are 5 sunny instances, 4 overcast instances, and 5 rain instances.
$$SplitInfo(Decision, Outlook) = -\frac{5}{14} \cdot \log_2\left(\frac{5}{14}\right) - \frac{4}{14} \cdot \log_2\left(\frac{4}{14}\right) - \frac{5}{14} \cdot \log_2\left(\frac{5}{14}\right) = 1.577$$

$$GainRatio(Decision, Outlook) = \frac{Gain(Decision, Outlook)}{SplitInfo(Decision, Outlook)} = \frac{0.246}{1.577} = 0.155$$
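The same helpers reproduce the Outlook figures, again up to rounding:

```python
sunny    = ["No", "No", "No", "Yes", "Yes"]   # days 1, 2, 8, 9, 11
overcast = ["Yes", "Yes", "Yes", "Yes"]       # days 3, 7, 12, 13
rain     = ["Yes", "Yes", "No", "Yes", "No"]  # days 4, 5, 6, 10, 14

labels = sunny + overcast + rain
print(round(gain(labels, [sunny, overcast, rain]), 3))        # 0.247 (article: 0.246)
print(round(split_info([sunny, overcast, rain]), 3))          # 1.577
print(round(gain_ratio(labels, [sunny, overcast, rain]), 3))  # 0.156 (article: 0.155)
```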
Humidity 因子
Unlike the attributes above, humidity is continuous, so we need to convert its continuous values into nominal ones. The C4.5 algorithm proposes a binary split based on a threshold, and the threshold should be the value that yields the maximum gain for this attribute. Let's focus on humidity. First, we sort the instances by humidity, from smallest to largest.
Day | Humidity | Decision |
---|---|---|
7 | 65 | Yes |
6 | 70 | No |
9 | 70 | Yes |
11 | 70 | Yes |
13 | 75 | Yes |
3 | 78 | Yes |
5 | 80 | Yes |
10 | 80 | Yes |
14 | 80 | No |
1 | 85 | No |
2 | 90 | No |
12 | 90 | Yes |
8 | 95 | No |
4 | 96 | Yes |
Now we iterate over the humidity values, splitting the dataset into two parts at each step: the instances less than or equal to the current value, and the instances greater than it. We compute the gain (and the gain ratio) at every step; the value that maximizes the gain becomes the threshold.
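Here is a sketch of this scan, reusing the helpers defined earlier; the two columns are copied from the sorted table, and the printed values may differ from the article's in the last decimal because the article rounds intermediate results:

```python
# Humidity values and decisions, sorted as in the table above
humidity = [65, 70, 70, 70, 75, 78, 80, 80, 80, 85, 90, 90, 95, 96]
decision = ["Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",
            "No", "No", "No", "Yes", "No", "Yes"]

best_t, best_gain = None, -1.0
for t in sorted(set(humidity))[:-1]:  # splitting at the maximum separates nothing
    left  = [d for h, d in zip(humidity, decision) if h <= t]
    right = [d for h, d in zip(humidity, decision) if h > t]
    g = gain(decision, [left, right])
    print(t, round(g, 3), round(gain_ratio(decision, [left, right]), 3))
    if g > best_gain:
        best_t, best_gain = t, g

print(best_t)  # 80 -- the gain-maximizing threshold
```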
Let's first try 65 as the humidity threshold and compute the gain ratio:
$$Entropy(Decision|Humidity \le 65) = -p(No) \cdot \log_2 p(No) - p(Yes) \cdot \log_2 p(Yes) = -\frac{0}{1} \cdot \log_2\left(\frac{0}{1}\right) - \frac{1}{1} \cdot \log_2\left(\frac{1}{1}\right) = 0$$

$$Entropy(Decision|Humidity > 65) = -\frac{5}{13} \cdot \log_2\left(\frac{5}{13}\right) - \frac{8}{13} \cdot \log_2\left(\frac{8}{13}\right) = 0.530 + 0.431 = 0.961$$

$$Gain(Decision, Humidity <> 65) = 0.940 - \frac{1}{14} \cdot 0 - \frac{13}{14} \cdot 0.961 = 0.048$$
The "<>" notation above refers to the two branches of the split, less than or equal to 65 and greater than 65; it does not mean humidity ≠ 65.
$$SplitInfo(Decision, Humidity <> 65) = -\frac{1}{14} \cdot \log_2\left(\frac{1}{14}\right) - \frac{13}{14} \cdot \log_2\left(\frac{13}{14}\right) = 0.371$$

$$GainRatio(Decision, Humidity <> 65) = 0.126$$
Next, check 70 as the humidity threshold.
$$Entropy(Decision|Humidity \le 70) = -\frac{1}{4} \cdot \log_2\left(\frac{1}{4}\right) - \frac{3}{4} \cdot \log_2\left(\frac{3}{4}\right) = 0.811$$

$$Entropy(Decision|Humidity > 70) = -\frac{4}{10} \cdot \log_2\left(\frac{4}{10}\right) - \frac{6}{10} \cdot \log_2\left(\frac{6}{10}\right) = 0.970$$

$$Gain(Decision, Humidity <> 70) = 0.940 - \frac{4}{14} \cdot 0.811 - \frac{10}{14} \cdot 0.970 = 0.940 - 0.232 - 0.693 = 0.015$$

$$SplitInfo(Decision, Humidity <> 70) = -\frac{4}{14} \cdot \log_2\left(\frac{4}{14}\right) - \frac{10}{14} \cdot \log_2\left(\frac{10}{14}\right) = 0.863$$

$$GainRatio(Decision, Humidity <> 70) = 0.017$$
Next, check 75 as the humidity threshold.
$$Entropy(Decision|Humidity \le 75) = -\frac{1}{5} \cdot \log_2\left(\frac{1}{5}\right) - \frac{4}{5} \cdot \log_2\left(\frac{4}{5}\right) = 0.721$$

$$Entropy(Decision|Humidity > 75) = -\frac{4}{9} \cdot \log_2\left(\frac{4}{9}\right) - \frac{5}{9} \cdot \log_2\left(\frac{5}{9}\right) = 0.991$$

$$Gain(Decision, Humidity <> 75) = 0.940 - \frac{5}{14} \cdot 0.721 - \frac{9}{14} \cdot 0.991 = 0.940 - 0.258 - 0.637 = 0.045$$

$$SplitInfo(Decision, Humidity <> 75) = -\frac{5}{14} \cdot \log_2\left(\frac{5}{14}\right) - \frac{9}{14} \cdot \log_2\left(\frac{9}{14}\right) = 0.940$$

$$GainRatio(Decision, Humidity <> 75) = 0.048$$
The procedure should be clear by now, so we skip the detailed calculations and list the results for the remaining candidate thresholds directly.
$$Gain(Decision, Humidity <> 78) = 0.090, \quad GainRatio(Decision, Humidity <> 78) = 0.090$$

$$Gain(Decision, Humidity <> 80) = 0.101, \quad GainRatio(Decision, Humidity <> 80) = 0.107$$

$$Gain(Decision, Humidity <> 85) = 0.024, \quad GainRatio(Decision, Humidity <> 85) = 0.027$$

$$Gain(Decision, Humidity <> 90) = 0.010, \quad GainRatio(Decision, Humidity <> 90) = 0.016$$

$$Gain(Decision, Humidity <> 95) = 0.048, \quad GainRatio(Decision, Humidity <> 95) = 0.128$$
Humidity never exceeds 96 in this dataset, so we stop the iteration here.
As the calculations above show, the gain is maximized when the threshold equals 80. This means that "Humidity <> 80" is the split we carry forward when comparing attributes.
Let's summarize the gains and gain ratios we have computed. The Outlook attribute has both the highest gain and the highest gain ratio, which means Outlook belongs at the root node of the decision tree.
Attribute | Gain | GainRatio |
---|---|---|
Wind | 0.049 | 0.049 |
Outlook | 0.246 | 0.155 |
Humidity <> 80 | 0.101 | 0.107 |
After that, we apply the same steps as in ID3 and build the following decision tree. Outlook is placed at the root node; now we need to find a strategy for each of its branches.
Outlook = Sunny
We split humidity into greater than 80 and less than or equal to 80. Remarkably, whenever outlook = sunny and humidity > 80, the decision is always no; likewise, whenever outlook = sunny and humidity ≤ 80, the decision is always yes.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
1 | Sunny | 85 | Yes | Weak | No |
2 | Sunny | 80 | Yes | Strong | No |
8 | Sunny | 72 | Yes | Weak | No |
9 | Sunny | 69 | No | Weak | Yes |
11 | Sunny | 75 | No | Strong | Yes |
Outlook = Overcast
If outlook = overcast, the decision is always yes, regardless of the other attributes.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
3 | Overcast | 83 | No | Weak | Yes |
7 | Overcast | 64 | No | Strong | Yes |
12 | Overcast | 72 | Yes | Strong | Yes |
13 | Overcast | 81 | No | Weak | Yes |
Outlook = Rain
Finally, look at the outlook = rain branch: if wind = weak the decision is yes, and if wind = strong the decision is no.
Day | Outlook | Temp. | Hum. > 80 | Wind | Decision |
---|---|---|---|---|---|
4 | Rain | 70 | Yes | Weak | Yes |
5 | Rain | 68 | No | Weak | Yes |
6 | Rain | 65 | No | Strong | No |
10 | Rain | 75 | No | Weak | Yes |
14 | Rain | 71 | No | Strong | No |
The final form of the decision tree is summarized below.
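Since the figure itself is not reproduced here, the finished tree can be summarized as a small predicate; the attribute and value names simply follow the tables above:

```python
def predict(outlook, humidity, wind):
    """The C4.5 tree derived above: Outlook at the root,
    a humidity threshold of 80 under Sunny, and Wind under Rain."""
    if outlook == "Sunny":
        return "Yes" if humidity <= 80 else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"

print(predict("Sunny", 85, "Weak"))  # No  (day 1)
print(predict("Rain", 80, "Weak"))   # Yes (day 10)
```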
Conclusion
In summary, the C4.5 algorithm solves most of the problems in ID3. It uses the gain ratio instead of the raw information gain, which lets it build more general trees without falling into overfitting. It also converts continuous attributes into nominal ones by choosing a gain-maximizing threshold, which is how it handles continuous data. Finally, it can ignore instances with missing values or handle them explicitly, so it copes with incomplete datasets.