Constructing a Decision Tree with the ID3 Algorithm - Introduction, Concepts, and an Example

This article gives an in-depth introduction to the ID3 decision tree algorithm and its application. Starting from the algorithm's origin, it explains the concept of Information Gain in detail, shows through an example how ID3 generates a decision tree, and discusses some of ID3's limitations.


Quotes:
http://kelvinzh.spaces.live.com/blog/cns!D49A2DF3B9825B01!944.entry (Part 1)
http://kelvinzh.spaces.live.com/blog/cns!D49A2DF3B9825B01!953.entry (Part 2)

Part1:

ID3 Overview

In 1983, Professor J. Ross Quinlan introduced a simple decision tree learning algorithm called Iterative Dichotomiser 3 (ID3) at the University of Sydney. ID3 is a heuristic method. The basic idea is to generate a decision tree by a top-down, greedy search through the training data set: at each tree node it seeks the attribute that best separates the instances. Top-down means the tree is built recursively from the root downwards; greedy means that at each node ID3 commits to the locally best attribute and never revisits ("forgets") the earlier choices.

The intention of ID3 is to produce relatively small decision trees.

 

Basic ID3 Concepts

How does ID3 select the best attribute? To answer this question, a new metric called Information Gain is introduced. Information Gain is defined in terms of Entropy, and the resulting preference for small trees reflects Occam's Razor. Entropy measures the impurity, or uncertainty, of a collection of examples. Given a collection S whose members fall into m possible classes (X1, X2, ..., Xm):

Entropy(S) = Σ (i = 1..m) -p(Xi) * log2 p(Xi)

Where p(Xi) is the proportion of S belonging to class Xi.

To simplify for better understanding, let's assume that at each node all instances can be partitioned into two categories, namely Positive (P) and Negative (N). Then:

Entropy(S) = -p(P) * log2 p(P) - p(N) * log2 p(N)

Example

If S can be described using the following diagram:

[Figure ID3_1: collection S split into two sub-collections, Left (50% P, 50% N) and Right (roughly 67% of one class, 33% of the other)]

Then the corresponding Entropies are

Entropy(Left) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1

Entropy(Right) = -0.67 * log2(0.67) - 0.33 * log2(0.33) ≈ 0.92

Note that Entropy measures the impurity of a collection of training examples. Entropy is 0 if all members of S belong to the same class (i.e., S is (n+, 0-) or (0+, n-)); in other words, the data is perfectly classified.
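To make the formula concrete, here is a minimal Python sketch (the function name and the example class counts are my own illustration, not from the original post) that reproduces the Left and Right values above:

```python
import math

def entropy(counts):
    """Entropy of a collection, given the number of members in each class."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:              # by convention, 0 * log2(0) contributes 0
            p = c / total
            result -= p * math.log2(p)
    return result

# Any 50/50 split gives entropy 1; any 2:1 split gives roughly 0.92.
print(entropy([3, 3]))   # Left:  1.0
print(entropy([2, 1]))   # Right: 0.918...
```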

Information Gain is defined as the measure of expected reduction in Entropy [1].

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) * Entropy(Sv)

Where:

the sum runs over every value v that attribute A can take

Sv = subset of S for which attribute A has value v

|Sv| = number of elements in Sv

|S| = number of elements in S

 

Example:

Suppose we have a real estate data set S containing 14 examples with 4 attributes, one of which is Traffic Convenience. It can take two possible values: Convenient (C) and Inconvenient (I). The classification of each of the 14 examples indicates whether to buy the property or not. Among the 14 outcomes there are 9 YES (buy) and 5 NO (do not buy). There are 8 occurrences of Traffic = Convenient and 6 occurrences of Traffic = Inconvenient.

Attribute      Possible Values
Traffic        C (Convenient), I (Inconvenient)
Decision       p (yes / to buy), n (no / not to buy)

For Traffic = Convenient, 6 of the examples are YES and 2 are NO. For Traffic = Inconvenient, 3 are YES and 3 are NO.

Traffic     C   I   C   C   C   I   I   C   C   C   I   I   C   I
Decision    n   n   p   p   p   n   p   n   p   p   p   p   p   n

Therefore

Entropy(S) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.940

Entropy(SConvenient) = -(6/8) * log2(6/8) - (2/8) * log2(2/8) = 0.811

Entropy(SInconvenient) = -(3/6) * log2(3/6) - (3/6) * log2(3/6) = 1.00

Gain(S, Traffic) = Entropy(S) - (8/14) * Entropy(SConvenient) - (6/14) * Entropy(SInconvenient)
                 = 0.940 - (8/14) * 0.811 - (6/14) * 1.00
                 = 0.048
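These figures can be checked with a short script (a sketch; the helper name is my own):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Class counts from the example: 9 YES / 5 NO overall,
# 6 YES / 2 NO when Traffic = Convenient, 3 YES / 3 NO when Inconvenient.
e_s   = entropy([9, 5])   # ~0.940
e_con = entropy([6, 2])   # ~0.811
e_inc = entropy([3, 3])   #  1.000

gain_traffic = e_s - (8 / 14) * e_con - (6 / 14) * e_inc
print(round(gain_traffic, 3))   # 0.048
```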

 

ID3 Algorithm:

The ID3 algorithm can be summarized as follows: [2]

  1. Take all unused attributes and compute the entropy of the split they would produce on the training samples
  2. Choose the attribute for which the resulting entropy is minimum (equivalently, for which the Information Gain is maximum)
  3. Make a node containing that attribute
  4. Recurse on the resulting subsets with the remaining attributes, until all examples are classified or all attributes have been used

Taking a step-by-step view, the ID3 algorithm can be decomposed as follows:

ID3 (Examples, Target_Attribute, Attributes)

- Create a root node Root for the tree, containing the whole training set as its subset.

- If all examples are positive, return the single-node tree Root, with label = +

- If all examples are negative, return the single-node tree Root, with label = -

- If the set of predicting attributes is empty, return the single-node tree Root, with label = most common value of the target attribute in the examples

- Otherwise begin

   o  A ← the attribute that best classifies the examples (highest Information Gain)

   o  Decision tree attribute for Root ← A

   o  For each possible value vi of A:

       -  Add a new tree branch below Root, corresponding to the test A = vi

       -  Let Examples(vi) be the subset of examples that have the value vi for A

       -  If Examples(vi) is empty,

          then below this new branch add a leaf node with label = most common target value in the examples

          else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes - {A})

- End

- Return Root [2]

Another detailed explanation of the ID3 algorithm is provided by Professor Ernest Davis of New York University on his website; some pseudocode can be found at http://cs.nyu.edu/faculty/davise/ai/id3.pdf.
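For reference, below is a minimal Python sketch of the recursion described above. It is only an illustration: representing examples as dictionaries and the helper names are my own assumptions, and it iterates only over attribute values that actually occur in the data, so the "Examples(vi) is empty" case of the pseudocode does not arise here.

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the target attribute over a list of example dictionaries."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from splitting the examples on attr."""
    total = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples share one label: leaf
        return labels[0]
    if not attributes:                        # no attributes left: most common label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}                         # internal node testing 'best'
    for v in {e[best] for e in examples}:     # one branch per observed value
        subset = [e for e in examples if e[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, target, rest)
    return tree
```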

 

Part2:

An Example of Decision Tree Generation Using the ID3 Algorithm:

Now let us take a closer look at the earlier example about deciding whether to buy a real estate property. In addition to Traffic Convenience, we introduce three more attributes: Price, Location and Size.

Attribute      Possible Values
Location       Good, Average, Bad
Price          High, Moderate, Low
Size           Small, Large
Traffic        Convenient, Inconvenient
Decision       p (yes / to buy), n (no / not to buy)

All details of the training data set are listed below:

Property No    Location    Price       Size     Traffic         Decision
H1             Good        High        Small    Convenient      n
H2             Good        High        Small    Inconvenient    n
H3             Average     High        Small    Convenient      p
H4             Bad         Moderate    Small    Convenient      p
H5             Bad         Low         Large    Convenient      p
H6             Bad         Low         Large    Inconvenient    n
H7             Average     Low         Large    Inconvenient    p
H8             Good        Moderate    Small    Convenient      n
H9             Good        Low         Large    Convenient      p
H10            Bad         Moderate    Large    Convenient      p
H11            Good        Moderate    Large    Inconvenient    p
H12            Average     Moderate    Small    Inconvenient    p
H13            Average     High        Large    Convenient      p
H14            Bad         Moderate    Small    Inconvenient    n

According to the previous explanation, we first create a root node (rootNode) containing the whole training set as its subset, then compute its Entropy:

Decision    Property No
P           H3, H4, H5, H7, H9, H10, H11, H12, H13
N           H1, H2, H6, H8, H14

Entropy(rootNode.subset) = -(9/14)log2(9/14) - (5/14)log2(5/14)=0.940

For this node (rootNode) there are 4 unused attributes. Calculate the Information Gain of each of them:

Gain(S, Traffic) = Entropy(S) - (8/14) * Entropy(SConvenient) - (6/14) * Entropy(SInconvenient) = 0.048

Gain(S, Size) = 0.151

Gain(S, Price) = 0.029

Gain(S, Location) = 0.246

Select the attribute with the highest Information Gain, Location, as the first splitting criterion. Next, we need to decide which attribute to pick under each branch.

Again, calculate the information gain of the remaining three attributes for the subset where Location = Good.

Entropy(SGood)=0.970

Gain(SGood, Size) = 0.970

Gain(SGood, Price) = 0.570

Gain(SGood, Traffic) = 0.019

Hence Size is selected to create the node under the Location = Good branch. This process repeats until all data is perfectly classified or all attributes have been used.
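The gains quoted above can be reproduced with a short script (a sketch; the dataset is transcribed from the table and the helper names are my own). Note that exact arithmetic gives slightly different last digits (e.g. ~0.247 for Location) because the text rounds the intermediate entropies.

```python
import math
from collections import Counter

# (Location, Price, Size, Traffic, Decision), transcribed from the table above (H1..H14).
DATA = [
    ("Good",    "High",     "Small", "Convenient",   "n"),
    ("Good",    "High",     "Small", "Inconvenient", "n"),
    ("Average", "High",     "Small", "Convenient",   "p"),
    ("Bad",     "Moderate", "Small", "Convenient",   "p"),
    ("Bad",     "Low",      "Large", "Convenient",   "p"),
    ("Bad",     "Low",      "Large", "Inconvenient", "n"),
    ("Average", "Low",      "Large", "Inconvenient", "p"),
    ("Good",    "Moderate", "Small", "Convenient",   "n"),
    ("Good",    "Low",      "Large", "Convenient",   "p"),
    ("Bad",     "Moderate", "Large", "Convenient",   "p"),
    ("Good",    "Moderate", "Large", "Inconvenient", "p"),
    ("Average", "Moderate", "Small", "Inconvenient", "p"),
    ("Average", "High",     "Large", "Convenient",   "p"),
    ("Bad",     "Moderate", "Small", "Inconvenient", "n"),
]
ATTRS = {"Location": 0, "Price": 1, "Size": 2, "Traffic": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    i, n = ATTRS[attr], len(rows)
    remainder = sum(
        len(sub) / n * entropy(sub)
        for v in {r[i] for r in rows}
        for sub in [[r for r in rows if r[i] == v]]
    )
    return entropy(rows) - remainder

# Root-node gains: Location is the largest, so it becomes the first split.
for a in ATTRS:
    print(a, round(gain(DATA, a), 3))

# Gains within the Location = Good branch: Size is the largest.
good = [r for r in DATA if r[0] == "Good"]
for a in ("Size", "Price", "Traffic"):
    print(a, round(gain(good, a), 3))
```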

[Figure ID3_FinalTree: the final decision tree produced by ID3 for this data set]

 

Shortcomings of ID3:

Although ID3 is widely used and considered easy, straightforward and efficient, the method still has some shortcomings:

1. Attributes with more possible values tend to have higher Information Gain (a common remedy, the gain ratio, is sketched after this list).

2. Each node is split on a single attribute, so correlations or relationships between attributes are ignored. ID3 also prefers discrete attribute values.

3. Although ID3 can still be applied to continuous-valued attributes by first discretising them, this is still considered impractical.

4. ID3 does not have a good mechanism for dealing with noise and missing values.
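The first shortcoming is commonly addressed by dividing Information Gain by the attribute's "split information", giving the Gain Ratio used in Quinlan's later C4.5 algorithm. A minimal sketch (the helper names and row format are my own assumptions):

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(rows, attr_index, target_index=-1):
    """Information Gain normalised by Split Information.

    Split Information is the entropy of the attribute's own value
    distribution, so attributes with many distinct values are penalised.
    """
    n = len(rows)
    attr_values = [r[attr_index] for r in rows]
    remainder = sum(
        len(sub) / n * entropy([r[target_index] for r in sub])
        for v in set(attr_values)
        for sub in [[r for r in rows if r[attr_index] == v]]
    )
    gain = entropy([r[target_index] for r in rows]) - remainder
    split_info = entropy(attr_values)
    return gain / split_info if split_info > 0 else 0.0
```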

  

 

References:

  1. T. Mitchell, "Decision Tree Learning", in T. Mitchell, Machine Learning, The McGraw-Hill Companies, Inc., 1997, pp. 52-78.
  2. http://en.wikipedia.org/wiki/ID3_algorithm (accessed 18/09/2007)
  3. P. Winston, "Learning by Building Identification Trees", in P. Winston, Artificial Intelligence, Addison-Wesley Publishing Company, 1992, pp. 423-442.

 
