Reading Notes on Machine Learning (1997) [CMU + T.M. Mitchell] - Chapter 3

This post examines the fundamentals of decision tree learning and where it applies, covering how a decision tree is built, how the best classification attribute is chosen, and how overfitting can be avoided; it also discusses the handling of continuous-valued attributes and alternative attribute selection measures.


/* ******* ******* ******* ******* ******* ******* ******* */ [*******] /* ******* ******* ******* ******* ******* ******* ******* */

<Chapter 1> URL: [ http://blog.youkuaiyun.com/qiqiduan_077/article/details/50021499 ]

<Chapter 2> URL: [ http://blog.youkuaiyun.com/qiqiduan_077/article/details/50180427 ]

START_DATE: 2015-12-12, END_DATE: -.

/* ******* ******* ******* ******* ******* ******* ******* */ [*******] /* ******* ******* ******* ******* ******* ******* ******* */


This chapter introduces a very classic family of ML classification algorithms: decision tree learning (DECISION TREE LEARNING). Although decision-tree-based classifiers come in many different versions (the versions differ in implementation details, and new versions may still be proposed), fortunately the underlying reasoning and the optimization essence behind the different versions are largely the same. For that reason, this chapter can focus on the underlying logic and optimization essence of decision tree learning rather than on overly specific implementation details (although those details matter too).


1. Chapter 3 · DECISION TREE LEARNING

A. What is "Decision Tree Learning"?

(1). "Decision tree learning is one of the most widely used and practical methods for inductive inference.

It is a method for approximating discrete-valued functions, in which the learned function is represented by a decision tree; it is robust to noisy data and capable of learning disjunctive expressions."

(2). "These decision tree learning methods search a completely expressive hypothesis space and thus avoid the difficulties of restricted hypothesis spaces. Their inductive bias is a preference for small trees over large trees."

(3). "Learned trees can also be re-presented as sets of if-then rules to improve human readability. In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions."

(4). "Most algorithms that have been developed for learning decision trees are variations ona core algorithm that employs a top-down, greedy search through the space of possible decision trees (such as theID3 algorithm [Quinlan 1986] and its successor C4.5 [Quinlan 1993])."

In everyday life, people also frequently use tree-like hierarchies (which can be regarded as an upgraded form of simple if-then rules) to organize attribute information and distinguish different objects and states. Plenty of real examples (e.g., criteria for choosing a partner, purchasing decisions) can be given to back this up.
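To make the "disjunction of conjunctions" representation quoted in (3) concrete, here is a small R sketch (my own illustration, not code from the book) of the chapter's familiar PlayTennis tree written as nested if-then rules; each root-to-leaf path is one conjunction of attribute tests, and the paths ending in "Yes" together form the disjunction.

# The PlayTennis tree from the chapter, written as nested if-then rules.
# Equivalent disjunction of conjunctions for "Yes":
#   (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast) OR
#   (Outlook = Rain AND Wind = Weak)
play_tennis <- function(outlook, humidity, wind) {
  if (outlook == "Sunny") {
    if (humidity == "High") "No" else "Yes"
  } else if (outlook == "Overcast") {
    "Yes"
  } else {                                  # outlook == "Rain"
    if (wind == "Strong") "No" else "Yes"
  }
}

play_tennis("Sunny", "Normal", "Weak")      # "Yes"
play_tennis("Rain", "High", "Strong")       # "No"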


B. What kinds of practical problems is "Decision Tree Learning" best suited to?

“Decision tree learning is generally best suited to problems with the following characteristics:

(1). Instances are represented by attribute-value pairs. The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values.

(2). The target function has discrete output values.

(3). Disjunctive descriptions may be required.

(4). The training data may contain errors.

(5). The training data may contain missing attribute values.”

Strictly speaking, with respect to items (4) and (5), decision tree algorithms are not inherently equipped to handle such problems. Extra work is needed (note in particular that this extra work is not part of the core idea of decision tree learning) to mitigate (rather than solve) problems (4) and (5); we set this aside for now.


C."Which Attribute Is the Best Classifier?"

(1). "The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree."

(2). "In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples."

(3). "One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member, which drawn at random with uniform probability."

Entropy is one of the basic concepts of information theory. Decision tree classification algorithms borrow this classic concept to compute the information gain. Fortunately, the formulas and the computation procedure for entropy and information gain take only a little time to master. For example, the R code corresponding to the book's computation of Gain(S, Temperature) is:

e_sum = -((5 / 14) * log(5 / 14, 2) + (9 / 14) * log(9 / 14, 2))

e_sep = -(4 / 14) * ((1 / 2) * log(1 / 2, 2) + (1 / 2) * log(1 / 2, 2)) -
        (6 / 14) * ((4 / 6) * log(4 / 6, 2) + (2 / 6) * log(2 / 6, 2)) -
        (4 / 14) * ((3 / 4) * log(3 / 4, 2) + (1 / 4) * log(1 / 4, 2))

e_sum - e_sep   # Gain(S, Temperature) ≈ 0.029
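The hand computation above generalizes directly. Below is a minimal R sketch (my own, not from the book) of reusable entropy and information-gain functions, checked against Gain(S, Temperature); the play/temp vectors are transcribed from the book's PlayTennis table.

# Entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p[p > 0] * log2(p[p > 0]))
}

# Information gain of splitting the labels y on the attribute vector a.
info_gain <- function(y, a) {
  w <- table(a) / length(a)
  entropy(y) - sum(sapply(names(w), function(v) w[[v]] * entropy(y[a == v])))
}

# PlayTennis labels and the Temperature attribute (days D1-D14 of the book's table).
play <- c("No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
          "Yes", "Yes", "Yes", "Yes", "Yes", "No")
temp <- c("Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild",
          "Cool", "Mild", "Mild", "Mild", "Hot", "Mild")
info_gain(play, temp)   # ~0.029, matching the value computed above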


D. What does the hypothesis search space of decision tree learning look like?

(1). "ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. The evaluation function that guides this hill-climbing search is the information gain measure."

(2). "ID3 's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes."

(3). "ID3 in its pure form performs no backtracking in its search. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal."

(4). "ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples."

In essence, ID3 adopts a hill-climbing search strategy (its objective is to maximize the information gain at each node), i.e., it hopes to reach a global maximum through local maximization (although this hope is often dashed!).
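To sketch what this greedy, no-backtracking search looks like in code, here is a deliberately minimal recursive ID3 outline in R (my own simplification, not the book's pseudocode): at every node it commits to the attribute with the highest information gain and never revisits that choice.

# Minimal ID3 sketch: greedy top-down induction, no backtracking.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p[p > 0] * log2(p[p > 0]))
}
info_gain <- function(y, a) {
  w <- table(a) / length(a)
  entropy(y) - sum(sapply(names(w), function(v) w[[v]] * entropy(y[a == v])))
}

id3 <- function(data, target, attrs) {
  y <- data[[target]]
  if (length(unique(y)) == 1) return(y[1])        # pure node -> leaf label
  if (length(attrs) == 0)                         # attributes exhausted
    return(names(which.max(table(y))))            # -> majority-class leaf
  gains <- sapply(attrs, function(a) info_gain(y, data[[a]]))
  best  <- attrs[which.max(gains)]                # the greedy, never-revised choice
  list(attribute = best,
       branches  = lapply(split(data, data[[best]]),
                          id3, target = target, attrs = setdiff(attrs, best)))
}

# Hypothetical usage, assuming tennis_df is a data frame holding the book's table:
# id3(tennis_df, "PlayTennis", c("Outlook", "Temperature", "Humidity", "Wind"))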


E. The "Inductive Bias" of the ID3 decision tree algorithm

(1). "A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees. Trees that place high information gain attributes close to the root are preferred over those that do not."

ID3's preference for shorter tree structures is exactly a reflection of the "Occam's razor" principle (but why should there be such a preference in the first place?).


F.“Restriction Bias vs. Preference Bias”

(1). "A preference bias (or a search bias) is a preference for certain hypotheses over others (e.g., for shorter hypotheses), with no hard restriction on the hypotheses that can be eventually enumerated. A restriction bias (or a language bias) is in the form of a categorical restriction on the set of hypotheses considered."

(2). "Typically, a preference bias is more desirable than a restriction bias, because it allows the learner to work within a complete hypothesis space that is assured to contain the unknown target function."

The Restriction/Preference-bias distinction has much in common with the familiar ML notion of the Bias-Variance Trade-off, but the former is explained mainly from the angle of Search (Restriction corresponds to searching an incomplete hypothesis space completely, while Preference corresponds to searching a complete hypothesis space incompletely).


G. Why prefer "shorter" hypotheses?

(1). "Occam's razor: Prefer the simplest hypothesis that fits the data."

(2). "The size of a hypothesis is determined by the particular representation used internally by the learner."

(3). "The question of which internal representations might arise from a process of evolution and natural selection."

(4). "Evolution will create internal representations that make the learning algorithm's inductive bias a self-fulfilling prophecy, simply because it can alter the representation easier than it can alter the learning algorithm."

(5). "Minimum Description Length principle, a version of Occam's razor that can be interpreted within a Bayesian framework."

One can of course rise to the level of philosophy to explain "why shorter hypotheses are preferred", but philosophy cannot verify the hypothesis, let alone provide a rigorous mathematical proof. Interestingly, physicists share this kind of preference (favoring more concise theoretical explanations and mathematical formulas). We set this hard question aside for now (Chapter 6 will explore it in detail). If you cannot hold back, that shows you are genuinely interested in the question (I am clearly not that kind of person).


H. The main practical issues for decision tree algorithms

(1). "avoiding over-fitting the data"

i. "Given a hypothesis space H, a hypothesis h is said to overfit the training data if there exists some alternative hypothesis h', such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances."

ii. "There are several approaches to avoiding over-fitting in decision tree learning. These can be grouped into two classes: A). approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data, B). approaches that allow the tree to overfit the data, and then post-prune the tree. Although the first of these approaches might seem more direct, the second approach of post-pruning overfit trees has been found to be more successful in practice. This is due to the difficult in the first approach of estimating precisely when to stop growing the tree."

iii. "Regardless of whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used to determine the correct final tree size. The most common is the training and validation set approach. The motivation is this: Even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations. Therefore, the validation set can be expected to provide a safety check against over-fitting the spurious characteristics of the training set. Of course, it is important that the validation set be large enough to itself provide a statistically significant samples of the instances. One common heuristic is to withhold one-third of the available examples for the validation set, using the other two-thirds for training."

iv. "Reduced-error pruning" + "rule post-pruning" ("Although this heuristic method is not statistically valid, it has nevertheless been found useful in practice.")


(2). "Incorporating continuous-valued attributes"

i. "For an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute Ac that is true if A < c and false otherwise. The only question is how to select the best value for the threshold c."

ii. "By sorting the examples according to the continuous attribute A, then identifying adjacent examples that differ in their target classification, we can generate a set of candidate thresholds midway between the corresponding values of A. It can be shown that the value of c that maximizes information gain must always lie at such a boundary (Fayyad 1991)."


(3). "Alternative Measures for Selecting Attributes"

i. "There is a natural bias in the information gain measure that favors attributes with many values over those with few values."

ii. "One way to avoid this difficulty is to select decision attributes based on some measure other than information gain. One alternative measure that has been used successfully is the gain ratio (Quinlan 1986)."


(4). "Handling training examples with missing attribute values"

i. "It is common to estimate the missing attribute value based on other examples for which this attribute has a known value."

ii. "One strategy for dealing with the missing attribute value is to assign it the value that the most common among training example at node n. Alternatively, we might assign it the most common value among examples at node n that have the classification c(x)."

iii. "A second, more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). "


Among the earliest work on decision tree learning is Hunt's Concept Learning System (CLS) [Hunt et al. 1966] and the work of Friedman and Breiman that resulted in CART [Friedman 1977; Breiman et al. 1984]. For further details on decision tree induction, an excellent book by Quinlan (1993) discusses many practical issues and provides executable code for C4.5.


Although this chapter discusses the underlying logic and the optimization essence of decision tree algorithms fairly comprehensively, one still cannot design an efficient, practical decision tree algorithm directly from the guidance above (at the implementation level there are many more details to consider and trade off). For that reason, in the next post I will, perhaps overreaching, examine the C4.5 decision tree algorithm from the perspective of algorithm design (mainly in Python, with some R and Matlab along the way).


Note: from 2015-12-12 to 2016-01-02 I read through the whole of Chapter 3 on and off, dawdling along the way (it even spanned "two calendar years"), which was genuinely not easy, so I should savor it; I might as well buy a bottle of beer to reward myself (JUST FOR FUN). I am really looking forward to the next chapter: artificial neural networks (which will probably take even more time).
