Reading Notes on Machine Learning (1997) [CMU + T.M. Mitchell] - Chapter 3

This post examines the fundamentals of decision tree learning and where it applies, covering how a decision tree is built, how the best classification attribute is chosen, and how overfitting can be avoided; it also discusses the handling of continuous-valued attributes and alternative attribute selection measures.


/* ******* ******* ******* ******* ******* ******* ******* */ [*******] /* ******* ******* ******* ******* ******* ******* ******* */

<Chapter 1> URL: [ http://blog.youkuaiyun.com/qiqiduan_077/article/details/50021499 ]

<Chapter 2> URL: [ http://blog.youkuaiyun.com/qiqiduan_077/article/details/50180427 ]

START_DATE: 2015-12-12, END_DATE: -.

/* ******* ******* ******* ******* ******* ******* ******* */ [*******] /* ******* ******* ******* ******* ******* ******* ******* */


This chapter introduces a very classic family of ML classification algorithms: decision tree learning (DECISION TREE LEARNING). Although decision-tree-based classifiers come in many different versions (the versions differ in implementation details, and new versions may still be proposed), fortunately the underlying reasoning and the optimization essence behind the different versions are largely the same. For that reason, this chapter can focus on the underlying logic and optimization essence of decision tree learning rather than on overly specific implementation details (although those details matter too).


1. Chapter 3 · DECISION TREE LEARNING

A. What is "Decision Tree Learning"?

(1). "Decision tree learning is one of the most widely used and practical methods for inductive inference.

It is a method for approximating discrete-valued functions, in which the learned function is represented by a decision tree; it is robust to noisy data and capable of learning disjunctive expressions."

(2). "These decision tree learning methods search a completely expressive hypothesis space and thus avoid the difficulties of restricted hypothesis spaces. Their inductive bias is a preference for small trees over large trees."

(3). "Learned trees can also be re-presented as sets of if-then rules to improve human readability. In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions."

(4). "Most algorithms that have been developed for learning decision trees are variations ona core algorithm that employs a top-down, greedy search through the space of possible decision trees (such as theID3 algorithm [Quinlan 1986] and its successor C4.5 [Quinlan 1993])."

In everyday life, people also frequently use tree-like hierarchies (which can be regarded as an upgraded form of simple if-then rules) to organize attribute information and distinguish different objects and states. Plenty of real examples (e.g., criteria for choosing a partner, purchasing decisions) can be given to back this up.
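To make the "disjunction of conjunctions" representation quoted in (3) concrete, here is a small R sketch (my own illustration, not code from the book) of the chapter's familiar PlayTennis tree written as nested if-then rules; each root-to-leaf path is one conjunction of attribute tests, and the paths ending in "Yes" together form the disjunction.

# The PlayTennis tree from the chapter, written as nested if-then rules.
# Equivalent disjunction of conjunctions for "Yes":
#   (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast) OR
#   (Outlook = Rain AND Wind = Weak)
play_tennis <- function(outlook, humidity, wind) {
  if (outlook == "Sunny") {
    if (humidity == "High") "No" else "Yes"
  } else if (outlook == "Overcast") {
    "Yes"
  } else {                                  # outlook == "Rain"
    if (wind == "Strong") "No" else "Yes"
  }
}

play_tennis("Sunny", "Normal", "Weak")      # "Yes"
play_tennis("Rain", "High", "Strong")       # "No"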


B. What kinds of practical problems is "Decision Tree Learning" best suited to?

“Decision tree learning is generally best suited to problems with the following characteristics:

(1). Instances are represented by attribute-value pairs. The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values.

(2). The target function has discrete output values.

(3). Disjunctive descriptions may be required.

(4). The training data may contain errors.

(5). The training data may contain missing attribute values.”

Strictly speaking, with respect to items (4) and (5), decision tree algorithms are not inherently equipped to handle such problems. Extra work is needed (note in particular that this extra work is not part of the core idea of decision tree learning) to mitigate (rather than solve) problems (4) and (5); we set this aside for now.


C."Which Attribute Is the Best Classifier?"

(1). "The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree. What is a good quantitative measure of the worth of an attribute? We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree."

(2). "In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples."

(3). "One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member, which drawn at random with uniform probability."

Entropy is one of the basic concepts of information theory. Decision tree classification algorithms borrow this classic concept to compute the information gain. Fortunately, the formulas and the computation procedure for entropy and information gain take only a little time to master. For example, the R code corresponding to the book's computation of Gain(S, Temperature) is:

e_sum = -((5 / 14) * log(5 / 14, 2) + (9 / 14) * log(9 / 14, 2))

e_sep = -(4 / 14) * ((1 / 2) * log(1 / 2, 2) + (1 / 2) * log(1 / 2, 2)) -
        (6 / 14) * ((4 / 6) * log(4 / 6, 2) + (2 / 6) * log(2 / 6, 2)) -
        (4 / 14) * ((3 / 4) * log(3 / 4, 2) + (1 / 4) * log(1 / 4, 2))

e_sum - e_sep   # Gain(S, Temperature) ≈ 0.029
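The hand computation above generalizes directly. Below is a minimal R sketch (my own, not from the book) of reusable entropy and information-gain functions, checked against Gain(S, Temperature); the play/temp vectors are transcribed from the book's PlayTennis table.

# Entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p[p > 0] * log2(p[p > 0]))
}

# Information gain of splitting the labels y on the attribute vector a.
info_gain <- function(y, a) {
  w <- table(a) / length(a)
  entropy(y) - sum(sapply(names(w), function(v) w[[v]] * entropy(y[a == v])))
}

# PlayTennis labels and the Temperature attribute (days D1-D14 of the book's table).
play <- c("No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
          "Yes", "Yes", "Yes", "Yes", "Yes", "No")
temp <- c("Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild",
          "Cool", "Mild", "Mild", "Mild", "Hot", "Mild")
info_gain(play, temp)   # ~0.029, matching the value computed above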


D. What does the hypothesis search space of decision tree learning look like?

(1). "ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree, then considering progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data. The evaluation function that guides this hill-climbing search is the information gain measure."

(2). "ID3 's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes."

(3). "ID3 in its pure form performs no backtracking in its search. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal."

(4). "ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples."

In essence, ID3 adopts a hill-climbing search strategy (its objective is to maximize the information gain at each node), i.e., it hopes to reach a global maximum through local maximization (although this hope is often dashed!).
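To sketch what this greedy, no-backtracking search looks like in code, here is a deliberately minimal recursive ID3 outline in R (my own simplification, not the book's pseudocode): at every node it commits to the attribute with the highest information gain and never revisits that choice.

# Minimal ID3 sketch: greedy top-down induction, no backtracking.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p[p > 0] * log2(p[p > 0]))
}
info_gain <- function(y, a) {
  w <- table(a) / length(a)
  entropy(y) - sum(sapply(names(w), function(v) w[[v]] * entropy(y[a == v])))
}

id3 <- function(data, target, attrs) {
  y <- data[[target]]
  if (length(unique(y)) == 1) return(y[1])        # pure node -> leaf label
  if (length(attrs) == 0)                         # attributes exhausted
    return(names(which.max(table(y))))            # -> majority-class leaf
  gains <- sapply(attrs, function(a) info_gain(y, data[[a]]))
  best  <- attrs[which.max(gains)]                # the greedy, never-revised choice
  list(attribute = best,
       branches  = lapply(split(data, data[[best]]),
                          id3, target = target, attrs = setdiff(attrs, best)))
}

# Hypothetical usage, assuming tennis_df is a data frame holding the book's table:
# id3(tennis_df, "PlayTennis", c("Outlook", "Temperature", "Humidity", "Wind"))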


E. The "Inductive Bias" of the ID3 decision tree algorithm

(1). "A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees. Trees that place high information gain attributes close to the root are preferred over those that do not."

ID3's preference for shorter tree structures is exactly a reflection of the "Occam's razor" principle (but why should there be such a preference in the first place?).


F.“Restriction Bias vs. Preference Bias”

(1). "A preference bias (or a search bias) is a preference for certain hypotheses over others (e.g., for shorter hypotheses), with no hard restriction on the hypotheses that can be eventually enumerated. A restriction bias (or a language bias) is in the form of a categorical restriction on the set of hypotheses considered."

(2). "Typically, a preference bias is more desirable than a restriction bias, because it allows the learner to work within a complete hypothesis space that is assured to contain the unknown target function."

The Restriction/Preference-bias distinction has much in common with the familiar ML notion of the Bias-Variance Trade-off, but the former is explained mainly from the angle of Search (Restriction corresponds to searching an incomplete hypothesis space completely, while Preference corresponds to searching a complete hypothesis space incompletely).


G. Why prefer "shorter" hypotheses?

(1). "Occam's razor: Prefer the simplest hypothesis that fits the data."

(2). "The size of a hypothesis is determined by the particular representation used internally by the learner."

(3). "The question of which internal representations might arise from a process of evolution and natural selection."

(4). "Evolution will create internal representations that make the learning algorithm's inductive bias a self-fulfilling prophecy, simply because it can alter the representation easier than it can alter the learning algorithm."

(5). "Minimum Description Length principle, a version of Occam's razor that can be interpreted within a Bayesian framework."

One can of course rise to the level of philosophy to explain "why shorter hypotheses are preferred", but philosophy cannot verify the hypothesis, let alone provide a rigorous mathematical proof. Interestingly, physicists share this kind of preference (favoring more concise theoretical explanations and mathematical formulas). We set this hard question aside for now (Chapter 6 will explore it in detail). If you cannot hold back, that shows you are genuinely interested in the question (I am clearly not that kind of person).


H. The main practical issues for decision tree algorithms

(1). "avoiding over-fitting the data"

i. "Given a hypothesis space H, a hypothesis h is said to overfit the training data if there exists some alternative hypothesis h', such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances."

ii. "There are several approaches to avoiding over-fitting in decision tree learning. These can be grouped into two classes: A). approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data, B). approaches that allow the tree to overfit the data, and then post-prune the tree. Although the first of these approaches might seem more direct, the second approach of post-pruning overfit trees has been found to be more successful in practice. This is due to the difficult in the first approach of estimating precisely when to stop growing the tree."

iii. "Regardless of whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used to determine the correct final tree size. The most common is the training and validation set approach. The motivation is this: Even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations. Therefore, the validation set can be expected to provide a safety check against over-fitting the spurious characteristics of the training set. Of course, it is important that the validation set be large enough to itself provide a statistically significant samples of the instances. One common heuristic is to withhold one-third of the available examples for the validation set, using the other two-thirds for training."

iv. "Reduced-error pruning" + "rule post-pruning" ("Although this heuristic method is not statistically valid, it has nevertheless been found useful in practice.")


(2). "Incorporating continuous-valued attributes"

i. "For an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute Ac that is true if A < c and false otherwise. The only question is how to select the best value for the threshold c."

ii. "By sorting the examples according to the continuous attribute A, then identifying adjacent examples that differ in their target classification, we can generate a set of candidate thresholds midway between the corresponding values of A. It can be shown that the value of c that maximizes information gain must always lie at such a boundary (Fayyad 1991)."


(3). "Alternative Measures for Selecting Attributes"

i. "There is a natural bias in the information gain measure that favors attributes with many values over those with few values."

ii. "One way to avoid this difficulty is to select decision attributes based on some measure other than information gain. One alternative measure that has been used successfully is the gain ratio (Quinlan 1986)."


(4). "Handling training examples with missing attribute values"

i. "It is common to estimate the missing attribute value based on other examples for which this attribute has a known value."

ii. "One strategy for dealing with the missing attribute value is to assign it the value that the most common among training example at node n. Alternatively, we might assign it the most common value among examples at node n that have the classification c(x)."

iii. "A second, more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). "


Among the earliest work on decision tree learning is Hunt's Concept Learning System (CLS) [Hunt et al. 1966] and the work of Friedman and Breiman that resulted in CART [Friedman 1977; Breiman et al. 1984]. For further details on decision tree induction, an excellent book by Quinlan (1993) discusses many practical issues and provides executable code for C4.5.


Although this chapter discusses the underlying logic and the optimization essence of decision tree algorithms fairly comprehensively, one still cannot design an efficient, practical decision tree algorithm directly from the guidance above (at the implementation level there are many more details to consider and trade off). For that reason, in the next post I will, perhaps overreaching, examine the C4.5 decision tree algorithm from the perspective of algorithm design (mainly in Python, with some R and Matlab along the way).


Note: from 2015-12-12 to 2016-01-02 I read through the whole of Chapter 3 on and off, dawdling along the way (it even spanned "two calendar years"), which was genuinely not easy, so I should savor it; I might as well buy a bottle of beer to reward myself (JUST FOR FUN). I am really looking forward to the next chapter: artificial neural networks (which will probably take even more time).
