ISLR Chapter 8: The Basics of Decision Trees

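This article introduces the basic concepts of decision trees, including how regression trees and classification trees are built, and discusses the use of the Gini index and entropy in classification trees. It also covers the differences between decision trees and linear models, along with the advantages and disadvantages of trees. To improve prediction accuracy, it then introduces the basic principles of ensemble methods such as bagging, random forests, and boosting.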

 The Basics of Decision Trees

In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.

Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, such as those seen in Chapters 6 and 7, in terms of prediction accuracy. Hence in this chapter we also introduce bagging, random forests, and boosting. Each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

8.1 The Basics of Decision Trees

Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification.

8.1.1 Regression Trees

Prediction via Stratification of the Feature Space

We now discuss the process of building a regression tree. Roughly speaking, there are two steps.

  1.  We divide the predictor space, that is, the set of possible values for $X_1, X_2, \dots, X_p$, into $J$ distinct and non-overlapping regions, $R_1, R_2, \dots, R_J$.

  2.  For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.

We now elaborate on Step 1 above. How do we construct the regions $R_1, \dots, R_J$? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes $R_1, \dots, R_J$ that minimize the RSS, given by

$$\sum_{j=1}^{J} \sum_{i:\, x_i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2,$$

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box. Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $J$ boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting.
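A single greedy step of recursive binary splitting can be sketched as follows. This is only an illustration, assuming the usual formulation in which each step picks the predictor and cutpoint that give the lowest combined RSS over the two resulting regions. The helper names `rss` and `best_split` and the toy data are made up for this example.

```python
def rss(values):
    """Residual sum of squares of values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (predictor index j, cutpoint s, combined RSS) of the best single split.

    X is a list of rows (lists of floats); y is the list of responses.
    Rows with X[j] < s go to the left region, all others to the right region.
    """
    best = None
    for j in range(len(X[0])):
        # Candidate cutpoints: midpoints between consecutive distinct values.
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data (hypothetical): the response jumps once predictor 0 exceeds 3.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 3.0], [6.0, 6.0]]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
j, s, total = best_split(X, y)
print(j, s, round(total, 4))
```

Growing a full tree would repeat `best_split` inside each resulting region until a stopping rule is met. That recursion is left out here to keep the sketch short.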

Tree Pruning

The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. This is because the resulting tree might be too complex. Therefore, a better strategy is to grow a very large tree $T_0$, and then prune it back in order to obtain a subtree.

Cost complexity pruning, also known as weakest link pruning, gives us a way to do just this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter $\alpha$. For each value of $\alpha$ there corresponds a subtree $T \subset T_0$ such that

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha |T| \qquad (8.4)$$

is as small as possible.

Here $|T|$ indicates the number of terminal nodes of the tree $T$, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the $m$th terminal node, and $\hat{y}_{R_m}$ is the predicted response associated with $R_m$, that is, the mean of the training observations in $R_m$.

The tuning parameter $\alpha$ controls a trade-off between the subtree's complexity and its fit to the training data. When $\alpha = 0$, the subtree $T$ will simply equal $T_0$, because then (8.4) just measures the training error. However, as $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for a smaller subtree.
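The trade-off controlled by α can be seen in a small numeric sketch. It assumes we already have a handful of candidate subtrees, each summarized only by its number of terminal nodes and its training RSS. The numbers and the helper names `penalized_cost` and `select_subtree` are hypothetical, chosen purely for illustration.

```python
# Hypothetical subtrees of a large tree T0, each summarized as
# (number of terminal nodes |T|, training RSS). The numbers are made up.
subtrees = [(1, 100.0), (2, 60.0), (3, 45.0), (5, 38.0), (8, 35.0)]


def penalized_cost(n_leaves, train_rss, alpha):
    """Cost complexity criterion: training RSS plus alpha times |T|."""
    return train_rss + alpha * n_leaves


def select_subtree(subtrees, alpha):
    """Return the (|T|, RSS) pair that minimizes the penalized criterion."""
    return min(subtrees, key=lambda t: penalized_cost(t[0], t[1], alpha))


for alpha in [0.0, 2.0, 5.0, 20.0, 50.0]:
    n_leaves, train_rss = select_subtree(subtrees, alpha)
    print(f"alpha={alpha:5.1f} -> |T|={n_leaves}, RSS={train_rss}")
```

With α = 0 the largest tree wins because only the training error counts, and larger values of α select progressively smaller subtrees. In practice α would be chosen by cross-validation rather than read off a table like this.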

    

     

 

8.1.2 Classification Trees 

The task of growing a classification tree is quite similar to the task of growing a regression tree. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting, RSS cannot be used as a criterion for making the binary splits. A natural alternative to RSS is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class:

$$E = 1 - \max_{k} \left(\hat{p}_{mk}\right).$$

Here $\hat{p}_{mk}$ represents the proportion of training observations in the $m$th region that are from the $k$th class. However, it turns out that classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable.

The Gini index is defined by

$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left(1 - \hat{p}_{mk}\right),$$

a measure of total variance across the $K$ classes. It is not hard to see that the Gini index takes on a small value if all of the $\hat{p}_{mk}$'s are close to zero or one. For this reason the Gini index is referred to as a measure of node purity: a small value indicates that a node contains predominantly observations from a single class.

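An alternative to the Gini index is entropy, given by

$$D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.$$

Since $0 \le \hat{p}_{mk} \le 1$, each term $-\hat{p}_{mk} \log \hat{p}_{mk}$ is nonnegative, and the entropy is near zero when the $\hat{p}_{mk}$'s are all near zero or near one. Therefore, like the Gini index, the entropy will take on a small value if the $m$th node is pure. In fact, it turns out that the Gini index and the entropy are quite similar numerically.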
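The three node impurity measures can be compared directly on a few class-proportion vectors. The sketch below assumes the natural logarithm for the entropy and treats a zero proportion as contributing zero. The function names are chosen for this example only.

```python
import math


def classification_error(p):
    """1 minus the largest class proportion in the node."""
    return 1.0 - max(p)


def gini(p):
    """Gini index: sum over classes of p_k * (1 - p_k)."""
    return sum(pk * (1.0 - pk) for pk in p)


def entropy(p):
    """Entropy with natural log; classes with p_k == 0 contribute 0."""
    return sum(-pk * math.log(pk) for pk in p if pk > 0)


# Class proportions for a pure node, a nearly pure node, and an evenly mixed node.
for p in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(p,
          round(classification_error(p), 3),
          round(gini(p), 3),
          round(entropy(p), 3))
```

All three measures are zero for the pure node and largest for the evenly mixed node, which is the sense in which they measure node purity.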

8.1.3 Trees Versus Linear Models

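Linear regression assumes a model of the form

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j,$$

whereas regression trees assume a model of the form

$$f(X) = \sum_{m=1}^{M} c_m \cdot 1_{(X \in R_m)},$$

where $R_1, \dots, R_M$ represent a partition of feature space.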

 

8.1.4 Advantages and Disadvantages of Trees

8.2 Bagging, Random Forests, Boosting

8.2.1 Bagging

Given a set of $n$ independent observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of their mean $\bar{Z}$ is $\sigma^2/n$. In other words, averaging a set of observations reduces variance. Hence a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.

This is not practical, because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set. In this approach we generate $B$ different bootstrapped training data sets. We then train our method on the $b$th bootstrapped training set in order to get $\hat{f}^{*b}(x)$, and finally average all the predictions, to obtain

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$

This is called bagging.
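The bagging recipe can be sketched in a few lines. As an assumption made to keep the example short, the base learner here is a one-split regression tree (a stump) on a single predictor, whereas bagging is normally applied to deep, unpruned trees. The names `fit_stump` and `bagged_predict` and the toy data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single predictor."""
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)  # (rss, cutpoint, left mean, right mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if rss < best[0]:
            best = (rss, s, ml, mr)
    _, s, ml, mr = best
    # If the sample holds a single distinct x value, always predict the overall mean.
    return lambda x: ml if (s is None or x < s) else mr


def bagged_predict(xs, ys, x_new, B=200, seed=0):
    """Average the predictions of B stumps, each fit to one bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        f_b = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(f_b(x_new))
    return sum(preds) / B


# Toy data (hypothetical): the response is near 1 for x <= 4 and near 5 above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.3, 0.8, 1.1, 4.9, 5.2, 5.0, 4.7]
pred_low = bagged_predict(xs, ys, 2.0)
pred_high = bagged_predict(xs, ys, 7.0)
print(round(pred_low, 2), round(pred_high, 2))
```

For classification, the averaging step would be replaced by a majority vote over the B predicted classes.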

Out-of-Bag Error Estimation

It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach.

Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.
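The two-thirds figure can be checked numerically. A given observation is missed by all n draws of a bootstrap sample with probability (1 − 1/n)^n, so it is in the bag with probability 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows. The short simulation below compares that formula with the observed fraction. The helper name `in_bag_fraction` and the choice of n = 1000 and 200 repetitions are arbitrary.

```python
import random


def in_bag_fraction(n, rng):
    """Fraction of n observations that appear at least once in one bootstrap sample."""
    sample = {rng.randrange(n) for _ in range(n)}
    return len(sample) / n


n = 1000
rng = random.Random(0)
fractions = [in_bag_fraction(n, rng) for _ in range(200)]
simulated = sum(fractions) / len(fractions)
expected = 1 - (1 - 1 / n) ** n  # probability that a given observation is in the bag
print(round(simulated, 3), round(expected, 3))
```

The roughly 37% of observations left out of each sample are what make an OOB error estimate possible: each observation can be predicted using only the trees that never saw it.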

 

8.2.2 Random Forests

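Random forests are an improvement over bagged trees. As in bagging, we build a number of decision trees on bootstrapped training samples. When building these trees, however, each time a split is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically m ≈ √p.

The main difference between bagging and random forests is the choice of predictor subset size m. When m = p, a random forest is equivalent to bagging.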
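The only step that separates a random forest from bagging, drawing a fresh set of m candidate predictors at every split, can be sketched on its own. The function name `candidate_predictors` is hypothetical, and rounding √p to the nearest integer is one reasonable reading of m ≈ √p, not the only one.

```python
import math
import random


def candidate_predictors(p, m=None, rng=random):
    """Draw the indices of m split candidates out of p predictors, without replacement."""
    if m is None:
        m = max(1, round(math.sqrt(p)))  # common default: m close to sqrt(p)
    return sorted(rng.sample(range(p), m))


rng = random.Random(1)
p = 16
for split in range(3):
    # A fresh subset is drawn every time a split is considered.
    print(f"split {split}: candidates {candidate_predictors(p, rng=rng)}")

# With m = p every predictor is a candidate at every split, which recovers bagging.
print(candidate_predictors(p, m=p, rng=rng))
```

In a full implementation the split search would then be restricted to the returned indices, so that a single strong predictor cannot dominate every tree.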

 

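8.2.3 Boosting

In boosting, the trees are grown sequentially: each tree is built using information from the previously grown trees.

Boosting has three tuning parameters: the number of trees $B$, the shrinkage parameter $\lambda$, and the number of splits $d$ in each tree.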
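The sequential idea can be sketched for regression. The example assumes the common recipe of fitting each new small tree to the current residuals, adding a shrunken copy of it to the model, and updating the residuals. Here `B` is the number of trees, `lam` stands for the shrinkage parameter λ, and each tree is a stump (d = 1 split). The function names and toy data are hypothetical.

```python
def fit_stump(xs, rs):
    """Fit a one-split regression tree (d = 1) to targets rs on one predictor."""
    overall = sum(rs) / len(rs)
    best = (float("inf"), None, overall, overall)  # (rss, cutpoint, left mean, right mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [r for x, r in zip(xs, rs) if x < s]
        right = [r for x, r in zip(xs, rs) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if rss < best[0]:
            best = (rss, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if (s is None or x < s) else mr


def boost(xs, ys, B=100, lam=0.1):
    """Boosted regression stumps; returns the fitted prediction function."""
    residuals = list(ys)  # the model starts at f(x) = 0, so residuals equal y
    stumps = []
    for _ in range(B):
        stump = fit_stump(xs, residuals)  # fit to residuals, not to y
        stumps.append(stump)
        # Add a shrunken version of the new tree and update the residuals.
        residuals = [r - lam * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(lam * s(x) for s in stumps)


# Toy data (hypothetical): the response is near 1 for x <= 4 and near 5 above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.3, 0.8, 1.1, 4.9, 5.2, 5.0, 4.7]
f = boost(xs, ys)
baseline_mse = sum(y ** 2 for y in ys) / len(ys)  # error of the initial model f(x) = 0
boosted_mse = sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(round(baseline_mse, 3), round(boosted_mse, 4))
```

A small `lam` makes each tree contribute only a little, so more trees are needed. Unlike bagging, a very large `B` can overfit, which is why `B` is usually chosen by cross-validation.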

 

Reposted from: https://www.cnblogs.com/weiququ/p/8883724.html
