SAS Module 5 Decision Tree

CART Decision Trees Explained

Original post · Latest recommended article published 2021-03-29 14:09:27 · 519 views

License: CC 4.0 BY-SA

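This article explores the principles and application of CART decision trees in depth, including how classification trees and regression trees are built, how features are selected by minimizing RSS and the Gini index, and how to avoid overfitting. It also discusses the strengths and limitations of decision trees.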

SAS

CART: Classification And Regression Tree

For classification trees, the final prediction is the mode (most common class) of the training observations in the region
For regression trees, the final prediction is the mean of the training observations in the region
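The two prediction rules above can be illustrated with a minimal sketch. The function name `leaf_prediction` and its `task` argument are hypothetical, chosen only for this example; the sketch assumes the responses of the training observations in one region have already been collected into a list.

```python
from statistics import mean, mode

def leaf_prediction(responses, task):
    """Return the prediction for one region (leaf) from its training responses."""
    if task == "classification":
        return mode(responses)   # most common class in the region
    if task == "regression":
        return mean(responses)   # average response in the region
    raise ValueError("task must be 'classification' or 'regression'")

print(leaf_prediction(["Yes", "No", "Yes"], "classification"))  # Yes
print(leaf_prediction([10.0, 12.0, 14.0], "regression"))        # 12.0
```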

Regression Tree:

We divide the predictor space into J distinct, non-overlapping regions R1, R2, …, RJ
For each region, the prediction is the mean of the response values of the training observations in that region
For each candidate split, calculate the RSS within each resulting region, add them up, and keep the split with the smallest total RSS (this can be done in Excel)
That is, we look for the predictor Xj and cut-point s such that splitting into {X | Xj < s} and {X | Xj >= s} gives the smallest total RSS
We take a top-down, greedy approach known as recursive binary splitting: repeat the above process within each resulting region
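The search for a single cut-point can be sketched as below. This is an illustrative sketch for one predictor only, not the procedure of any particular software: the function names `rss` and `best_split`, the toy data, and the choice of midpoints between consecutive distinct values as candidate cut-points are all assumptions made for the example.

```python
def rss(values):
    """Residual sum of squares of values around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Find the cut-point s on one predictor x that minimizes total RSS.

    Returns (s, total_rss), or None if x has fewer than two distinct values.
    Candidate cut-points are midpoints between consecutive distinct
    sorted x values (an illustrative choice).
    """
    xs = sorted(set(x))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        s = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < s]    # region {X | Xj < s}
        right = [yi for xi, yi in zip(x, y) if xi >= s]  # region {X | Xj >= s}
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (s, total)
    return best

# Toy data (made up for illustration): two clearly separated groups.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
print(best_split(x, y))  # (6.5, 4.0)
```

With several predictors, the same search would be run on each predictor and the pair (Xj, s) with the smallest total RSS kept; growing the tree then repeats the search inside each of the two resulting regions.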

When to stop? Prune the tree:

If the tree is not pruned, it is likely to overfit the training data, leading to poor test set performance
A smaller tree may have lower variance and be easier to interpret, at the cost of a little bias
The strategy is to grow a very large tree first, and then prune it back to a subtree
Prune the tree to the number of terminal nodes that gives the smallest MSE (mean squared error) on validation data or by cross-validation
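One common way to carry out this strategy is cost complexity (weakest link) pruning. The notes do not say which criterion is used, so the formula below is an assumption based on standard CART practice. For a tuning parameter α ≥ 0, we look for the subtree T of the large tree that minimizes

```latex
\sum_{m=1}^{|T|} \; \sum_{i:\, x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 \;+\; \alpha \, |T|
```

Here |T| is the number of terminal nodes of T, R_m is the region of the m-th terminal node, and \hat{y}_{R_m} is the mean response of the training observations in R_m. With α = 0 the full tree is kept; larger values of α penalize extra terminal nodes and give smaller subtrees. The value of α, and with it the tree size, is typically chosen by cross-validated MSE.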

Classification Tree:

Similar to a regression tree, except that it is used to predict a qualitative (categorical) response
The prediction for a region is its most commonly occurring class (mode)
Still uses recursive binary splitting, but the split criterion is the Gini index rather than RSS
In the Gini index, m indexes the region and i indexes the class (Yes/No, 1/0, T/F). The Gini index of a split is the average of the regions' Gini indices, weighted by the number of observations in each region
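The Gini calculation can be sketched as follows, using the standard definition G_m = Σ_i p_mi (1 − p_mi), where p_mi is the proportion of training observations in region m that belong to class i. The function names `gini` and `weighted_gini` and the toy labels are hypothetical, made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini index of one region: sum over classes of p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def weighted_gini(regions):
    """Gini index of a split: region Gini values weighted by region size."""
    total = sum(len(r) for r in regions)
    return sum(len(r) / total * gini(r) for r in regions)

# Toy split (made up for illustration): a mixed region and a pure region.
left = ["Yes", "Yes", "Yes", "No"]
right = ["No", "No"]
print(gini(left))   # 0.375
print(gini(right))  # 0.0
print(weighted_gini([left, right]))
```

A pure region has a Gini index of 0, so the split here scores about 0.25 (4/6 × 0.375 + 2/6 × 0). Among candidate splits, the one with the smallest weighted Gini index is preferred.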

Advantages of Trees:

Easy to interpret
More closely mirror human decision-making process
Can be displayed graphically
Trees work for both qualitative and quantitative responses, and qualitative predictors can be used without creating extra dummy variables

Disadvantages of Trees:

Generally do not reach the same level of predictive accuracy as other approaches (but bagging, random forests, and boosting can improve this)
High variance: different partitions of the same data set may produce quite different trees
Instability: very minor changes in the data can result in significantly different trees