error = bias + variance
The error of a supervised classifier is made up of two parts: bias and variance. High bias means underfitting, i.e. the model cannot adequately describe the training data; high variance means over-fitting, i.e. even if the model describes the training data accurately, it falls apart on new data.
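For reference, the textbook decomposition under squared loss (a standard result, stated here for context rather than taken from the original post) splits the expected error into squared bias, variance, and irreducible noise:

$$
\mathbb{E}\Big[\big(y - \hat{f}(x)\big)^2\Big]
= \underbrace{\operatorname{Bias}\big[\hat{f}(x)\big]^2}_{\text{high } \Rightarrow \text{ underfitting}}
+ \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{high } \Rightarrow \text{ over-fitting}}
+ \sigma^2
$$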
Ensembling is a very popular approach these days. My long-standing impression was that ensembles perform so well because, in theory, ensembling weak classifiers keeps driving the bias down, so the error approaches a bound set by the variance. Random Forest, AdaBoost and Gradient Boosted Trees are the classic ensemble algorithms and the workhorse models of all kinds of competitions. I had always assumed they all rest on that same nice theoretical foundation; after actually looking it up today, I found out I was wrong.
The original answer comes from here:
- Boosting is based on weak learners (high bias, low variance). In terms of decision trees, weak learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves). Boosting reduces error by reducing bias (and also to some extent variance, by aggregating the output from many models).
- On the other hand, Random Forest uses, as you said, fully grown decision trees (low bias, but high variance), because it tackles the error reduction task in the opposite way: that is, by reducing variance. It makes the trees uncorrelated to maximize the decrease in variance, but it can't reduce bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for the bias to be initially as low as possible, and to have large, unpruned trees.
Please note that unlike Boosting (which is sequential), RF grows trees in parallel. The term iterative that you used is thus inappropriate.
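To make the weak-vs-strong contrast in the quote concrete, here is a minimal sketch (the toy dataset and parameter values are my own illustrative assumptions) comparing a decision stump with an unpruned tree. Typically the stump scores modestly on both train and test (high bias), while the fully grown tree fits the training set almost perfectly but drops more on the test set (high variance).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy dataset (an assumption, not from the quoted answer).
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learner: a decision stump (two leaves) -> high bias, low variance.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
# Strong learner: an unpruned, fully grown tree -> low bias, high variance.
full_tree = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)

for name, model in [("stump", stump), ("full tree", full_tree)]:
    print(name,
          "train acc:", round(model.score(X_train, y_train), 3),
          "test acc:", round(model.score(X_test, y_test), 3))
```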
RF is a bagging (Bootstrap aggregating) method: each tree is built on a sample drawn with replacement from the full dataset. Even though each such tree is itself a strong classifier prone to over-fitting (high variance), ensembling a large number of trees lowers the variance and reduces the chance of over-fitting. In the final vote, every tree has an equal say.
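A minimal sketch of this bagging idea, written by hand so the bootstrap-and-vote mechanism is visible (the dataset, the number of trees, and the use of scikit-learn's DecisionTreeClassifier as the fully grown base tree are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy dataset (assumption, not from the original post).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 100
trees = []
for _ in range(n_trees):
    # Bootstrap: sample the training set with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Fully grown tree: low bias, high variance on its own.
    tree = DecisionTreeClassifier()  # no depth limit
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Equal-weight majority vote over all trees.
votes = np.mean([t.predict(X_test) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
print("bagged accuracy:", (y_pred == y_test).mean())
```

A real Random Forest additionally subsamples the candidate features at each split (the max_features parameter), which is what the quoted answer means by making the trees uncorrelated.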
AdaBoost and Gradient Boosted Trees both belong to the boosting family: first build a weak classifier (high bias, low variance), then build the next tree to make up for its shortcomings. The difference is that AdaBoost adjusts the weights through a closed-form formula, while Gradient Boosting uses the gradient to find the best direction of improvement. In other words, each weak classifier gets its own weight: the smart classifiers cast more votes at prediction time and the dumb ones fewer, which is how the bias is driven down.
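A minimal sketch of the AdaBoost reweighting formula with decision stumps (labels relabeled to {-1, +1}; the dataset, number of rounds, and small epsilon are illustrative assumptions, not from the original post):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
y = 2 * y - 1                      # relabel to {-1, +1} for the AdaBoost math

n_rounds = 50
w = np.full(len(X), 1.0 / len(X))  # start with uniform sample weights
stumps, alphas = [], []
for _ in range(n_rounds):
    # Weak learner: a decision stump (high bias, low variance).
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    # Classifier weight: accurate ("smart") stumps get a bigger vote.
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    # Reweight samples: mistakes get heavier, so the next stump focuses on them.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: alpha-weighted vote of all stumps.
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(score) == y).mean())
```

Gradient Boosting differs in that each new tree is fit to the negative gradient of the loss (the residuals, in the squared-loss case) rather than to reweighted samples.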
In short, the theoretical picture I had in mind only applies to boosting algorithms. There are many other ensemble methods as well, such as Stacking and the Bayes optimal classifier (though clearly less common than boosting and bagging), and different ensembles come with different backgrounds and strengths.
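For completeness, a minimal sketch of Stacking using scikit-learn's StackingClassifier (the base learners, meta-learner, and dataset are illustrative choices, not from the original post): a level-1 model learns how to combine the predictions of several level-0 models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Level-0 learners; their out-of-fold predictions become the features
# for the level-1 meta-learner (a logistic regression here).
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=2)),
        ("svm", SVC(probability=True, random_state=2)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```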