The basic procedure of a random forest (given an m*n data matrix, where m is the number of samples and n is the feature dimension) is: 1. Training: for each tree, randomly select a small subset of r features, r << n (typically r is around sqrt(n)), and grow a decision tree; 2. Prediction: run the input through all the decision trees and take a vote; the class that receives the most votes is the predicted class.
A decision tree is constructed as follows, where the most informative attribute at each node is found by maximizing the Information Gain:
How to grow a Decision Tree
source : [3]
LearnUnprunedTree(X,Y)
Input: X a matrix of R rows and M columns where Xij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y a vector of R elements, where Yi = the output class of the i'th datapoint. The Yi values are categorical.
Output: An Unpruned decision tree
If all records in X have identical values in all their attributes (this includes the case where R < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in Y are the same, return a Leaf Node predicting this value as the output
Else
    select m variables at random out of the M variables
    For j = 1 .. m
        If the j'th attribute is categorical
            IG_j = IG(Y|X_j) (see Information Gain)
        Else (the j'th attribute is real-valued)
            IG_j = IG*(Y|X_j) (see Information Gain)
    Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
    If j* is categorical then
        For each value v of the j*'th attribute
            Let X^v = the subset of rows of X in which X_ij* = v. Let Y^v = the corresponding subset of Y
            Let Child^v = LearnUnprunedTree(X^v, Y^v)
        Return a decision tree node, splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v
    Else (j* is real-valued): let t be the best split threshold
        Let X^LO = the subset of rows of X in which X_ij* <= t. Let Y^LO = the corresponding subset of Y
        Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
        Let X^HI = the subset of rows of X in which X_ij* > t. Let Y^HI = the corresponding subset of Y
        Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
        Return a decision tree node, splitting on the j*'th attribute. It has two children corresponding to whether the j*'th attribute is below or above the threshold t.
Note: There are alternatives to Information Gain for splitting nodes
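A minimal Python sketch of LearnUnprunedTree follows. It is my own rendering of the pseudocode above, not an implementation from the quoted sources: the dict-based node representation, the helper names (entropy, split_score, learn_unpruned_tree), and the assumption that categorical columns hold strings and real-valued columns hold numbers are all illustrative choices, and a couple of extra guards (stopping when no candidate attribute yields positive gain) are added to guarantee termination.

import math
import random
from collections import Counter

def entropy(ys):
    # H(Y) = -sum_i p_i log2(p_i), computed from the class frequencies in ys
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def split_score(xs, ys):
    # Returns (information gain, threshold); threshold is None for categorical attributes.
    h_y, n = entropy(ys), len(ys)
    if isinstance(xs[0], str):                       # categorical: IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v)
        cond = 0.0
        for v in set(xs):
            group = [y for x, y in zip(xs, ys) if x == v]
            cond += len(group) / n * entropy(group)
        return h_y - cond, None
    best_gain, best_t = 0.0, None                    # real-valued: IG*(Y|X) = max_t IG(Y|X:t)
    for t in sorted(set(xs))[:-1]:
        lo = [y for x, y in zip(xs, ys) if x <= t]
        hi = [y for x, y in zip(xs, ys) if x > t]
        gain = h_y - (len(lo) / n * entropy(lo) + len(hi) / n * entropy(hi))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

def learn_unpruned_tree(X, Y, m):
    # X: list of R rows (each a list of M attribute values); Y: list of R class labels;
    # m: number of attributes sampled at random at each node (m <= M, and m << M in a random forest).
    if len(set(Y)) == 1 or all(row == X[0] for row in X):
        return {"leaf": Counter(Y).most_common(1)[0][0]}     # leaf: majority (or only) class
    candidates = random.sample(range(len(X[0])), m)          # m variables at random out of M
    scored = [(j,) + split_score([row[j] for row in X], Y) for j in candidates]
    j_star, gain, t = max(scored, key=lambda s: s[1])        # j* = argmax_j IG_j
    if gain <= 0.0:                                          # none of the m candidates separates Y
        return {"leaf": Counter(Y).most_common(1)[0][0]}
    if t is None:                                            # categorical split: one child per value
        children = {}
        for v in set(row[j_star] for row in X):
            rows = [i for i, row in enumerate(X) if row[j_star] == v]
            children[v] = learn_unpruned_tree([X[i] for i in rows], [Y[i] for i in rows], m)
        return {"attr": j_star, "children": children}
    lo = [i for i, row in enumerate(X) if row[j_star] <= t]  # real-valued split at threshold t
    hi = [i for i, row in enumerate(X) if row[j_star] > t]
    return {"attr": j_star, "t": t,
            "lo": learn_unpruned_tree([X[i] for i in lo], [Y[i] for i in lo], m),
            "hi": learn_unpruned_tree([X[i] for i in hi], [Y[i] for i in hi], m)}

Calling learn_unpruned_tree(X, Y, m) repeatedly on bootstrap samples is essentially what the random-forest procedure further below does.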
Note that finding the maximum information gain differs between categorical and real-valued attributes; only the real-valued case is spelled out here: IG*(Y|X_j) = max_t IG(Y|X_j:t), which simultaneously determines the best split threshold t.
entropy = -sum_i p_i log(p_i) (entropy is the H function below). High entropy means the variable has a "boring", near-uniform distribution, i.e. the probabilities of its different values are close to each other; low entropy means the variable has a varied (peaks-and-valleys) distribution, i.e. one or two values are taken with especially high probability. Our goal is to find the split that leaves Y with low entropy: since information gain = H(Y) - H(Y|X) and H(Y) is fixed, the lower H(Y|X) is, the more concentrated the class distribution of Y becomes within each branch, so the more samples the split classifies cleanly, and the earlier that attribute should be chosen as a branching node.
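As a quick numeric illustration (my own toy numbers): a binary variable with P = (0.5, 0.5) has entropy H = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit (the near-uniform, "boring" case), whereas P = (0.9, 0.1) gives H ≈ 0.47 bits (the peaked case); a split whose branches leave Y looking like the second distribution has low H(Y|X) and therefore high information gain.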
Information gain
source : [3]
suppose X can have one of m values V_1, V_2, ..., V_m
P(X=V_1) = p_1, P(X=V_2) = p_2, ..., P(X=V_m) = p_m
H(X) = -sum_{j=1..m} p_j log2(p_j) (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_j p_j H(Y|X=V_j)
IG(Y|X) = H(Y) - H(Y|X)
suppose X is real-valued
define IG(Y|X:t) as H(Y) - H(Y|X:t)
define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
define IG*(Y|X) = max_t IG(Y|X:t)
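A small worked example of these definitions (my own toy numbers): take 8 records with 4 of class A and 4 of class B, so H(Y) = 1 bit. If a binary attribute X puts all 4 A's in one branch and all 4 B's in the other, then H(Y|X) = 0.5*0 + 0.5*0 = 0 and IG(Y|X) = 1 bit (a perfect split); if instead each branch receives 2 A's and 2 B's, then H(Y|X) = 1 and IG(Y|X) = 0 (a useless split).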
How to grow a Random Forest
source : [1]
Each tree is grown as follows:
- if the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
- if there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
- each tree is grown to the largest extent possible. There is no pruning.
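A short sketch of this bagging-plus-voting loop, assuming scikit-learn's DecisionTreeClassifier as the per-tree learner (its max_features="sqrt" option plays the role of selecting m variables at random at each node, and it grows unpruned trees by default); the grow_forest/forest_predict names and the iris data are illustrative choices, not from the quoted source:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, seed=0):
    # Grow n_trees unpruned trees, each on a bootstrap sample (N cases drawn with replacement)
    # and with a random subset of sqrt(M) variables considered at every node.
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=len(X), replace=True)
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    # Majority vote over the individual tree predictions.
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
forest = grow_forest(X, y)
print((forest_predict(forest, X) == y).mean())   # resubstitution accuracy, just a sanity check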
Random Forest parameters
source : [2]
Random Forests are easy to use; the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables.
Breiman's recommendations are to pick a large number of trees, as well as the square root of the number of variables for m.
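For reference, in scikit-learn these two parameters correspond directly to n_estimators and max_features; a minimal usage sketch (the concrete count of 500 trees is just a placeholder):

from sklearn.ensemble import RandomForestClassifier

# Many trees, and m = sqrt(number of variables) considered at each split.
clf = RandomForestClassifier(n_estimators=500, max_features="sqrt")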