The basic procedure of a random forest (given an m*n data matrix, where m is the number of samples and n is the feature dimension) is: 1. Training: for each tree, randomly select a small subset of r features, r << n (typically r is around sqrt(n)), and grow a decision tree; 2. Prediction: run the input through all the decision trees and take a vote; the class that receives the most votes is the predicted class.
A decision tree is constructed as follows, where the most informative attribute at each node is found by maximizing the Information Gain:
How to grow a Decision Tree
source : [3]
LearnUnprunedTree(X,Y)
Input: X a matrix of R rows and M columns where Xij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y a vector of R elements, where Yi = the output class of the i'th datapoint. The Yi values are categorical.
Output: An Unpruned decision tree
If all records in X have identical values in all their attributes (this includes the case where R < 2), return a Leaf Node predicting the majority output, breaking ties randomly.
If all values in Y are the same, return a Leaf Node predicting this value as the output
Else
    select m variables at random out of the M variables
    For j = 1 .. m
        If the j'th attribute is categorical
            IG_j = IG(Y|X_j) (see Information Gain)
        Else (the j'th attribute is real-valued)
            IG_j = IG*(Y|X_j) (see Information Gain)
    Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
    If j* is categorical then
        For each value v of the j*'th attribute
            Let X^v = the subset of rows of X in which X_ij* = v. Let Y^v = the corresponding subset of Y
            Let Child^v = LearnUnprunedTree(X^v, Y^v)
        Return a decision tree node, splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v
    Else (j* is real-valued): let t be the best split threshold
        Let X^LO = the subset of rows of X in which X_ij* <= t. Let Y^LO = the corresponding subset of Y
        Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
        Let X^HI = the subset of rows of X in which X_ij* > t. Let Y^HI = the corresponding subset of Y
        Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
        Return a decision tree node, splitting on the j*'th attribute. It has two children corresponding to whether the j*'th attribute is below or above the threshold t.
Note: There are alternatives to Information Gain for splitting nodes
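A minimal Python sketch of LearnUnprunedTree follows. It is my own rendering of the pseudocode above, not an implementation from the quoted sources: the dict-based node representation, the helper names (entropy, split_score, learn_unpruned_tree), and the assumption that categorical columns hold strings and real-valued columns hold numbers are all illustrative choices, and a couple of extra guards (stopping when no candidate attribute yields positive gain) are added to guarantee termination.

import math
import random
from collections import Counter

def entropy(ys):
    # H(Y) = -sum_i p_i log2(p_i), computed from the class frequencies in ys
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def split_score(xs, ys):
    # Returns (information gain, threshold); threshold is None for categorical attributes.
    h_y, n = entropy(ys), len(ys)
    if isinstance(xs[0], str):                       # categorical: IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v)
        cond = 0.0
        for v in set(xs):
            group = [y for x, y in zip(xs, ys) if x == v]
            cond += len(group) / n * entropy(group)
        return h_y - cond, None
    best_gain, best_t = 0.0, None                    # real-valued: IG*(Y|X) = max_t IG(Y|X:t)
    for t in sorted(set(xs))[:-1]:
        lo = [y for x, y in zip(xs, ys) if x <= t]
        hi = [y for x, y in zip(xs, ys) if x > t]
        gain = h_y - (len(lo) / n * entropy(lo) + len(hi) / n * entropy(hi))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

def learn_unpruned_tree(X, Y, m):
    # X: list of R rows (each a list of M attribute values); Y: list of R class labels;
    # m: number of attributes sampled at random at each node (m <= M, and m << M in a random forest).
    if len(set(Y)) == 1 or all(row == X[0] for row in X):
        return {"leaf": Counter(Y).most_common(1)[0][0]}     # leaf: majority (or only) class
    candidates = random.sample(range(len(X[0])), m)          # m variables at random out of M
    scored = [(j,) + split_score([row[j] for row in X], Y) for j in candidates]
    j_star, gain, t = max(scored, key=lambda s: s[1])        # j* = argmax_j IG_j
    if gain <= 0.0:                                          # none of the m candidates separates Y
        return {"leaf": Counter(Y).most_common(1)[0][0]}
    if t is None:                                            # categorical split: one child per value
        children = {}
        for v in set(row[j_star] for row in X):
            rows = [i for i, row in enumerate(X) if row[j_star] == v]
            children[v] = learn_unpruned_tree([X[i] for i in rows], [Y[i] for i in rows], m)
        return {"attr": j_star, "children": children}
    lo = [i for i, row in enumerate(X) if row[j_star] <= t]  # real-valued split at threshold t
    hi = [i for i, row in enumerate(X) if row[j_star] > t]
    return {"attr": j_star, "t": t,
            "lo": learn_unpruned_tree([X[i] for i in lo], [Y[i] for i in lo], m),
            "hi": learn_unpruned_tree([X[i] for i in hi], [Y[i] for i in hi], m)}

Calling learn_unpruned_tree(X, Y, m) repeatedly on bootstrap samples is essentially what the random-forest procedure further below does.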
Note that finding the maximum information gain differs between categorical and real-valued attributes; only the real-valued case is spelled out here: IG*(Y|X_j) = max_t IG(Y|X_j:t), which simultaneously determines the best split threshold t.
entropy = -sum_i p_i log(p_i) (entropy is the H function below). High entropy means the variable has a "boring", near-uniform distribution, i.e. the probabilities of its different values are close to each other; low entropy means the variable has a varied (peaks-and-valleys) distribution, i.e. one or two values are taken with especially high probability. Our goal is to find the split that leaves Y with low entropy: since information gain = H(Y) - H(Y|X) and H(Y) is fixed, the lower H(Y|X) is, the more concentrated the class distribution of Y becomes within each branch, so the more samples the split classifies cleanly, and the earlier that attribute should be chosen as a branching node.
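As a quick numeric illustration (my own toy numbers): a binary variable with P = (0.5, 0.5) has entropy H = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit (the near-uniform, "boring" case), whereas P = (0.9, 0.1) gives H ≈ 0.47 bits (the peaked case); a split whose branches leave Y looking like the second distribution has low H(Y|X) and therefore high information gain.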
Information gain
source : [3]
suppose X can have one of m values V_1, V_2, ..., V_m
P(X=V_1) = p_1, P(X=V_2) = p_2, ..., P(X=V_m) = p_m
H(X) = -sum_{j=1..m} p_j log2(p_j) (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_j p_j H(Y|X=V_j)
IG(Y|X) = H(Y) - H(Y|X)
suppose X is real-valued
define IG(Y|X:t) as H(Y) - H(Y|X:t)
define H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
define IG*(Y|X) = max_t IG(Y|X:t)
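A small worked example of these definitions (my own toy numbers): take 8 records with 4 of class A and 4 of class B, so H(Y) = 1 bit. If a binary attribute X puts all 4 A's in one branch and all 4 B's in the other, then H(Y|X) = 0.5*0 + 0.5*0 = 0 and IG(Y|X) = 1 bit (a perfect split); if instead each branch receives 2 A's and 2 B's, then H(Y|X) = 1 and IG(Y|X) = 0 (a useless split).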
How to grow a Random Forest
source : [1]
Each tree is grown as follows:
- if the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
- if there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
- each tree is grown to the largest extent possible. There is no pruning.
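A short sketch of this bagging-plus-voting loop, assuming scikit-learn's DecisionTreeClassifier as the per-tree learner (its max_features="sqrt" option plays the role of selecting m variables at random at each node, and it grows unpruned trees by default); the grow_forest/forest_predict names and the iris data are illustrative choices, not from the quoted source:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, seed=0):
    # Grow n_trees unpruned trees, each on a bootstrap sample (N cases drawn with replacement)
    # and with a random subset of sqrt(M) variables considered at every node.
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=len(X), replace=True)
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    # Majority vote over the individual tree predictions.
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # shape (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
forest = grow_forest(X, y)
print((forest_predict(forest, X) == y).mean())   # resubstitution accuracy, just a sanity check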
Random Forest parameters
source : [2]
Random Forests are easy to use; the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables.
Breiman's recommendations are to pick a large number of trees, as well as the square root of the number of variables for m.
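For reference, in scikit-learn these two parameters correspond directly to n_estimators and max_features; a minimal usage sketch (the concrete count of 500 trees is just a placeholder):

from sklearn.ensemble import RandomForestClassifier

# Many trees, and m = sqrt(number of variables) considered at each split.
clf = RandomForestClassifier(n_estimators=500, max_features="sqrt")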