Regression Algorithms: Implementation and Application of Linear Regression, Decision Tree Regression, and Regression Forests
1. Estimating with linear regression
1.1 How linear regression works
Linear regression tries to fit as many data points as possible with a straight line in two-dimensional space, a plane in three-dimensional space, and so on. It explores the linear relationship between observations and targets, which is expressed as a linear equation or weighted-sum function. Given a data sample with $n$ features, $x=(x_1,x_2,\ldots,x_n)$, and the weight vector of the linear regression model, $w=(w_1,w_2,\ldots,w_n)$, the target $y$ is expressed as $y = \sum_{i = 1}^{n}w_ix_i$. Sometimes the linear regression model also has an intercept $w_0$, in which case the linear relationship becomes $y = w_0+\sum_{i = 1}^{n}w_ix_i$.
The weight vector $w$ of the linear regression model is learned from the training data, with the goal of minimizing the estimation error, measured here by the mean squared error (MSE): the average of the squared differences between the true and predicted values. For $m$ training samples $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\ldots,(x^{(m)},y^{(m)})$, the cost function $J(w)$ with respect to the weights to be optimized is:
$$J(w)=\frac{1}{2m}\sum_{i = 1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2$$
where $h_w(x^{(i)})$ is the predicted value.
We can use gradient descent to obtain the optimal $w$ that minimizes $J(w)$. The gradient is the first derivative of the cost with respect to each weight, $\frac{\partial J(w)}{\partial w_j}=\frac{1}{m}\sum_{i = 1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$. Combining the gradient with the learning rate $\eta$, the weight vector $w$ is updated at each step by:
$$w := w-\eta\frac{\partial J(w)}{\partial w}$$
After a large number of iterations, the learned $w$ is used to predict a new sample $x'$:
$$\hat{y}=w_0+\sum_{i = 1}^{n}w_ix_i'$$
1.2 Implementing linear regression
Here is the code that implements linear regression:
import numpy as np

def compute_prediction(X, weights):
    """ Compute the prediction y_hat based on current weights
    Args:
        X (numpy.ndarray)
        weights (numpy.ndarray)
    Returns:
        numpy.ndarray, y_hat of X under weights
    """
    predictions = np.dot(X, weights)
    return predictions
def update_weights_gd(X_train, y_train, weights, learning_rate):
    """ Update weights by one step
    Args:
        X_train, y_train (numpy.ndarray, training data set)
        weights (numpy.ndarray)
        learning_rate (float)
    Returns:
        numpy.ndarray, updated weights
    """
    predictions = compute_prediction(X_train, weights)
    weights_delta = np.dot(X_train.T, y_train - predictions)
    m = y_train.shape[0]
    weights += learning_rate / float(m) * weights_delta
    return weights
def compute_cost(X, y, weights):
    """ Compute the cost J(w)
    Args:
        X, y (numpy.ndarray, data set)
        weights (numpy.ndarray)
    Returns:
        float
    """
    predictions = compute_prediction(X, weights)
    cost = np.mean((predictions - y) ** 2 / 2.0)
    return cost
def train_linear_regression(X_train, y_train, max_iter,
                            learning_rate, fit_intercept=False):
    """ Train a linear regression model with gradient descent
    Args:
        X_train, y_train (numpy.ndarray, training data set)
        max_iter (int, number of iterations)
        learning_rate (float)
        fit_intercept (bool, with an intercept w0 or not)
    Returns:
        numpy.ndarray, learned weights
    """
    if fit_intercept:
        intercept = np.ones((X_train.shape[0], 1))
        X_train = np.hstack((intercept, X_train))
    weights = np.zeros(X_train.shape[1])
    for iteration in range(max_iter):
        weights = update_weights_gd(
            X_train, y_train, weights, learning_rate)
        # Check the cost for every 100 (for example) iterations
        if iteration % 100 == 0:
            print(compute_cost(X_train, y_train, weights))
    return weights
def predict(X, weights):
    # Add the intercept column if the model was trained with one
    if X.shape[1] == weights.shape[0] - 1:
        intercept = np.ones((X.shape[0], 1))
        X = np.hstack((intercept, X))
    return compute_prediction(X, weights)
1.3 A linear regression example
Let's test the linear regression model with a small example:
X_train = np.array([[6], [2], [3], [4], [1],
[5], [2], [6], [4], [7]])
y_train = np.array([5.5, 1.6, 2.2, 3.7, 0.8,
5.2, 1.5, 5.3, 4.4, 6.8])
weights = train_linear_regression(X_train, y_train,
max_iter=100, learning_rate=0.01, fit_intercept=True)
X_test = np.array([[1.3], [3.5], [5.2], [2.8]])
predictions = predict(X_test, weights)
import matplotlib.pyplot as plt
plt.scatter(X_train[:, 0], y_train, marker='o', c='b')
plt.scatter(X_test[:, 0], predictions, marker='*', c='k')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
We can also test it on the diabetes dataset from scikit-learn:
from sklearn import datasets
diabetes = datasets.load_diabetes()
print(diabetes.data.shape) # (442, 10)
num_test = 30
X_train = diabetes.data[:-num_test, :]
y_train = diabetes.target[:-num_test]
weights = train_linear_regression(X_train, y_train,
max_iter=5000, learning_rate=1, fit_intercept=True)
X_test = diabetes.data[-num_test:, :]
y_test = diabetes.target[-num_test:]
predictions = predict(X_test, weights)
print(predictions)
print(y_test)
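To put a number on how well the learned weights generalize, we can compare the predictions with the held-out targets. This evaluation step is not part of the original example; it is a minimal sketch that assumes the `predictions` and `y_test` variables from above and uses scikit-learn's metrics:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Compare the predictions on the 30 held-out samples with the true targets
print('MSE: {0:.2f}'.format(mean_squared_error(y_test, predictions)))
print('R^2: {0:.2f}'.format(r2_score(y_test, predictions)))
```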
1.4 Stochastic gradient descent (SGD)
Besides batch gradient descent, linear regression can also be trained with stochastic gradient descent (SGD). We can simply use SGDRegressor from scikit-learn:
from sklearn.linear_model import SGDRegressor
# Note: recent scikit-learn versions rename these parameters to
# loss='squared_error' and max_iter (instead of n_iter).
regressor = SGDRegressor(loss='squared_loss', penalty='l2',
                         alpha=0.0001, learning_rate='constant',
                         eta0=0.01, n_iter=1000)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
1.5 Implementation with TensorFlow
Here is the linear regression implementation using TensorFlow (the 1.x API):
import tensorflow as tf

n_features = int(X_train.shape[1])
learning_rate = 0.5
n_iter = 1000

x = tf.placeholder(tf.float32, shape=[None, n_features])
y = tf.placeholder(tf.float32, shape=[None])
W = tf.Variable(tf.ones([n_features, 1]))
b = tf.Variable(tf.zeros([1]))

pred = tf.add(tf.matmul(x, W), b)[:, 0]
cost = tf.losses.mean_squared_error(labels=y, predictions=pred)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# tf.global_variables_initializer() is the non-deprecated TF 1.x equivalent
init_vars = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_vars)
for i in range(1, n_iter + 1):
    _, c = sess.run([optimizer, cost],
                    feed_dict={x: X_train, y: y_train})
    if i % 100 == 0:
        print('Iteration %i, training loss: %f' % (i, c))
predictions = sess.run(pred, feed_dict={x: X_test})
print(predictions)
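If we also want to inspect the parameters the model learned, we can fetch the weight and bias variables from the same session. This small addition is not part of the original snippet:

```python
# Inspect the parameters learned by the TensorFlow model
W_value, b_value = sess.run([W, b])
print(W_value[:, 0])
print(b_value)
```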
2. Estimating with decision tree regression
2.1 From classification trees to regression trees
In classification, a decision tree is constructed by recursive binary splitting, where each node is split into a left and a right child. At each partition, the tree greedily searches for the most significant combination of a feature and a value as the optimal splitting point, and the quality of a split is measured by the weighted purity of the labels in the two children, using metrics such as Gini impurity or information gain. In regression, tree construction is almost identical, but because the targets become continuous, two things differ:
- The quality of a splitting point is now measured by the weighted mean squared error (MSE) of the two children, where a child's MSE is equivalent to the variance of all its target values; the smaller the weighted MSE, the better the split.
- The average of the target values in a terminal node becomes the leaf value, instead of the majority label as in a classification tree.
2.2 A decision tree regression example
To understand regression trees, let's work through a small house price estimation example:
import numpy as np

def mse(targets):
    # When the set is empty
    if targets.size == 0:
        return 0
    return np.var(targets)

def weighted_mse(groups):
    """ Calculate weighted MSE of children after a split
    Args:
        groups (list of children, and a child consists of a list
        of targets)
    Returns:
        float, weighted impurity
    """
    total = sum(len(group) for group in groups)
    weighted_sum = 0.0
    for group in groups:
        weighted_sum += len(group) / float(total) * mse(group)
    return weighted_sum

print('{0:.4f}'.format(mse(np.array([1, 2, 3]))))
print('{0:.4f}'.format(weighted_mse([np.array([1, 2, 3]),
                                     np.array([1, 2])])))
Suppose we have the following house data:
| House type | Number of bedrooms | Price |
| ---- | ---- | ---- |
| semi | 3 | 600 |
| detached | 2 | 700 |
| detached | 3 | 800 |
| semi | 2 | 400 |
| semi | 4 | 700 |
We need to compute the weighted MSE for every possible feature-value pair:
# The detailed calculation is omitted; the results are as follows
# MSE(type, semi) = 10333
# MSE(bedroom, 2) = 13000
# MSE(bedroom, 3) = 16000
# MSE(bedroom, 4) = 17500
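As a sanity check (not part of the original text), these numbers can be reproduced with the weighted_mse helper defined above by grouping the prices according to whether each sample matches the candidate feature value:

```python
# Reproduce the weighted MSE of each candidate split with weighted_mse above
# type = semi: {600, 400, 700} vs the rest {700, 800}
print(weighted_mse([np.array([600, 400, 700]), np.array([700, 800])]))   # ~10333
# bedroom = 2: {700, 400} vs the rest {600, 800, 700}
print(weighted_mse([np.array([700, 400]), np.array([600, 800, 700])]))   # ~13000
# bedroom = 3: {600, 800} vs the rest {700, 400, 700}
print(weighted_mse([np.array([600, 800]), np.array([700, 400, 700])]))   # ~16000
# bedroom = 4: {700} vs the rest {600, 700, 800, 400}
print(weighted_mse([np.array([700]), np.array([600, 700, 800, 400])]))   # ~17500
```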
The lowest MSE is achieved by the (type, semi) pair, so the root node is formed by this splitting point. If we are satisfied with a regression tree that is one level deep, we can stop here and make both branches leaf nodes, each valued at the average of the target values of the samples it contains. Alternatively, we can keep going and build a second level from the right branch:
# The MSE calculations for the second-level split are omitted; the results are:
# MSE(bedroom, 2) = 15556
# MSE(bedroom, 3) = 1667
# MSE(bedroom, 4) = 6667
The second splitting point is specified by the (bedroom, 3) pair, since it has the smallest MSE.
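These second-level values can again be checked with weighted_mse on the three semi samples in that branch (an added sanity check, not in the original text; here the candidate value acts as a >= threshold, which is how the implementation below splits numerical features):

```python
# Second level: only the semi samples, with (bedroom, price) = (3, 600), (2, 400), (4, 700)
# bedroom >= 2: all three samples fall on one side
print(weighted_mse([np.array([]), np.array([600, 400, 700])]))   # ~15556
# bedroom >= 3: {600, 700} vs {400}
print(weighted_mse([np.array([400]), np.array([600, 700])]))     # ~1667
# bedroom >= 4: {700} vs {600, 400}
print(weighted_mse([np.array([600, 400]), np.array([700])]))     # ~6667
```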
2.3 Implementing decision tree regression
Here is the implementation of decision tree regression:
def split_node(X, y, index, value):
    """ Split data set X, y based on a feature and a value
    Args:
        X, y (numpy.ndarray, data set)
        index (int, index of the feature used for splitting)
        value (value of the feature used for splitting)
    Returns:
        list, list: left and right child, a child is in the
        format of [X, y]
    """
    x_index = X[:, index]
    # if this feature is numerical
    if type(X[0, index]) in [int, float]:
        mask = x_index >= value
    # if this feature is categorical
    else:
        mask = x_index == value
    # split into left and right child
    left = [X[~mask, :], y[~mask]]
    right = [X[mask, :], y[mask]]
    return left, right
def get_best_split(X, y):
    """ Obtain the best splitting point and resulting children
    for the data set X, y
    Args:
        X, y (numpy.ndarray, data set)
    Returns:
        dict {index: index of the feature, value: feature
        value, children: left and right children}
    """
    best_index, best_value, best_score, children = \
        None, None, 1e10, None
    for index in range(len(X[0])):
        for value in np.sort(np.unique(X[:, index])):
            groups = split_node(X, y, index, value)
            impurity = weighted_mse([groups[0][1],
                                     groups[1][1]])
            if impurity < best_score:
                best_index, best_value, best_score, children = \
                    index, value, impurity, groups
    return {'index': best_index, 'value': best_value,
            'children': children}
def get_leaf(targets):
    # Obtain the leaf as the mean of the targets
    return np.mean(targets)
def split(node, max_depth, min_size, depth):
    """ Split children of a node to construct new nodes or
    assign them terminals
    Args:
        node (dict, with children info)
        max_depth (int, maximal depth of the tree)
        min_size (int, minimal samples required to further
        split a child)
        depth (int, current depth of the node)
    """
    left, right = node['children']
    del (node['children'])
    if left[1].size == 0:
        node['right'] = get_leaf(right[1])
        return
    if right[1].size == 0:
        node['left'] = get_leaf(left[1])
        return
    # Check if the current depth exceeds the maximal depth
    if depth >= max_depth:
        node['left'], node['right'] = \
            get_leaf(left[1]), get_leaf(right[1])
        return
    # Check if the left child has enough samples
    if left[1].size <= min_size:
        node['left'] = get_leaf(left[1])
    else:
        # It has enough samples, we further split it
        result = get_best_split(left[0], left[1])
        result_left, result_right = result['children']
        if result_left[1].size == 0:
            node['left'] = get_leaf(result_right[1])
        elif result_right[1].size == 0:
            node['left'] = get_leaf(result_left[1])
        else:
            node['left'] = result
            split(node['left'], max_depth, min_size,
                  depth + 1)
    # Check if the right child has enough samples
    if right[1].size <= min_size:
        node['right'] = get_leaf(right[1])
    else:
        # It has enough samples, we further split it
        result = get_best_split(right[0], right[1])
        result_left, result_right = result['children']
        if result_left[1].size == 0:
            node['right'] = get_leaf(result_right[1])
        elif result_right[1].size == 0:
            node['right'] = get_leaf(result_left[1])
        else:
            node['right'] = result
            split(node['right'], max_depth, min_size,
                  depth + 1)
def train_tree(X_train, y_train, max_depth, min_size):
    """ Construction of a tree starts here
    Args:
        X_train, y_train (list, list, training data)
        max_depth (int, maximal depth of the tree)
        min_size (int, minimal samples required to further
        split a child)
    """
    root = get_best_split(X_train, y_train)
    split(root, max_depth, min_size, 1)
    return root
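The original snippet stops at training. To query the trained tree for a new sample, a small traversal helper can be added; this is a sketch, not part of the original code, and it assumes the node dictionary layout produced by train_tree above (it mirrors the type check used in split_node):

```python
def predict_tree(node, x):
    """Traverse the trained tree dict and return the leaf value for sample x."""
    if not isinstance(node, dict):
        # A leaf stores the mean target value directly
        return node
    feature_value = x[node['index']]
    if type(feature_value) in [int, float]:
        # Numerical feature: samples with value >= threshold went right
        branch = 'right' if feature_value >= node['value'] else 'left'
    else:
        # Categorical feature: matching samples went right
        branch = 'right' if feature_value == node['value'] else 'left'
    if branch not in node:
        # One side was empty during training; fall back to the other branch
        branch = 'left' if branch == 'right' else 'right'
    return predict_tree(node[branch], x)
```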
2.4 Testing the decision tree regression implementation
Let's test it with the house data from above:
X_train = np.array([['semi', 3],
['detached', 2],
['detached', 3],
['semi', 2],
['semi', 4]], dtype=object)
y_train = np.array([600, 700, 800, 400, 700])
tree = train_tree(X_train, y_train, 2, 2)
CONDITION = {'numerical': {'yes': '>=', 'no': '<'},
             'categorical': {'yes': 'is', 'no': 'is not'}}

def visualize_tree(node, depth=0):
    if isinstance(node, dict):
        if type(node['value']) in [int, float]:
            condition = CONDITION['numerical']
        else:
            condition = CONDITION['categorical']
        print('{}|- X{} {} {}'.format(depth * ' ',
              node['index'] + 1, condition['no'], node['value']))
        if 'left' in node:
            visualize_tree(node['left'], depth + 1)
        print('{}|- X{} {} {}'.format(depth * ' ',
              node['index'] + 1, condition['yes'], node['value']))
        if 'right' in node:
            visualize_tree(node['right'], depth + 1)
    else:
        print('{}[{}]'.format(depth * ' ', node))

visualize_tree(tree)
2.5 Implementation with scikit-learn
Let's use DecisionTreeRegressor from scikit-learn to predict Boston house prices:
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2; with recent versions,
# another dataset such as fetch_california_housing can be used instead.
boston = datasets.load_boston()
num_test = 10    # the last 10 samples as testing set
X_train = boston.data[:-num_test, :]
y_train = boston.target[:-num_test]
X_test = boston.data[-num_test:, :]
y_test = boston.target[-num_test:]
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=10,
                                  min_samples_split=3)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
print(y_test)
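To put a number on the prediction quality (an added step, not in the original), we can compare the predictions against y_test with scikit-learn's mean_squared_error:

```python
from sklearn.metrics import mean_squared_error

# Average squared error over the 10 held-out samples
print('MSE: {0:.2f}'.format(mean_squared_error(y_test, predictions)))
```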
3. Implementing regression forests
3.1 How regression forests work
In Chapter 6, we introduced random forest as an ensemble learning method that combines multiple decision trees trained separately, with the training features randomly subsampled at each node of every tree. In classification, a random forest makes its final decision by majority vote over the decisions of all trees. In regression, a random forest regression model (also called a regression forest) takes the average of the regression results from all decision trees as its final decision.
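To make the averaging idea concrete, here is a minimal sketch (not from the original text) that trains a few scikit-learn decision trees on bootstrap samples and averages their predictions; the function name and parameters such as n_trees are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_regression_forest(X_train, y_train, X_test, n_trees=10, seed=42):
    """A toy regression forest: average the predictions of trees
    fitted on bootstrap samples of the training data."""
    rng = np.random.RandomState(seed)
    all_predictions = []
    for _ in range(n_trees):
        # Draw a bootstrap sample (sampling with replacement)
        indices = rng.choice(len(X_train), size=len(X_train), replace=True)
        tree = DecisionTreeRegressor(max_depth=10, min_samples_split=3)
        tree.fit(X_train[indices], y_train[indices])
        all_predictions.append(tree.predict(X_test))
    # The forest prediction is the average over all trees
    return np.mean(all_predictions, axis=0)
```

A real random forest additionally subsamples the features considered at each split (the max_features parameter), which RandomForestRegressor below handles for us.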
3.2 Implementation with scikit-learn
Let's use RandomForestRegressor from scikit-learn to predict Boston house prices:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100,
max_depth=10, min_samples_split=3)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
3.3 Implementation with TensorFlow
Here is the regression forest implementation using TensorFlow (the 1.x contrib tensor_forest module):
import tensorflow as tf
from tensorflow.contrib.tensor_forest.python import tensor_forest
from tensorflow.python.ops import resources

n_iter = 20
n_features = int(X_train.shape[1])
n_trees = 10
max_nodes = 30000

x = tf.placeholder(tf.float32, shape=[None, n_features])
y = tf.placeholder(tf.float32, shape=[None])

hparams = tensor_forest.ForestHParams(num_classes=1,
                                      regression=True,
                                      num_features=n_features,
                                      num_trees=n_trees,
                                      max_nodes=max_nodes,
                                      split_after_samples=30).fill()
forest_graph = tensor_forest.RandomForestGraphs(hparams)

train_op = forest_graph.training_graph(x, y)
loss_op = forest_graph.training_loss(x, y)
infer_op, _, _ = forest_graph.inference_graph(x)
cost = tf.losses.mean_squared_error(labels=y, predictions=infer_op[:, 0])

init_vars = tf.group(tf.global_variables_initializer(),
                     tf.local_variables_initializer(),
                     resources.initialize_resources(resources.shared_resources()))
sess = tf.Session()
sess.run(init_vars)
for i in range(1, n_iter + 1):
    _, c = sess.run([train_op, cost], feed_dict={x: X_train, y: y_train})
    print('Iteration %i, training loss: %f' % (i, c))
pred = sess.run(infer_op, feed_dict={x: X_test})[:, 0]
print(pred)
So far, we have covered the principles, implementations, and applications of three regression algorithms: linear regression, decision tree regression, and regression forest. Linear regression learns its weights via gradient descent or stochastic gradient descent; decision tree regression builds a tree by recursive binary splitting and uses weighted MSE to select the best splitting point; a regression forest ensembles the results of multiple decision trees. Each algorithm has its own strengths in different scenarios, and the right one can be chosen based on the problem at hand.
4. Comparing and choosing among the algorithms
4.1 Performance comparison
To compare linear regression, decision tree regression, and regression forests more directly, we can analyze them along the following dimensions:
| Algorithm | Strengths | Weaknesses | Suitable scenarios |
| ---- | ---- | ---- | ---- |
| Linear regression | Simple principle, computationally efficient, highly interpretable | Poor at fitting nonlinear relationships | Data with a roughly linear relationship, where results are needed quickly and interpretability matters |
| Decision tree regression | Handles nonlinear relationships, needs little data preprocessing, fairly interpretable | Prone to overfitting, unstable | Data with complex nonlinear relationships, where some interpretability is still required |
| Regression forest | Handles nonlinear relationships well, reduces overfitting, stable | Higher computational cost, relatively poor interpretability | Data with complex nonlinear relationships, where accuracy and stability matter more than interpretability |
4.2 Recommendations for choosing
In practice, we can choose a suitable regression algorithm by working through the following steps:
1. **Analyze the data**: look at the distribution and the features and judge whether the data shows a linear relationship. If it is roughly linear, linear regression may be a good choice; if the relationships are complex and nonlinear, consider decision tree regression or a regression forest.
2. **Weigh model complexity against interpretability**: if interpretability matters and you want to see clearly how each feature affects the outcome, linear regression or decision tree regression may be more appropriate; if accuracy and stability matter more than interpretability, a regression forest is the better choice.
3. **Account for compute and time constraints**: linear regression has low computational complexity and trains quickly, so it suits situations where resources are limited or time is tight; decision tree regression has moderate complexity and trains fairly fast; a regression forest is the most expensive to train and needs more compute.
5. Summary and outlook
5.1 Summary
This article covered the principles, implementations, and applications of three regression algorithms: linear regression, decision tree regression, and regression forest. Linear regression learns its weights via gradient descent or stochastic gradient descent and suits data with linear relationships; decision tree regression builds a tree by recursive binary splitting and uses weighted MSE to select the best splitting point, so it can handle nonlinear relationships; a regression forest ensembles multiple decision trees, which effectively reduces overfitting and improves accuracy and stability.
5.2 Outlook
As machine learning and artificial intelligence continue to evolve, regression algorithms keep being refined. Looking ahead, we can expect progress in several directions:
- **More efficient algorithms**: researchers will keep exploring more efficient regression algorithms to improve training speed and predictive accuracy.
- **Further development of ensemble learning**: ensembles already work well for regression, and more ensemble methods are likely to be proposed that push performance further.
- **Integration with deep learning**: deep learning is powerful at handling complex data, and combining regression algorithms with deep learning may yield stronger models.
- **Broader applications**: regression algorithms will be applied in more domains, such as healthcare, finance, and transportation, providing more effective tools for real-world problems.
In short, regression algorithms occupy an important place in machine learning, and continuing to learn and master them will help us solve practical problems better. We hope this article is helpful and sparks the reader's interest in regression algorithms.
6. Flowcharts
6.1 Linear regression training flow
graph TD;
A[Initialize weights] --> B[Compute predictions];
B --> C[Compute the cost function];
C --> D[Compute the gradient];
D --> E[Update the weights];
E --> F{Maximum iterations reached?};
F -- No --> B;
F -- Yes --> G[Output the final weights];
6.2 Decision tree regression construction flow
graph TD;
A[Start] --> B[Find the best splitting point];
B --> C[Split the node];
C --> D{Stopping criterion met?};
D -- Yes --> E[Make it a leaf node];
D -- No --> F[Recursively split the left and right children];
F --> B;
E --> G[End];
6.3 Regression forest training flow
graph TD;
A[Start] --> B[Initialize parameters];
B --> C[Draw a bootstrap sample and train a decision tree];
C --> D{All trees trained?};
D -- No --> C;
D -- Yes --> E[Average the predictions of all trees];
E --> F[Output the final model];
These flowcharts give a clearer picture of how the three regression algorithms are trained and constructed, which helps in understanding and applying them.