Regression Algorithms: Implementation and Application of Linear Regression, Decision Tree Regression, and Regression Forests
1. Estimating with linear regression
1.1 How linear regression works
Linear regression tries to fit as many data points as possible with a straight line in two-dimensional space, a plane in three-dimensional space, and so on. It explores the linear relationship between observations and targets, which is expressed as a linear equation or weighted-sum function. Given a data sample with $n$ features, $x=(x_1,x_2,\ldots,x_n)$, and the weight vector of the linear regression model, $w=(w_1,w_2,\ldots,w_n)$, the target $y$ is expressed as $y = \sum_{i = 1}^{n}w_ix_i$. Sometimes the linear regression model also has an intercept $w_0$, in which case the linear relationship becomes $y = w_0+\sum_{i = 1}^{n}w_ix_i$.
The weight vector $w$ of the linear regression model is learned from the training data, with the goal of minimizing the estimation error, measured here by the mean squared error (MSE): the average of the squared differences between the true and predicted values. For $m$ training samples $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\ldots,(x^{(m)},y^{(m)})$, the cost function $J(w)$ with respect to the weights to be optimized is:
$$J(w)=\frac{1}{2m}\sum_{i = 1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2$$
where $h_w(x^{(i)})$ is the predicted value.
We can use gradient descent to obtain the optimal $w$ that minimizes $J(w)$. The gradient is the first derivative of the cost with respect to each weight, $\frac{\partial J(w)}{\partial w_j}=\frac{1}{m}\sum_{i = 1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$. Combining the gradient with the learning rate $\eta$, the weight vector $w$ is updated at each step by:
$$w := w-\eta\frac{\partial J(w)}{\partial w}$$
After a large number of iterations, the learned $w$ is used to predict a new sample $x'$:
$$\hat{y}=w_0+\sum_{i = 1}^{n}w_ix_i'$$
1.2 Implementing linear regression
Here is the code that implements linear regression:
import numpy as np

def compute_prediction(X, weights):
    """ Compute the prediction y_hat based on current weights
    Args:
        X (numpy.ndarray)
        weights (numpy.ndarray)
    Returns:
        numpy.ndarray, y_hat of X under weights
    """
    predictions = np.dot(X, weights)
    return predictions
def update_weights_gd(X_train, y_train, weights, learning_rate):
    """ Update weights by one step
    Args:
        X_train, y_train (numpy.ndarray, training data set)
        weights (numpy.ndarray)
        learning_rate (float)
    Returns:
        numpy.ndarray, updated weights
    """
    predictions = compute_prediction(X_train, weights)
    weights_delta = np.dot(X_train.T, y_train - predictions)
    m = y_train.shape[0]
    weights += learning_rate / float(m) * weights_delta
    return weights
def compute_cost(X, y, weights):
    """ Compute the cost J(w)
    Args:
        X, y (numpy.ndarray, data set)
        weights (numpy.ndarray)
    Returns:
        float
    """
    predictions = compute_prediction(X, weights)
    cost = np.mean((predictions - y) ** 2 / 2.0)
    return cost
def train_linear_regression(X_train, y_train, max_iter,
                            learning_rate, fit_intercept=False):
    """ Train a linear regression model with gradient descent
    Args:
        X_train, y_train (numpy.ndarray, training data set)
        max_iter (int, number of iterations)
        learning_rate (float)
        fit_intercept (bool, with an intercept w0 or not)
    Returns:
        numpy.ndarray, learned weights
    """
    if fit_intercept:
        intercept = np.ones((X_train.shape[0], 1))
        X_train = np.hstack((intercept, X_train))
    weights = np.zeros(X_train.shape[1])
    for iteration in range(max_iter):
        weights = update_weights_gd(
            X_train, y_train, weights, learning_rate)
        # Check the cost for every 100 (for example) iterations
        if iteration % 100 == 0:
            print(compute_cost(X_train, y_train, weights))
    return weights
def predict(X, weights):
    # Add the intercept column if the model was trained with one
    if X.shape[1] == weights.shape[0] - 1:
        intercept = np.ones((X.shape[0], 1))
        X = np.hstack((intercept, X))
    return compute_prediction(X, weights)
1.3 A linear regression example
Let's test the linear regression model with a small example:
X_train = np.array([[6], [2], [3], [4], [1],
[5], [2], [6], [4], [7]])
y_train = np.array([5.5, 1.6, 2.2, 3.7, 0.8,
5.2, 1.5, 5.3, 4.4, 6.8])
weights = train_linear_regression(X_train, y_train,
max_iter=100, learning_rate=0.01, fit_intercept=True)
X_test = np.array([[1.3], [3.5], [5.2], [2.8]])
predictions = predict(X_test, weights)
import matplotlib.pyplot as plt
plt.scatter(X_train[:, 0], y_train, marker='o', c='b')
plt.scatter(X_test[:, 0], predictions, marker='*', c='k')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
We can also test it on the diabetes dataset from scikit-learn:
from sklearn import datasets
diabetes = datasets.load_diabetes()
print(diabetes.data.shape) # (442, 10)
num_test = 30
X_train = diabetes.data[:-num_test, :]
y_train = diabetes.target[:-num_test]
weights = train_linear_regression(X_train, y_train,
max_iter=5000, learning_rate=1, fit_intercept=True)
X_test = diabetes.data[-num_test:, :]
y_test = diabetes.target[-num_test:]
predictions = predict(X_test, weights)
print(predictions)
print(y_test)
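To put a number on how well the learned weights generalize, we can compare the predictions with the held-out targets. This evaluation step is not part of the original example; it is a minimal sketch that assumes the `predictions` and `y_test` variables from above and uses scikit-learn's metrics:

```python
from sklearn.metrics import mean_squared_error, r2_score

# Compare the predictions on the 30 held-out samples with the true targets
print('MSE: {0:.2f}'.format(mean_squared_error(y_test, predictions)))
print('R^2: {0:.2f}'.format(r2_score(y_test, predictions)))
```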
1.4 Stochastic gradient descent (SGD)
Besides batch gradient descent, linear regression can also be trained with stochastic gradient descent (SGD). We can simply use SGDRegressor from scikit-learn:
from sklearn.linear_model import SGDRegressor
# Note: recent scikit-learn versions rename these parameters to
# loss='squared_error' and max_iter (instead of n_iter).
regressor = SGDRegressor(loss='squared_loss', penalty='l2',
                         alpha=0.0001, learning_rate='constant',
                         eta0=0.01, n_iter=1000)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
1.5 Implementation with TensorFlow
Here is the linear regression implementation using TensorFlow (the 1.x API):
import tensorflow as tf

n_features = int(X_train.shape[1])
learning_rate = 0.5
n_iter = 1000

x = tf.placeholder(tf.float32, shape=[None, n_features])
y = tf.placeholder(tf.float32, shape=[None])
W = tf.Variable(tf.ones([n_features, 1]))
b = tf.Variable(tf.zeros([1]))

pred = tf.add(tf.matmul(x, W), b)[:, 0]
cost = tf.losses.mean_squared_error(labels=y, predictions=pred)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# tf.global_variables_initializer() is the non-deprecated TF 1.x equivalent
init_vars = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init_vars)
for i in range(1, n_iter + 1):
    _, c = sess.run([optimizer, cost],
                    feed_dict={x: X_train, y: y_train})
    if i % 100 == 0:
        print('Iteration %i, training loss: %f' % (i, c))
predictions = sess.run(pred, feed_dict={x: X_test})
print(predictions)
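If we also want to inspect the parameters the model learned, we can fetch the weight and bias variables from the same session. This small addition is not part of the original snippet:

```python
# Inspect the parameters learned by the TensorFlow model
W_value, b_value = sess.run([W, b])
print(W_value[:, 0])
print(b_value)
```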
2. Estimating with decision tree regression
2.1 From classification trees to regression trees
In classification, a decision tree is constructed by recursive binary splitting, where each node is split into a left and a right child. At each partition, the tree greedily searches for the most significant combination of a feature and a value as the optimal splitting point, and the quality of a split is measured by the weighted purity of the labels in the two children, using metrics such as Gini impurity or information gain. In regression, tree construction is almost identical, but because the targets become continuous, two things differ:
- The quality of a splitting point is now measured by the weighted mean squared error (MSE) of the two children, where a child's MSE is equivalent to the variance of all its target values; the smaller the weighted MSE, the better the split.
- The average of the target values in a terminal node becomes the leaf value, instead of the majority label as in a classification tree.
2.2 A decision tree regression example
To understand regression trees, let's work through a small house price estimation example:
import numpy as np

def mse(targets):
    # When the set is empty
    if targets.size == 0:
        return 0
    return np.var(targets)

def weighted_mse(groups):
    """ Calculate weighted MSE of children after a split
    Args:
        groups (list of children, and a child consists of a list
        of targets)
    Returns:
        float, weighted impurity
    """
    total = sum(len(group) for group in groups)
    weighted_sum = 0.0
    for group in groups:
        weighted_sum += len(group) / float(total) * mse(group)
    return weighted_sum

print('{0:.4f}'.format(mse(np.array([1, 2, 3]))))
print('{0:.4f}'.format(weighted_mse([np.array([1, 2, 3]),
                                     np.array([1, 2])])))
Suppose we have the following house data:
| House type | Number of bedrooms | Price |
| ---- | ---- | ---- |
| semi | 3 | 600 |
| detached | 2 | 700 |
| detached | 3 | 800 |
| semi | 2 | 400 |
| semi | 4 | 700 |
We need to compute the weighted MSE for every possible feature-value pair:
# The detailed calculation is omitted; the results are as follows
# MSE(type, semi) = 10333
# MSE(bedroom, 2) = 13000
# MSE(bedroom, 3) = 16000
# MSE(bedroom, 4) = 17500
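As a sanity check (not part of the original text), these numbers can be reproduced with the weighted_mse helper defined above by grouping the prices according to whether each sample matches the candidate feature value:

```python
# Reproduce the weighted MSE of each candidate split with weighted_mse above
# type = semi: {600, 400, 700} vs the rest {700, 800}
print(weighted_mse([np.array([600, 400, 700]), np.array([700, 800])]))   # ~10333
# bedroom = 2: {700, 400} vs the rest {600, 800, 700}
print(weighted_mse([np.array([700, 400]), np.array([600, 800, 700])]))   # ~13000
# bedroom = 3: {600, 800} vs the rest {700, 400, 700}
print(weighted_mse([np.array([600, 800]), np.array([700, 400, 700])]))   # ~16000
# bedroom = 4: {700} vs the rest {600, 700, 800, 400}
print(weighted_mse([np.array([700]), np.array([600, 700, 800, 400])]))   # ~17500
```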
The lowest MSE is achieved by the (type, semi) pair, so the root node is formed by this splitting point. If we are satisfied with a regression tree that is one level deep, we can stop here and make both branches leaf nodes, each valued at the average of the target values of the samples it contains. Alternatively, we can keep going and build a second level from the right branch:
# The MSE calculations for the second-level split are omitted; the results are:
# MSE(bedroom, 2) = 15556
# MSE(bedroom, 3) = 1667
# MSE(bedroom, 4) = 6667
The second splitting point is specified by the (bedroom, 3) pair, since it has the smallest MSE.
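These second-level values can again be checked with weighted_mse on the three semi samples in that branch (an added sanity check, not in the original text; here the candidate value acts as a >= threshold, which is how the implementation below splits numerical features):

```python
# Second level: only the semi samples, with (bedroom, price) = (3, 600), (2, 400), (4, 700)
# bedroom >= 2: all three samples fall on one side
print(weighted_mse([np.array([]), np.array([600, 400, 700])]))   # ~15556
# bedroom >= 3: {600, 700} vs {400}
print(weighted_mse([np.array([400]), np.array([600, 700])]))     # ~1667
# bedroom >= 4: {700} vs {600, 400}
print(weighted_mse([np.array([600, 400]), np.array([700])]))     # ~6667
```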
2.3 Implementing decision tree regression
Here is the implementation of decision tree regression:
def split_node(X, y, index, value):
    """ Split data set X, y based on a feature and a value
    Args:
        X, y (numpy.ndarray, data set)
        index (int, index of the feature used for splitting)
        value (value of the feature used for splitting)
    Returns:
        list, list: left and right child, a child is in the
        format of [X, y]
    """
    x_index = X[:, index]
    # if this feature is numerical
    if type(X[0, index]) in [int, float]:
        mask = x_index >= value
    # if this feature is categorical
    else:
        mask = x_index == value
    # split into left and right child
    left = [X[~mask, :], y[~mask]]
    right = [X[mask, :], y[mask]]
    return left, right
def get_best_split(X, y):
    """ Obtain the best splitting point and resulting children
    for the data set X, y
    Args:
        X, y (numpy.ndarray, data set)
    Returns:
        dict {index: index of the feature, value: feature
        value, children: left and right children}
    """
    best_index, best_value, best_score, children = \
        None, None, 1e10, None
    for index in range(len(X[0])):
        for value in np.sort(np.unique(X[:, index])):
            groups = split_node(X, y, index, value)
            impurity = weighted_mse([groups[0][1],
                                     groups[1][1]])
            if impurity < best_score:
                best_index, best_value, best_score, children = \
                    index, value, impurity, groups
    return {'index': best_index, 'value': best_value,
            'children': children}
def get_leaf(targets):
    # Obtain the leaf as the mean of the targets
    return np.mean(targets)
def split(node, max_depth, min_size, depth):
    """ Split children of a node to construct new nodes or
    assign them terminals
    Args:
        node (dict, with children info)
        max_depth (int, maximal depth of the tree)
        min_size (int, minimal samples required to further
        split a child)
        depth (int, current depth of the node)
    """
    left, right = node['children']
    del (node['children'])
    if left[1].size == 0:
        node['right'] = get_leaf(right[1])
        return
    if right[1].size == 0:
        node['left'] = get_leaf(left[1])
        return
    # Check if the current depth exceeds the maximal depth
    if depth >= max_depth:
        node['left'], node['right'] = \
            get_leaf(left[1]), get_leaf(right[1])
        return
    # Check if the left child has enough samples
    if left[1].size <= min_size:
        node['left'] = get_leaf(left[1])
    else:
        # It has enough samples, we further split it
        result = get_best_split(left[0], left[1])
        result_left, result_right = result['children']
        if result_left[1].size == 0:
            node['left'] = get_leaf(result_right[1])
        elif result_right[1].size == 0:
            node['left'] = get_leaf(result_left[1])
        else:
            node['left'] = result
            split(node['left'], max_depth, min_size,
                  depth + 1)
    # Check if the right child has enough samples
    if right[1].size <= min_size:
        node['right'] = get_leaf(right[1])
    else:
        # It has enough samples, we further split it
        result = get_best_split(right[0], right[1])
        result_left, result_right = result['children']
        if result_left[1].size == 0:
            node['right'] = get_leaf(result_right[1])
        elif result_right[1].size == 0:
            node['right'] = get_leaf(result_left[1])
        else:
            node['right'] = result
            split(node['right'], max_depth, min_size,
                  depth + 1)
def train_tree(X_train, y_train, max_depth, min_size):
    """ Construction of a tree starts here
    Args:
        X_train, y_train (list, list, training data)
        max_depth (int, maximal depth of the tree)
        min_size (int, minimal samples required to further
        split a child)
    """
    root = get_best_split(X_train, y_train)
    split(root, max_depth, min_size, 1)
    return root
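The original snippet stops at training. To query the trained tree for a new sample, a small traversal helper can be added; this is a sketch, not part of the original code, and it assumes the node dictionary layout produced by train_tree above (it mirrors the type check used in split_node):

```python
def predict_tree(node, x):
    """Traverse the trained tree dict and return the leaf value for sample x."""
    if not isinstance(node, dict):
        # A leaf stores the mean target value directly
        return node
    feature_value = x[node['index']]
    if type(feature_value) in [int, float]:
        # Numerical feature: samples with value >= threshold went right
        branch = 'right' if feature_value >= node['value'] else 'left'
    else:
        # Categorical feature: matching samples went right
        branch = 'right' if feature_value == node['value'] else 'left'
    if branch not in node:
        # One side was empty during training; fall back to the other branch
        branch = 'left' if branch == 'right' else 'right'
    return predict_tree(node[branch], x)
```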
2.4 Testing the decision tree regression implementation
Let's test it with the house data from above:
X_train = np.array([['semi', 3],
['detached', 2],
['detached', 3],
['semi', 2],
['semi', 4]], dtype=object)
y_train = np.array([600, 700, 800, 400, 700])
tree = train_tree(X_train, y_train, 2, 2)
CONDITION = {'numerical': {'yes': '>=', 'no': '<'},
             'categorical': {'yes': 'is', 'no': 'is not'}}

def visualize_tree(node, depth=0):
    if isinstance(node, dict):
        if type(node['value']) in [int, float]:
            condition = CONDITION['numerical']
        else:
            condition = CONDITION['categorical']
        print('{}|- X{} {} {}'.format(depth * ' ',
              node['index'] + 1, condition['no'], node['value']))
        if 'left' in node:
            visualize_tree(node['left'], depth + 1)
        print('{}|- X{} {} {}'.format(depth * ' ',
              node['index'] + 1, condition['yes'], node['value']))
        if 'right' in node:
            visualize_tree(node['right'], depth + 1)
    else:
        print('{}[{}]'.format(depth * ' ', node))

visualize_tree(tree)
2.5 Implementation with scikit-learn
Let's use DecisionTreeRegressor from scikit-learn to predict Boston house prices:
from sklearn import datasets
# Note: load_boston was removed in scikit-learn 1.2; with recent versions,
# another dataset such as fetch_california_housing can be used instead.
boston = datasets.load_boston()
num_test = 10    # the last 10 samples as testing set
X_train = boston.data[:-num_test, :]
y_train = boston.target[:-num_test]
X_test = boston.data[-num_test:, :]
y_test = boston.target[-num_test:]
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=10,
                                  min_samples_split=3)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
print(y_test)
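To put a number on the prediction quality (an added step, not in the original), we can compare the predictions against y_test with scikit-learn's mean_squared_error:

```python
from sklearn.metrics import mean_squared_error

# Average squared error over the 10 held-out samples
print('MSE: {0:.2f}'.format(mean_squared_error(y_test, predictions)))
```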
3. Implementing regression forests
3.1 How regression forests work
In Chapter 6, we introduced random forest as an ensemble learning method that combines multiple decision trees trained separately, with the training features randomly subsampled at each node of every tree. In classification, a random forest makes its final decision by majority vote over the decisions of all trees. In regression, a random forest regression model (also called a regression forest) takes the average of the regression results from all decision trees as its final decision.
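To make the averaging idea concrete, here is a minimal sketch (not from the original text) that trains a few scikit-learn decision trees on bootstrap samples and averages their predictions; the function name and parameters such as n_trees are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_regression_forest(X_train, y_train, X_test, n_trees=10, seed=42):
    """A toy regression forest: average the predictions of trees
    fitted on bootstrap samples of the training data."""
    rng = np.random.RandomState(seed)
    all_predictions = []
    for _ in range(n_trees):
        # Draw a bootstrap sample (sampling with replacement)
        indices = rng.choice(len(X_train), size=len(X_train), replace=True)
        tree = DecisionTreeRegressor(max_depth=10, min_samples_split=3)
        tree.fit(X_train[indices], y_train[indices])
        all_predictions.append(tree.predict(X_test))
    # The forest prediction is the average over all trees
    return np.mean(all_predictions, axis=0)
```

A real random forest additionally subsamples the features considered at each split (the max_features parameter), which RandomForestRegressor below handles for us.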
3.2 Implementation with scikit-learn
Let's use RandomForestRegressor from scikit-learn to predict Boston house prices:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100,
max_depth=10, min_samples_split=3)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
3.3 Implementation with TensorFlow
Here is the regression forest implementation using TensorFlow (the 1.x contrib tensor_forest module):
import tensorflow as tf
from tensorflow.contrib.tensor_forest.python import tensor_forest
from tensorflow.python.ops import resources

n_iter = 20
n_features = int(X_train.shape[1])
n_trees = 10
max_nodes = 30000

x = tf.placeholder(tf.float32, shape=[None, n_features])
y = tf.placeholder(tf.float32, shape=[None])

hparams = tensor_forest.ForestHParams(num_classes=1,
                                      regression=True,
                                      num_features=n_features,
                                      num_trees=n_trees,
                                      max_nodes=max_nodes,
                                      split_after_samples=30).fill()
forest_graph = tensor_forest.RandomForestGraphs(hparams)

train_op = forest_graph.training_graph(x, y)
loss_op = forest_graph.training_loss(x, y)
infer_op, _, _ = forest_graph.inference_graph(x)
cost = tf.losses.mean_squared_error(labels=y, predictions=infer_op[:, 0])

init_vars = tf.group(tf.global_variables_initializer(),
                     tf.local_variables_initializer(),
                     resources.initialize_resources(resources.shared_resources()))
sess = tf.Session()
sess.run(init_vars)
for i in range(1, n_iter + 1):
    _, c = sess.run([train_op, cost], feed_dict={x: X_train, y: y_train})
    print('Iteration %i, training loss: %f' % (i, c))
pred = sess.run(infer_op, feed_dict={x: X_test})[:, 0]
print(pred)
So far, we have covered the principles, implementations, and applications of three regression algorithms: linear regression, decision tree regression, and regression forest. Linear regression learns its weights via gradient descent or stochastic gradient descent; decision tree regression builds a tree by recursive binary splitting and uses weighted MSE to select the best splitting point; a regression forest ensembles the results of multiple decision trees. Each algorithm has its own strengths in different scenarios, and the right one can be chosen based on the problem at hand.
4. Comparing and choosing among the algorithms
4.1 Performance comparison
To compare linear regression, decision tree regression, and regression forests more directly, we can analyze them along the following dimensions:
| Algorithm | Strengths | Weaknesses | Suitable scenarios |
| ---- | ---- | ---- | ---- |
| Linear regression | Simple principle, computationally efficient, highly interpretable | Poor at fitting nonlinear relationships | Data with a roughly linear relationship, where results are needed quickly and interpretability matters |
| Decision tree regression | Handles nonlinear relationships, needs little data preprocessing, fairly interpretable | Prone to overfitting, unstable | Data with complex nonlinear relationships, where some interpretability is still required |
| Regression forest | Handles nonlinear relationships well, reduces overfitting, stable | Higher computational cost, relatively poor interpretability | Data with complex nonlinear relationships, where accuracy and stability matter more than interpretability |
4.2 Recommendations for choosing
In practice, we can choose a suitable regression algorithm by working through the following steps:
1. **Analyze the data**: look at the distribution and the features and judge whether the data shows a linear relationship. If it is roughly linear, linear regression may be a good choice; if the relationships are complex and nonlinear, consider decision tree regression or a regression forest.
2. **Weigh model complexity against interpretability**: if interpretability matters and you want to see clearly how each feature affects the outcome, linear regression or decision tree regression may be more appropriate; if accuracy and stability matter more than interpretability, a regression forest is the better choice.
3. **Account for compute and time constraints**: linear regression has low computational complexity and trains quickly, so it suits situations where resources are limited or time is tight; decision tree regression has moderate complexity and trains fairly fast; a regression forest is the most expensive to train and needs more compute.
5. Summary and outlook
5.1 Summary
This article covered the principles, implementations, and applications of three regression algorithms: linear regression, decision tree regression, and regression forest. Linear regression learns its weights via gradient descent or stochastic gradient descent and suits data with linear relationships; decision tree regression builds a tree by recursive binary splitting and uses weighted MSE to select the best splitting point, so it can handle nonlinear relationships; a regression forest ensembles multiple decision trees, which effectively reduces overfitting and improves accuracy and stability.
5.2 Outlook
As machine learning and artificial intelligence continue to evolve, regression algorithms keep being refined. Looking ahead, we can expect progress in several directions:
- **More efficient algorithms**: researchers will keep exploring more efficient regression algorithms to improve training speed and predictive accuracy.
- **Further development of ensemble learning**: ensembles already work well for regression, and more ensemble methods are likely to be proposed that push performance further.
- **Integration with deep learning**: deep learning is powerful at handling complex data, and combining regression algorithms with deep learning may yield stronger models.
- **Broader applications**: regression algorithms will be applied in more domains, such as healthcare, finance, and transportation, providing more effective tools for real-world problems.
In short, regression algorithms occupy an important place in machine learning, and continuing to learn and master them will help us solve practical problems better. We hope this article is helpful and sparks the reader's interest in regression algorithms.
6. Flowcharts
6.1 Linear regression training flow
graph TD;
A[Initialize weights] --> B[Compute predictions];
B --> C[Compute the cost function];
C --> D[Compute the gradient];
D --> E[Update the weights];
E --> F{Maximum iterations reached?};
F -- No --> B;
F -- Yes --> G[Output the final weights];
6.2 Decision tree regression construction flow
graph TD;
A[Start] --> B[Find the best splitting point];
B --> C[Split the node];
C --> D{Stopping criterion met?};
D -- Yes --> E[Make it a leaf node];
D -- No --> F[Recursively split the left and right children];
F --> B;
E --> G[End];
6.3 Regression forest training flow
graph TD;
A[Start] --> B[Initialize parameters];
B --> C[Draw a bootstrap sample and train a decision tree];
C --> D{All trees trained?};
D -- No --> C;
D -- Yes --> E[Average the predictions of all trees];
E --> F[Output the final model];
These flowcharts give a clearer picture of how the three regression algorithms are trained and constructed, which helps in understanding and applying them.