Regression Algorithms for Stock Price Prediction
1. Linear Regression
1.1 Principles of Linear Regression
Linear regression explores the linear relationship between observations and the target, which is represented by a linear equation or weighted-sum function. Given a data sample $x = (x_1, x_2, \ldots, x_n)$ with $n$ features and the weight vector $w = (w_1, w_2, \ldots, w_n)$ of the linear regression model, the target $y$ is expressed as:
$y = w_1x_1 + w_2x_2 + \ldots + w_nx_n$
Sometimes the linear regression model comes with an intercept $w_0$, and the linear relationship above becomes:
$y = w_0 + w_1x_1 + w_2x_2 + \ldots + w_nx_n$
The weight vector $w$ of the model is learned from the training data, with the goal of minimizing the estimation error defined by the mean squared error (MSE). Given $m$ training samples $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$, the cost function $J(w)$ over the weights to be optimized is:
$J(w) = \frac{1}{2m} \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)})^2$
where $h_w(x) = w_0 + w_1x_1 + w_2x_2 + \ldots + w_nx_n$.
We can obtain the optimal $w$ that minimizes $J(w)$ with gradient descent. The first-order derivative (the gradient) $\nabla J(w)$ is derived as follows:
$\nabla J(w) = \frac{1}{m} \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)})x^{(i)}$
Combining the gradient with a learning rate $\alpha$, the weight vector $w$ is updated at each step as follows:
$w := w - \alpha \nabla J(w)$
1.2 Implementing Linear Regression
The following code implements linear regression with gradient descent:
```python
import numpy as np

def compute_prediction(X, weights):
    """ Compute the prediction y_hat based on current weights
    Args:
        X (numpy.ndarray)
        weights (numpy.ndarray)
    Returns:
        numpy.ndarray, y_hat of X under weights
    """
    predictions = np.dot(X, weights)
    return predictions
```
```python
def update_weights_gd(X_train, y_train, weights, learning_rate):
    """ Update weights by one step of gradient descent
    Args:
        X_train, y_train (numpy.ndarray, training data set)
        weights (numpy.ndarray)
        learning_rate (float)
    Returns:
        numpy.ndarray, updated weights
    """
    predictions = compute_prediction(X_train, weights)
    # The gradient of J(w) is -X^T (y - y_hat) / m, so we move against it
    # by adding learning_rate / m * X^T (y - y_hat)
    weights_delta = np.dot(X_train.T, y_train - predictions)
    m = y_train.shape[0]
    weights += learning_rate / float(m) * weights_delta
    return weights
```
```python
def compute_cost(X, y, weights):
    """ Compute the cost J(w)
    Args:
        X, y (numpy.ndarray, data set)
        weights (numpy.ndarray)
    Returns:
        float
    """
    predictions = compute_prediction(X, weights)
    cost = np.mean((predictions - y) ** 2 / 2.0)
    return cost
```
```python
def train_linear_regression(X_train, y_train, max_iter, learning_rate, fit_intercept=False):
    """ Train a linear regression model with gradient descent
    Args:
        X_train, y_train (numpy.ndarray, training data set)
        max_iter (int, number of iterations)
        learning_rate (float)
        fit_intercept (bool, with an intercept w0 or not)
    Returns:
        numpy.ndarray, learned weights
    """
    if fit_intercept:
        intercept = np.ones((X_train.shape[0], 1))
        X_train = np.hstack((intercept, X_train))
    weights = np.zeros(X_train.shape[1])
    for iteration in range(max_iter):
        weights = update_weights_gd(X_train, y_train, weights, learning_rate)
        # Check the cost for every 100 (for example) iterations
        if iteration % 100 == 0:
            print(compute_cost(X_train, y_train, weights))
    return weights
```
```python
def predict(X, weights):
    """ Predict targets for X; prepend an intercept column
    if the weights include a bias term w0 """
    if X.shape[1] == weights.shape[0] - 1:
        intercept = np.ones((X.shape[0], 1))
        X = np.hstack((intercept, X))
    return compute_prediction(X, weights)
```
1.3 A Small Linear Regression Example
Here is a small example showing how to run linear regression with the code above:
```python
X_train = np.array([[6], [2], [3], [4], [1], [5], [2], [6], [4], [7]])
y_train = np.array([5.5, 1.6, 2.2, 3.7, 0.8, 5.2, 1.5, 5.3, 4.4, 6.8])
weights = train_linear_regression(X_train, y_train, max_iter=100, learning_rate=0.01, fit_intercept=True)

X_test = np.array([[1.3], [3.5], [5.2], [2.8]])
predictions = predict(X_test, weights)

import matplotlib.pyplot as plt
plt.scatter(X_train[:, 0], y_train, marker='o', c='b')  # training points
plt.scatter(X_test[:, 0], predictions, marker='*', c='k')  # predictions
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```
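As a sanity check (an addition here, not part of the original example), the learned weights can be compared against scikit-learn's closed-form LinearRegression, which our gradient-descent solution should approach as it converges:

```python
from sklearn.linear_model import LinearRegression

# Reference fit via ordinary least squares
lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.intercept_, lr.coef_)  # compare with weights[0] and weights[1:]
```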
1.4 An Example on the Diabetes Dataset
We can also run linear regression on the diabetes dataset from scikit-learn:
```python
from sklearn import datasets

diabetes = datasets.load_diabetes()
print(diabetes.data.shape)  # (442, 10)

num_test = 30  # the last 30 samples as the testing set
X_train = diabetes.data[:-num_test, :]
y_train = diabetes.target[:-num_test]
weights = train_linear_regression(X_train, y_train, max_iter=5000, learning_rate=1, fit_intercept=True)

X_test = diabetes.data[-num_test:, :]
y_test = diabetes.target[-num_test:]
predictions = predict(X_test, weights)
print(predictions)
print(y_test)  # compare with the ground truth
```
1.5 Stochastic Gradient Descent
Besides (batch) gradient descent, linear regression can also be trained with stochastic gradient descent (SGD), which updates the weights one training sample at a time. We can use SGDRegressor from scikit-learn directly:
```python
from sklearn.linear_model import SGDRegressor

# Note: older scikit-learn versions spelled these parameters
# loss='squared_loss' and n_iter; recent versions use
# loss='squared_error' and max_iter
regressor = SGDRegressor(loss='squared_error', penalty='l2', alpha=0.0001,
                         learning_rate='constant', eta0=0.01, max_iter=1000)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
```
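To make the difference from batch gradient descent concrete, here is a minimal sketch of a per-sample update that reuses compute_prediction from above (an illustration of the idea, not the code SGDRegressor runs internally):

```python
def update_weights_sgd(X_train, y_train, weights, learning_rate):
    """ One epoch of stochastic gradient descent: the weights are updated
    after every single training sample, rather than once per full pass """
    for X_each, y_each in zip(X_train, y_train):
        prediction = compute_prediction(X_each, weights)
        weights_delta = X_each.T * (y_each - prediction)
        weights += learning_rate * weights_delta
    return weights
```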
2. Decision Tree Regression
2.1 Principles of Decision Tree Regression
In decision tree regression (also called a regression tree), the tree is constructed much as in classification, but because the target variable is continuous there are two main differences:
- The quality of a splitting point is now measured by the weighted mean squared error (MSE) of the two children. A child's MSE is equivalent to the variance of its target values, and the smaller the weighted MSE, the better the split (a numeric check follows the weighted_mse implementation below).
- The value of a leaf node becomes the mean of the targets in that terminal node, instead of the majority label as in a classification tree.
2.2 Implementing Decision Tree Regression
The following code implements decision tree regression:
```python
import numpy as np

def mse(targets):
    # The impurity of an empty set is defined as 0
    if targets.size == 0:
        return 0
    return np.var(targets)

def weighted_mse(groups):
    """ Calculate the weighted MSE of children after a split
    Args:
        groups (list of children, where a child consists of a list of targets)
    Returns:
        float, weighted impurity
    """
    total = sum(len(group) for group in groups)
    weighted_sum = 0.0
    for group in groups:
        weighted_sum += len(group) / float(total) * mse(group)
    return weighted_sum
```
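As a quick numeric check of the weighted-MSE criterion (the target values here are made up for illustration):

```python
# Candidate split sending targets {1, 2, 3} to one child and {5, 7} to the other:
# weighted MSE = 3/5 * var([1, 2, 3]) + 2/5 * var([5, 7])
#              = 0.6 * 0.667 + 0.4 * 1.0 ≈ 0.8
left, right = np.array([1, 2, 3]), np.array([5, 7])
print(weighted_mse([left, right]))  # ≈ 0.8
```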
```python
def split_node(X, y, index, value):
    """ Split data set X, y based on a feature and a value
    Args:
        X, y (numpy.ndarray, data set)
        index (int, index of the feature used for splitting)
        value (value of the feature used for splitting)
    Returns:
        list, list: left and right child, a child is in the format of [X, y]
    """
    x_index = X[:, index]
    # if this feature is numerical
    if type(X[0, index]) in [int, float]:
        mask = x_index >= value
    # if this feature is categorical
    else:
        mask = x_index == value
    # split into left and right child
    left = [X[~mask, :], y[~mask]]
    right = [X[mask, :], y[mask]]
    return left, right
```
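A small usage sketch (toy values chosen for illustration), splitting on the categorical feature at index 0:

```python
X_toy = np.array([['semi', 3], ['detached', 2], ['detached', 3]], dtype=object)
y_toy = np.array([600, 700, 800])
# 'detached' rows go right, everything else goes left
left, right = split_node(X_toy, y_toy, 0, 'detached')
print(left[1], right[1])  # [600] [700 800]
```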
```python
def get_best_split(X, y):
    """ Obtain the best splitting point and the resulting children for the data set X, y
    Args:
        X, y (numpy.ndarray, data set)
    Returns:
        dict {index: index of the feature, value: feature value, children: left and right children}
    """
    best_index, best_value, best_score, children = None, None, 1e10, None
    for index in range(len(X[0])):
        for value in np.sort(np.unique(X[:, index])):
            groups = split_node(X, y, index, value)
            impurity = weighted_mse([groups[0][1], groups[1][1]])
            if impurity < best_score:
                best_index, best_value, best_score, children = index, value, impurity, groups
    return {'index': best_index, 'value': best_value, 'children': children}
```
```python
def get_leaf(targets):
    # Obtain the leaf value as the mean of the targets
    return np.mean(targets)
```
```python
def split(node, max_depth, min_size, depth):
    """ Split children of a node to construct new nodes or assign them terminals
    Args:
        node (dict, with children info)
        max_depth (int, maximal depth of the tree)
        min_size (int, minimal samples required to further split a child)
        depth (int, current depth of the node)
    """
    left, right = node['children']
    del node['children']
    if left[1].size == 0:
        node['right'] = get_leaf(right[1])
        return
    if right[1].size == 0:
        node['left'] = get_leaf(left[1])
        return
    # Check if the current depth exceeds the maximal depth
    if depth >= max_depth:
        node['left'], node['right'] = get_leaf(left[1]), get_leaf(right[1])
        return
    # Check if the left child has enough samples
    if left[1].size <= min_size:
        node['left'] = get_leaf(left[1])
    else:
        # It has enough samples, we further split it
        result = get_best_split(left[0], left[1])
        result_left, result_right = result['children']
        if result_left[1].size == 0:
            node['left'] = get_leaf(result_right[1])
        elif result_right[1].size == 0:
            node['left'] = get_leaf(result_left[1])
        else:
            node['left'] = result
            split(node['left'], max_depth, min_size, depth + 1)
    # Check if the right child has enough samples
    if right[1].size <= min_size:
        node['right'] = get_leaf(right[1])
    else:
        # It has enough samples, we further split it
        result = get_best_split(right[0], right[1])
        result_left, result_right = result['children']
        if result_left[1].size == 0:
            node['right'] = get_leaf(result_right[1])
        elif result_right[1].size == 0:
            node['right'] = get_leaf(result_left[1])
        else:
            node['right'] = result
            split(node['right'], max_depth, min_size, depth + 1)

def train_tree(X_train, y_train, max_depth, min_size):
    """ Construction of a tree starts here
    Args:
        X_train, y_train (numpy.ndarray, training data)
        max_depth (int, maximal depth of the tree)
        min_size (int, minimal samples required to further split a child)
    Returns:
        dict, the root node of the trained tree
    """
    root = get_best_split(X_train, y_train)
    split(root, max_depth, min_size, 1)
    return root
```
```python
# Mapping from feature type to the textual condition shown for each branch
CONDITION = {'numerical': {'yes': '>=', 'no': '<'},
             'categorical': {'yes': 'is', 'no': 'is not'}}

def visualize_tree(node, depth=0):
    if isinstance(node, dict):
        if type(node['value']) in [int, float]:
            condition = CONDITION['numerical']
        else:
            condition = CONDITION['categorical']
        print('{}|- X{} {} {}'.format(depth * ' ', node['index'] + 1, condition['no'], node['value']))
        if 'left' in node:
            visualize_tree(node['left'], depth + 1)
        print('{}|- X{} {} {}'.format(depth * ' ', node['index'] + 1, condition['yes'], node['value']))
        if 'right' in node:
            visualize_tree(node['right'], depth + 1)
    else:
        # A leaf node: print the predicted value
        print('{}[{}]'.format(depth * ' ', node))
```
2.3 A Decision Tree Regression Example
Here is a small example that estimates house prices:
```python
X_train = np.array([['semi', 3],
                    ['detached', 2],
                    ['detached', 3],
                    ['semi', 2],
                    ['semi', 4]], dtype=object)
y_train = np.array([600, 700, 800, 400, 700])
tree = train_tree(X_train, y_train, 2, 2)
visualize_tree(tree)
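Tracing the weighted-MSE criterion by hand, the root split should fall on the house type ('detached' vs. not), so the printed tree should look roughly like this:

    |- X1 is not detached
     |- X2 < 3
      [400.0]
     |- X2 >= 3
      [650.0]
    |- X1 is detached
     [750.0]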
2.4 Decision Tree Regression with scikit-learn
We can use DecisionTreeRegressor from scikit-learn directly to predict Boston house prices:
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2; this snippet requires
# an older release (see the alternative sketch below for recent versions)
boston = datasets.load_boston()

num_test = 10  # the last 10 samples as the testing set
X_train = boston.data[:-num_test, :]
y_train = boston.target[:-num_test]
X_test = boston.data[-num_test:, :]
y_test = boston.target[-num_test:]

regressor = DecisionTreeRegressor(max_depth=10, min_samples_split=3)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
print(y_test)
```
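For scikit-learn 1.2 and later, where load_boston is no longer available, the same workflow can run on another dataset; a sketch using the California housing dataset (a substitution, not part of the original example):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

housing = fetch_california_housing()
num_test = 10  # the last 10 samples as the testing set
X_train, y_train = housing.data[:-num_test], housing.target[:-num_test]
X_test, y_test = housing.data[-num_test:], housing.target[-num_test:]

regressor = DecisionTreeRegressor(max_depth=10, min_samples_split=3)
regressor.fit(X_train, y_train)
print(regressor.predict(X_test))
print(y_test)
```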
2.5 Random Forest Regression
Random forest regression is an ensemble method that averages the regression results of multiple decision trees into a final decision. We can use RandomForestRegressor from scikit-learn:
```python
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=100, max_depth=10, min_samples_split=3)
regressor.fit(X_train, y_train)
predictions = regressor.predict(X_test)
print(predictions)
```
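The averaging can be checked directly on the fitted model; estimators_ holds the individual trees (a small check added here, assuming the forest above has been fitted):

```python
import numpy as np

# For regression, the forest's prediction is the mean of its trees' predictions
tree_predictions = np.array([tree.predict(X_test) for tree in regressor.estimators_])
print(np.allclose(tree_predictions.mean(axis=0), predictions))  # True
```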
Comparison of the linear regression and decision tree regression workflows

| Algorithm | Steps |
|---|---|
| Linear regression | 1. Define the prediction function 2. Define the weight-update function 3. Define the cost function 4. Train the model 5. Make predictions |
| Decision tree regression | 1. Define the MSE and weighted-MSE functions 2. Define the node-splitting function 3. Define the best-split search 4. Define the leaf-value function 5. Define the recursive splitting function 6. Train the model 7. Visualize the tree |
Flowchart of the linear regression and decision tree regression workflows

```mermaid
graph LR
    classDef startend fill:#F5EBFF,stroke:#BE8FED,stroke-width:2px;
    classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
    classDef decision fill:#FFF6CC,stroke:#FFBC52,stroke-width:2px;
    A([Start]):::startend --> B{Choose algorithm}:::decision
    B -->|Linear regression| C(Define the prediction function):::process
    C --> D(Define the weight-update function):::process
    D --> E(Define the cost function):::process
    E --> F(Train the model):::process
    F --> G(Make predictions):::process
    B -->|Decision tree regression| H(Define the MSE and weighted-MSE functions):::process
    H --> I(Define the node-splitting function):::process
    I --> J(Define the best-split search):::process
    J --> K(Define the leaf-value function):::process
    K --> L(Define the recursive splitting function):::process
    L --> M(Train the model):::process
    M --> N(Visualize the tree):::process
    G --> O([End]):::startend
    N --> O
```
3. Support Vector Regression
3.1 Principles of Support Vector Regression
Support vector regression (SVR) belongs to the support vector family and is the sibling algorithm of support vector classification (SVC). Recall that SVC seeks an optimal hyperplane that best separates observations of different classes. Given a hyperplane determined by a slope vector $w$ and an intercept $b$, the optimal hyperplane is chosen so that the distance from the nearest points in each separated space to the hyperplane itself is maximized.
In SVR, by contrast, the goal is to find a hyperplane (defined by a slope vector $w$ and an intercept $b$) such that the two hyperplanes at distance $\epsilon$ from it, $y = w^T x + b + \epsilon$ and $y = w^T x + b - \epsilon$, cover most of the training data. In other words, most data points are confined to the $\epsilon$-band around the optimal hyperplane. At the same time, the optimal hyperplane should be as flat as possible, meaning $\|w\|^2$ should be as small as possible.
This translates into deriving the optimal $w$ and $b$ by solving the following optimization problem:
- Minimize $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$
- Subject to $y_i - w^T x_i - b \leq \epsilon + \xi_i$, $w^T x_i + b - y_i \leq \epsilon + \xi_i^*$, $\xi_i \geq 0$, and $\xi_i^* \geq 0$, where $C$ is the penalty parameter and $\xi_i$, $\xi_i^*$ are slack variables.
3.2 Implementing Support Vector Regression
Solving the optimization problem above requires quadratic programming techniques, which are beyond the scope of this discussion, so we will use the SVR package from scikit-learn to implement the algorithm.
Here is the code that applies SVR to the earlier house price prediction problem:
```python
from sklearn.svm import SVR
from sklearn import datasets

# Load the Boston housing dataset (requires scikit-learn < 1.2, as above)
boston = datasets.load_boston()
num_test = 10  # the last 10 samples as the testing set
X_train = boston.data[:-num_test, :]
y_train = boston.target[:-num_test]
X_test = boston.data[-num_test:, :]
y_test = boston.target[-num_test:]

# Create the SVR regressor
regressor = SVR(C=0.1, epsilon=0.02, kernel='linear')
# Train the model
regressor.fit(X_train, y_train)
# Make predictions
predictions = regressor.predict(X_test)
print(predictions)
```
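SVR is sensitive to the scale of the features, so a common refinement (an addition here, not part of the original example) is to standardize the inputs in a Pipeline:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Standardize the features before fitting the SVR
model = make_pipeline(StandardScaler(), SVR(C=0.1, epsilon=0.02, kernel='linear'))
model.fit(X_train, y_train)
print(model.predict(X_test))
```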
4. Evaluating Regression Performance
4.1 Evaluation Metrics
To evaluate the performance of a regression model more thoroughly, we can use the following metrics (their formal definitions appear after this list):
- Mean squared error (MSE): measures the squared loss relative to the expected values. Taking the square root of the MSE converts the value back to the original scale of the target variable being estimated, which gives the root mean squared error (RMSE).
- Mean absolute error (MAE): measures the absolute loss on the same scale as the target variable, giving a direct sense of how close the predictions are to the actual values.
- $R^2$ score: indicates the goodness of fit of the regression model, ranging from 0 (no fit) to 1 (perfect prediction); it can even be negative for a model that fits worse than simply predicting the mean.
For MSE and MAE, the smaller the value, the better the regression model.
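For reference, over $m$ test samples with predictions $\hat{y}^{(i)}$, true targets $y^{(i)}$, and mean target $\bar{y}$, the standard definitions are:
$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2$
$\text{MAE} = \frac{1}{m} \sum_{i=1}^{m} |\hat{y}^{(i)} - y^{(i)}|$
$R^2 = 1 - \frac{\sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^{m} (y^{(i)} - \bar{y})^2}$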
4.2 Evaluating on the Diabetes Dataset
Here is an example that computes these metrics with the corresponding scikit-learn functions. We reuse the diabetes dataset and tune the parameters of the linear regression model with grid search:
```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
num_test = 30  # the last 30 samples as the testing set
X_train = diabetes.data[:-num_test, :]
y_train = diabetes.target[:-num_test]
X_test = diabetes.data[-num_test:, :]
y_test = diabetes.target[-num_test:]

# Define the parameter grid ('max_iter' was called 'n_iter' in old scikit-learn)
param_grid = {
    "alpha": [1e-07, 1e-06, 1e-05],
    "penalty": [None, "l2"],
    "eta0": [0.001, 0.005, 0.01],
    "max_iter": [300, 1000, 3000]
}

# Create the SGDRegressor
regressor = SGDRegressor(loss='squared_error', learning_rate='constant')
# Create the grid search object with 3-fold cross-validation
grid_search = GridSearchCV(regressor, param_grid, cv=3)
# Run the grid search
grid_search.fit(X_train, y_train)
# Print the best parameters
print(grid_search.best_params_)
# Get the best estimator
regressor_best = grid_search.best_estimator_
# Make predictions
predictions = regressor_best.predict(X_test)

# Compute the evaluation metrics
print(mean_squared_error(y_test, predictions))
print(mean_absolute_error(y_test, predictions))
print(r2_score(y_test, predictions))
```
4.3 Performance Comparison of the Regression Algorithms

| Algorithm | Strengths | Weaknesses | Suitable scenarios |
|---|---|---|---|
| Linear regression | Simple to understand, computationally efficient | Weak at modeling nonlinear relationships | Data with a linear relationship |
| Decision tree regression | Handles nonlinear relationships, highly interpretable | Prone to overfitting | Data with complex nonlinear relationships where an interpretable model is needed |
| Support vector regression | Strong with high-dimensional data and nonlinear relationships | High computational cost, parameters hard to tune | High-dimensional data with nonlinear relationships |
Flowchart for choosing a regression algorithm

```mermaid
graph LR
    classDef startend fill:#F5EBFF,stroke:#BE8FED,stroke-width:2px;
    classDef process fill:#E5F6FF,stroke:#73A6FF,stroke-width:2px;
    classDef decision fill:#FFF6CC,stroke:#FFBC52,stroke-width:2px;
    A([Start]):::startend --> B{Is the relationship linear?}:::decision
    B -->|Yes| C(Linear regression):::process
    B -->|No| D{Is the data high-dimensional?}:::decision
    D -->|Yes| E(Support vector regression):::process
    D -->|No| F{Is an interpretable model needed?}:::decision
    F -->|Yes| G(Decision tree regression):::process
    F -->|No| H(Other nonlinear regression algorithms):::process
    C --> I([End]):::startend
    E --> I
    G --> I
    H --> I
```
In summary, each regression algorithm has its own strengths, weaknesses, and suitable scenarios. In practice, we should choose the algorithm that matches the characteristics of the data and the needs of the problem, and use the performance metrics above to assess the resulting model and reach the best predictions.