7-Day Mini-Course, Day 5: Autoregression (AR) Models for Time Series

hustqb

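This article introduces methods for checking autocorrelation in time series forecasting and demonstrates, through worked examples, how to forecast with an autoregression (AR) model.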

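Notes:

This article is lesson 5 of a course series.
This article is a translation of a course from a machine learning website.
Respect the original author; respect knowledge sharing.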

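Autoregression (AR) Model for Time Series

AR stands for autoregression model. It takes the values observed before time $t$ as input and computes the prediction for time $t$ through a regression equation. Autoregression is simple and efficient, and it is a common method for time series forecasting.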

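In this article, you will learn:

How to examine the autocorrelation of a time series
How to define, train, and use an autoregression model
How to make rolling predictions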

Autoregression

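A linear regression model has the form $\hat{y} = b_0 + b_1 \cdot X_1 + b_2 \cdot X_2$. It assumes a linear relationship between the inputs and the output. Once the coefficients $b_0, b_1, b_2$ are known, the model can be used to make predictions.

If we replace $X_1$ with the value at time $t-1$ and $X_2$ with the value at time $t-2$, we get the linear regression model $X(t) = b_0 + b_1 \cdot X(t-1) + b_2 \cdot X(t-2)$.

Because both the inputs and the output of this regression come from the same time series, it is called autoregression: a regression of the series on its own past values.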
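The construction above can be sketched with plain NumPy. This is only an illustration under stated assumptions: `make_lag_matrix` is a hypothetical helper, the series is synthetic (generated from known coefficients 2.0, 0.6, -0.2), and ordinary least squares via `numpy.linalg.lstsq` stands in for whatever fitting routine a library would use.

```python
import numpy as np

def make_lag_matrix(series, n_lags):
    """Hypothetical helper: build regression inputs from one series.

    Row for time t is [1, x(t-1), ..., x(t-n_lags)]; the target is x(t).
    """
    series = np.asarray(series, dtype=float)
    n_rows = len(series) - n_lags
    columns = [np.ones(n_rows)]                      # intercept column (b0)
    for k in range(1, n_lags + 1):
        columns.append(series[n_lags - k:len(series) - k])  # lag-k column
    return np.column_stack(columns), series[n_lags:]

# Synthetic series with known coefficients (an assumption for illustration).
rng = np.random.default_rng(0)
true_coef = [2.0, 0.6, -0.2]                         # b0, b1, b2
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = true_coef[0] + true_coef[1] * x[t - 1] + true_coef[2] * x[t - 2] + rng.normal()

# Ordinary least squares on the lagged columns.
X, y = make_lag_matrix(x, n_lags=2)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print('estimated b0, b1, b2:', np.round(coef, 2))
```

With enough data the estimates should land close to the coefficients used to generate the series, which is the sense in which the inputs and the output "come from the same series".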

Autocorrelation

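Autoregression assumes that the prediction at time $t$ depends on the observations before time $t$. Such a relationship between variables is called correlation; when the variables come from the same time series, it is called autocorrelation.

The more strongly an earlier value correlates with the value being predicted, the more weight the model can place on it, so the corresponding coefficient in $X(t) = b_0 + b_1 \cdot X(t-1) + b_2 \cdot X(t-2)$ tends to be larger. For example, if $X(t-2)$ is strongly correlated with $X(t)$, then $b_2$ tends to be large.

In other words, the size of the autocorrelation tells us which lagged values matter most for predicting the current value. If every lagged value shows little or no correlation with the current value, the series behaves like noise and is probably not predictable. That is why computing the autocorrelation is a useful first step in a forecasting project.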
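As a rough illustration of this point, the sketch below compares the lag-1 correlation of a pure-noise series with that of a series whose values depend on the previous value. Both series are synthetic, and `lag_correlation` is a hypothetical helper, not a library function.

```python
import numpy as np

def lag_correlation(series, lag):
    """Pearson correlation between x(t-lag) and x(t); lag must be >= 1."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

rng = np.random.default_rng(42)
noise = rng.normal(size=2000)            # no dependence on the past

dependent = np.zeros(2000)               # each value leans on the previous one
for t in range(1, len(dependent)):
    dependent[t] = 0.8 * dependent[t - 1] + noise[t]

r_noise = lag_correlation(noise, 1)
r_dependent = lag_correlation(dependent, 1)
print('lag-1 correlation, noise series:     %.3f' % r_noise)
print('lag-1 correlation, dependent series: %.3f' % r_dependent)
```

The noise series should give a value near zero, while the dependent series should give a value near the 0.8 used to generate it.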

Next, we put autoregression into practice in Python.

Minimum Daily Temperatures Dataset

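This dataset records the daily minimum temperature in Melbourne, Australia, over the 10 years from 1981 to 1990, for a total of 3,650 daily observations.
Note: the data file may contain some garbled characters; remember to remove them.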

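from pandas import read_csv
from matplotlib import pyplot
# Series.from_csv no longer exists in current pandas; read_csv(...).squeeze('columns') returns the same Series
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')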
print(series.head())
series.plot()
pyplot.show()
'''output
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
Name: Temp, dtype: float64
'''

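The temperatures clearly show some periodicity, but is the value at time $t$ actually related to the values before it? Below are a few quick checks for autocorrelation.

Quick check for autocorrelation: scatter plot

The figure below is a scatter plot of time $t$ against time $t+1$: each value $y(t)$ is paired with the following value $y(t+1)$ to form the coordinate pair $(y(t), y(t+1))$, and the pairs are drawn as a scatter plot.

Pandas provides the built-in function lag_plot() for this.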

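from pandas import read_csv
from matplotlib import pyplot
# lag_plot lives in pandas.plotting (pandas.tools.plotting is the old location)
from pandas.plotting import lag_plot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')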
lag_plot(series, lag=1)
pyplot.show()

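The plot shows that the temperatures of two adjacent observations are strongly correlated.

You can also change the lag argument of lag_plot() to look at the relationship between observations further apart. Another quick check is to compute the correlation directly, using the Pearson correlation coefficient. Its value lies in [-1, 1]; as a rule of thumb, a coefficient above 0.5 or below -0.5 indicates a fairly strong correlation.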

from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't']
result = dataframe.corr()  # compute the correlation matrix
print(result)
'''output
         t-1        t
t-1  1.00000  0.77487
t    0.77487  1.00000
'''

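As expected, each column has a correlation of 1 with itself. The correlation between adjacent time steps is 0.77, which is above 0.5 and indicates a fairly strong correlation.

Quick check for autocorrelation: autocorrelation plot

Remember this plot, which we introduced in day 3?

A more direct view is to draw the correlation coefficient for each lag as a line plot: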

from pandas import read_csv
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
plot_acf(series, lags=31)
pyplot.show()
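To read such a plot numerically, the sample autocorrelation for each lag can also be computed by hand. The sketch below is an illustration only: `sample_acf` is a hypothetical helper, the series is a synthetic stand-in (a yearly sine cycle plus noise) rather than the real temperature file, and the band of ±1.96/√n is a simple white-noise approximation that is not necessarily the band plot_acf() draws.

```python
import numpy as np

def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                       # center the series
    denom = np.dot(x, x)                   # lag-0 sum of squares
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# Synthetic stand-in for ten years of daily temperatures (not the real data):
# a yearly sine cycle plus random noise.
rng = np.random.default_rng(1)
t = np.arange(3650)
series = 11 + 4 * np.sin(2 * np.pi * t / 365) + rng.normal(scale=2, size=t.size)

acf_values = sample_acf(series, max_lag=31)
band = 1.96 / np.sqrt(len(series))         # approximate 95% band for white noise

for lag in (1, 7, 31):
    flag = 'outside' if abs(acf_values[lag]) > band else 'inside'
    print('lag %2d: acf=%.3f (%s the +/-%.3f band)' % (lag, acf_values[lag], flag, band))
```

Lags whose value falls outside the band are the ones worth considering as model inputs.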

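Now that we have some understanding of the autocorrelation in this series, we first compute a baseline with a persistence model before moving on to autoregression.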

Persistence Model

See day 4 for details.

from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error

series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't']

# split into train and test sets
X = dataframe.values
train, test = X[1:len(X)-7], X[len(X)-7:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]

# persistence model
def model_persistence(x):
    return x

# walk-forward validation
predictions = list()
for x in test_X:
    yhat = model_persistence(x)
    predictions.append(yhat)
test_score = mean_squared_error(test_y, predictions)
print('Test MSE: %.3f' % test_score)

# plot predictions vs expected
pyplot.plot(test_y)
pyplot.plot(predictions, color='red')
pyplot.show()
'''output
Test MSE: 3.423
'''

The red line shows the predicted values.

Autoregression Model

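One way to implement the model is to fit a linear model from sklearn on lagged columns; another is to use the autoregression class in statsmodels. This article takes the second approach.

First create a model with AR(), then call fit() to train it; fit() returns the trained model as an ARResults object. Finally, call predict() to make forecasts. Note that recent statsmodels releases removed the AR class in favor of AutoReg, which follows the same create, fit, predict pattern but requires the number of lags to be passed explicitly.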

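from pandas import read_csv
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
# Series.from_csv and the AR class no longer exist in current pandas/statsmodels;
# read_csv(...).squeeze('columns') and AutoReg are the replacements
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# split dataset: hold out the last 7 days as the test set
X = series.values
train, test = X[1:len(X)-7], X[len(X)-7:]

# train autoregression
# the legacy AR.fit() picked 29 lags automatically; AutoReg needs the lag count spelled out
lags = 29
model = AutoReg(train, lags=lags)
model_fit = model.fit()
print('Lag: %s' % lags)
print('Coefficients: %s' % model_fit.params)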

# make predictions
# note: this forecasts the entire test set in a single call
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
    print('predicted=%f, expected=%f' % (predictions[i], test[i]))
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)

# plot results
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
'''output
Lag: 29
Coefficients: [  5.57543506e-01   5.88595221e-01  -9.08257090e-02   4.82615092e-02
   4.00650265e-02   3.93020055e-02   2.59463738e-02   4.46675960e-02
   1.27681498e-02   3.74362239e-02  -8.11700276e-04   4.79081949e-03
   1.84731397e-02   2.68908418e-02   5.75906178e-04   2.48096415e-02
   7.40316579e-03   9.91622149e-03   3.41599123e-02  -9.11961877e-03
   2.42127561e-02   1.87870751e-02   1.21841870e-02  -1.85534575e-02
  -1.77162867e-03   1.67319894e-02   1.97615668e-02   9.83245087e-03
   6.22710723e-03  -1.37732255e-03]
predicted=11.871275, expected=12.900000
predicted=13.053794, expected=14.600000
predicted=13.532591, expected=14.000000
predicted=13.243126, expected=13.600000
predicted=13.091438, expected=13.500000
predicted=13.146989, expected=15.700000
predicted=13.176153, expected=13.000000
Test MSE: 1.502
'''

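The AR model uses the previous 29 days as input. The output lists 30 coefficients: the intercept followed by one coefficient per lag, so the coefficients show how much each of the 29 days contributes to the prediction. The script then prints the predictions and the MSE, which at 1.502 is considerably better than the baseline of 3.423.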
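Where does the 29 come from? As far as I can tell, the legacy AR class chose its default maximum lag from the number of training observations using the rule round(12 * (nobs / 100)^(1/4)). Treat that rule as an assumption about the old library rather than documented behavior; `default_ar_maxlag` below is a hypothetical helper that simply evaluates it for this training set.

```python
def default_ar_maxlag(nobs):
    """Assumed default lag rule of the legacy AR class: round(12 * (nobs/100)^(1/4))."""
    return int(round(12 * (nobs / 100.0) ** 0.25))

# 3650 observations, minus the first one (dropped by the slice X[1:...])
# and minus the 7 held out for testing.
n_train = 3650 - 1 - 7
print('training observations:', n_train)
print('default lag:', default_ar_maxlag(n_train))
```

For the 3,642 training observations used here the rule evaluates to 29, which matches the lag reported in the output.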

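In the example above, the model predicted the entire test set in one go.

When a new observation arrives, does the AR model need to be retrained to take it into account? If the model is cheap to fit, retraining is an option, but there is a better way: store the coefficients of the AR model and compute the prediction by hand whenever a new observation arrives. That is, we keep the coefficients of $\hat{y} = b_0 + b_1 \cdot X_1 + b_2 \cdot X_2 + ... + b_n \cdot X_n$ and reuse them to predict from new values of $X$.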
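A minimal sketch of that manual calculation is shown below. `predict_next` is a hypothetical helper and the coefficients are made-up numbers; the one assumption carried over from the model is the ordering, with the intercept first and then the coefficient for lag 1, lag 2, and so on.

```python
import numpy as np

def predict_next(coef, history):
    """One-step prediction from stored coefficients (hypothetical helper).

    coef[0] is the intercept b0; coef[k] multiplies the value k steps back.
    """
    coef = np.asarray(coef, dtype=float)
    n_lags = len(coef) - 1
    if n_lags == 0:
        return float(coef[0])
    if len(history) < n_lags:
        raise ValueError('need at least %d past values' % n_lags)
    # Reverse the tail so the most recent value lines up with coef[1].
    recent = np.asarray(history[-n_lags:], dtype=float)[::-1]
    return float(coef[0] + np.dot(coef[1:], recent))

# Made-up example: b0=1.0, b1=0.5, b2=0.25.
stored_coef = [1.0, 0.5, 0.25]
history = [2.0, 4.0, 8.0]                  # oldest first, newest last
yhat = predict_next(stored_coef, history)  # 1.0 + 0.5*8.0 + 0.25*4.0
print('prediction: %.2f' % yhat)

# A new observation arrives: append it and predict again, no refit needed.
history.append(6.0)
print('next prediction: %.2f' % predict_next(stored_coef, history))
```

The easy mistake is the ordering: the history is stored oldest first, while the lag-1 coefficient belongs to the newest value, hence the reversal.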

Now we use walk-forward prediction: each time we predict the next time step $t+1$, the true value at the current time $t$ is appended to the input list as a new observation.

The complete AR model code is as follows:

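from pandas import read_csv
from matplotlib import pyplot
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
# Series.from_csv and the AR class no longer exist in current pandas/statsmodels
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
# split dataset
X = series.values
train, test = X[1:len(X)-7], X[len(X)-7:]
# train autoregression
lags = 29  # the lag count the legacy AR class selected automatically
model = AutoReg(train, lags=lags)
model_fit = model.fit()
window = lags
coef = model_fit.params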
# walk forward over time steps in test
history = train[len(train)-window:]
history = [history[i] for i in range(len(history))]
predictions = list()
for t in range(len(test)):
    length = len(history)
    lag = [history[i] for i in range(length-window,length)]
    yhat = coef[0]
    for d in range(window):
        yhat += coef[d+1] * lag[window-d-1]
    obs = test[t]
    predictions.append(yhat)
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)
# plot
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
'''output
predicted=11.871275, expected=12.900000
predicted=13.659297, expected=14.600000
predicted=14.349246, expected=14.000000
predicted=13.427454, expected=13.600000
predicted=13.374877, expected=13.500000
predicted=13.479991, expected=15.700000
predicted=14.765146, expected=13.000000
Test MSE: 1.451
'''