时间序列分析-linear-models-to-GARCH

最新推荐文章于 2025-07-17 17:52:11 发布

翻译最新推荐文章于 2025-07-17 17:52:11 发布 · 3.7k 阅读

文章标签：

#ARMA #TSA

数据分析专栏收录该内容

6 篇文章

订阅专栏

本文介绍了时间序列分析的基础知识，包括时间序列的定义、平稳性、自相关性及其重要性。讨论了白噪声和随机游走过程，并详细探讨了线性模型、对数线性模型、自回归模型、滑动平均模型、自回归滑动平均模型以及自回归条件异方差模型和广义自回归条件异方差模型。内容涵盖了模型的构建、验证和参数估计，强调了在金融领域中时间序列模型的应用。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

重点

稳态时间序列要满足三个条件：
1. 均值不随时间变化
2. 方差不随时间变化
3. 协方差不随时间变化
验证一个TSM的正确性的方法是验证其残差是否是白噪声
random walk process可以建模，但无法做预测？
时间序列分析的套路是不断分解目标序列，提取趋势/周期性信息，直至残留信息是白噪声序列为止
时间序列分析TSA的套路,一个尝试各种已知模型的过程,通过对残差的白噪声验证确定模型的有效性,利用AIC/BIC等方式控制模型复杂度
ARCH(p)模型AR(p)是作用在方差上的模型
GARCH(p,q)模型是ARMA(p,q)作用在方差上的模型,而且它处理的序列本身很接近白噪声序列,但是平方后表现出自相关性

源

The Basics

What is a Times Series?

A time series is a series of data points indexed(or listed or graphed) in time order

Stationarity

下图来自SeanAbu.com,给出stationarity直观的解释
这里写图片描述

为什么要关注于数据的stationarity？
* 只有当数据未来的统计属性和现在的统计属性相关时，时序数据才容易被预测
* Time Series Analysis（TSA）中大部分模型都假设 convariance-stationarity（上图中的#3）所以只有当TS时stationarity时，模型预测出的统计属性，比如means，variances和correlations才是可靠的。

"For example, if the series is consistently increasing over time, the sample mean and variance will grow with the size of the sample, and they will always underestimate the mean and variance in future periods. And if the mean and variance of a series are not well-defined, then neither are its correlations with other variables." - http://people.duke.edu/~rnau/411diff.htm

实践中，金融领域的TS都不是stationary的。TSA中大部分工作是如何分辨目标序列是否是stationary的，如果不是则需要找到方法把其变成stationary的。

Serial Correlation（Autocorrelation）

本质上我们对一个序列建模就是把序列分解成三部分：趋势，周期性和随机性。随机部分称为残差和错误，是预测值和观测值之间的差异。

Why Do We Care about Serial Correlation？

Serial Correlation本质上和stationarity有关，对于模型预测结果的有效性具有重要作用。根据定义，stationary TS模型的残差是相互无关的。模型中如果没考虑到这个会导致模型系数的估测不准，增大T-statistics，增大Type-1错误。In Layman’s terms,ignoring automcorrelation means our model predictions will be bunk,and we’re likey to draw incorrect conclusions about the impack of the independent variables in our model.

译注：上述两个小节似乎把序列和模型混在一起了，建模是把序列分成趋势/周期/随机三个部分分开建模，其中随机部分就是模型的残差，残差序列不能是自相关的，必须是独立的。一个模型预测的残差是自相关的，则模型有问题

White Noise and Random Walks

我们接触到的第一个Time Series Model（TSM）是White noise（最简单的TSM）。White noise process（译注：模型）的残差是不相关，而且残差的期望是0.这种序列的不相关残差也称为independent and identically distributed（i.i.d). 如果一个TSM成功拟合到真正的过程，模型的残差应该是i.i.d,就是white noise process。TSA的一部分工作就是学习一个模型，使得模型的残差和white noise不可区分。
下面的代码来自于 Seanabu.com,可以模拟并显示white noise process。

def tsplot(y, lags=None, figsize=(10, 8), style='bmh'):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        #mpl.rcParams['font.family'] = 'Ubuntu Mono'
        layout = (3, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        qq_ax = plt.subplot2grid(layout, (2, 0))
        pp_ax = plt.subplot2grid(layout, (2, 1))

        y.plot(ax=ts_ax)
        ts_ax.set_title('Time Series Analysis Plots')
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)
        sm.qqplot(y, line='s', ax=qq_ax)
        qq_ax.set_title('QQ Plot')        
        scs.probplot(y, sparams=(y.mean(), y.std()), plot=pp_ax)

        plt.tight_layout()
    return

下面的代码生成一个white noise process并显示输出。

np.random.seed(1)

# plot of discrete white noise
randser = np.random.normal(size=1000)
tsplot(randser, lags=30)

white-noise-1

上述序列表现出0附近的随机性。而且autocorrelation（ACF）和partial autocorrelation（PACF）没有明显的自相关。Keep in mind we should see approximately 5% significance in the autocorrelation plots due to pure chance as a result of sampling from the Normal distribution。在QQ和Probability Plots上，我们把测试数据和标准的正态分布做比较，如图所示，测试数据的确和一个正态白噪声标准模型匹配。

p("Random Series\n -------------\nmean: {:.3f}\nvariance: {:.3f}\nstandard deviation: {:.3f}"
.format(randser.mean(), randser.var(), randser.std()))

# Random Series
# -------------
# mean: 0.039 
# variance: 0.962
# standard deviation: 0.981

A Random Walk is defined below:

Random-walk-def

random walk是非稳态的，因为其covariance和时间相关，random walk对应的TS是无法准确预测的。

下面的代码利用numpy库从标准正态分布中采样生成一个random walk序列

# Random Walk without a drift

np.random.seed(1)
n_samples = 1000

x = w = np.random.normal(size=n_samples)
for t in range(n_samples):
    x[t] = x[t-1] + w[t]

_ = tsplot(x, lags=30)

Random-walk-1

显然这个TS是非稳态的。来验证一下random walk model是否是符合我们模拟的数据。因为random walk的定义是 $x_t = x_{t-1}+w_t$ ,易得 $x_t - x_{t-1}=w_t$ ，即random walk 序列的一阶差分就是白噪声过程！我们用”np.diff()”在模拟数据上做验证。

# First difference of simulated Random Walk series

_ = tsplot(np.diff(x), lags=30)

diff_random-walk-1

译注：如果一个序列是random walk process，则可以用 random walk model对齐建模，但是无法准确预测？？

Linear Models

Linear models又称trend models可以描述用直线可视化的时间序列，其方程如下：

y t = β 0 + β 1 t + ϵ t

$y_t = \beta_0 + \beta_1 t + \epsilon _t$
因变量由系数

β β $\beta$ 和自变量时间

t t $t$ 决定。比如公司销售额每隔一段时间会增加固定的值，就是一个linear models的例子。比如下面的例子，假设ABC公司销量固定销量是-50美元（

β = 0

$\beta=0$ ，即截距），每一个固定时间段内增加25美元。

# simulate linear trend
# example Firm ABC sales are -$50 by default and +$25 at every time step

w = np.random.randn(100)
y = np.empty_like(w)

b0 = -50.
b1 = 25.
for t in range(len(w)):
    y[t] = b0 + b1*t + w[t]

_ = tsplot(y, lags=lags)

linear-models

可以看到模型的残差是相关的，而且随时间间隔lag线性降低，近似正态分布。在利用这个模型进行预测之前，需要线消除明显的自相关性。延时1时PACF出现一个峰值说明合适的模型可能是autoregressive模型

译注：上图中的第一个图“Time Series”不是一个直线，由随机噪声，只是噪声的幅度小于信号幅度。另外上图中应该也没有error/residuals出现，没有模型自然不存在残差。图中只能看到序列的自相关随时间延迟lag而降低，QQ和Probability显示和正态分布接近，但在首尾处并不严格匹配

Log-Linear Models

这个模型和Linear models相似，但是数据点形成一个指数函数分布，表示每个时间间隔内由固定的变化率。比如ABC公司销售额每段时间增加X%，如下所示：

# Simulate ABC exponential growth

# fake dates
idx = pd.date_range('2007-01-01', '2012-01-01', freq='M')

# fake sales increasing at exponential rate
sales = [np.exp( x/12 ) for x in range(1, len(idx)+1)]

# create dataframe and plot
df = pd.DataFrame(sales, columns=['Sales'], index=idx)

with plt.style.context('bmh'):
    df.plot()
    plt.title('ABC Sales')

log-linear-models

取自然对数后linear regression就可以拟合数据

# ABC log sales 

with plt.style.context('bmh'):
    pd.Series(np.log(sales), index=idx).plot()
    plt.title('ABC Log Sales')

log-log-linear-models

如前面讨论，这些模型都假设序列无关的残差，但是前面的linear model的例子已经说明这个假设是错误的。实际生活中，TS数据往往不符合稳态的假设，由此人们设计了autoregressive models。

译注：对这个两个模型出现在此处的含义不明确，这两个基于时间t建立的模型不符合稳态性，按照最开始的描述，非稳态时间序列不能用常见的平稳序列模型建模。

Autoregressive Models-AR（p）

autoregressive model中假设因变量依赖于同序列中历史中若干项，公式如下

x t = a 1 x t - 1 + . . . + a p x t - p + w t = \sum i = 1 P a i x t - i + w t

$x_t = a_1x_{t-1}+...+a_px_{t-p}+w_t = \sum_{i=1}^{P}a_ix_{t-i}+w_t$
其中“P”是模型的阶，表示延时lag。例如二阶AR模型AR（2）公式如下：

x t = a 1 x t - 1 + a 2 x t - 2 + w t

$x_t = a_1x_{t-1} + a_2x_{t-2}+w_t$
其中

a a $a$ 是系数，

w

$w$ 是white noise 项。AR模型中

a a $a$ 不能为0。AR(1)中令

a = 1

$a=1$ 就是random walk序列,非稳态

x t = x t - 1 + w t

$x_t=x_{t-1}+w_t$

译注: AR(1)中a != 1就是稳态过程了??

下面代码模拟AR(1)模型,而且 $a = 0.6$

# Simulate an AR(1) process with alpha = 0.6

np.random.seed(1)
n_samples = int(1000)
a = 0.6
x = w = np.random.normal(size=n_samples)

for t in range(n_samples):
    x[t] = a*x[t-1] + w[t]

_ = tsplot(x, lags=lags)

ar(1)-0.6

如上图所示,这个从AR(1)模拟的序列符合正态分布,延时的序列之间存在相关性,尤其是延时 $lag=1$ 时存在强相关性,如PACF所示。

我们利用python的statsmodels拟合上述的测试数据，返回alpha系数。然后利用statsmodels的”select_order()”返回lag值.如果AR模型可行,返回的alpha应该接近0.6,order值等于1.

# Fit an AR(p) model to simulated AR(1) model with alpha = 0.6

mdl = smt.AR(x).fit(maxlag=30, ic='aic', trend='nc')
%time est_order = smt.AR(x).select_order(
    maxlag=30, ic='aic', trend='nc')

true_order = 1
p('\nalpha estimate: {:3.5f} | best lag order = {}'
  .format(mdl.params[0], est_order))
p('\ntrue alpha = {} | true order = {}'
  .format(a, true_order))


alpha estimate: 0.58227 | best lag order = 1
true alpha = 0.6 | true order = 1

如上我们成功发现模拟数据对应模型的参数. 下面模拟AR(2)过程,设置 $\alpha _1 = 0.666 \ \ \alpha _2=-0.333$ . 利用statsmodels的”arma_generate_samples()”函数,它可以用来模拟任意阶的AR模型.

# Simulate an AR(2) process

n = int(1000)
alphas = np.array([.666, -.333])
betas = np.array([0.])

# Python requires us to specify the zero-lag value which is 1
# Also note that the alphas for the AR model must be negated
# We also set the betas for the MA equal to 0 for an AR(p) model
# For more information see the examples at statsmodels.org
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]

ar2 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n) 
_ = tsplot(ar2, lags=lags)

ar(2)_0.666_-0.333

利用下面的代码尝试复现模型参数

# Fit an AR(p) model to simulated AR(2) process

max_lag = 10
mdl = smt.AR(ar2).fit(maxlag=max_lag, ic='aic', trend='nc')
est_order = smt.AR(ar2).select_order(
    maxlag=max_lag, ic='aic', trend='nc')

true_order = 2
p('\ncoef estimate: {:3.4f} {:3.4f} | best lag order = {}'
  .format(mdl.params[0],mdl.params[1], est_order))
p('\ntrue coefs = {} | true order = {}'
  .format([.666,-.333], true_order))

# coef estimate: 0.6291 -0.3196 | best lag order = 2
# true coefs = [0.666, -0.333] | true order = 2

Moving Average Models-MA(q)

MA(q)模型和AR(p)类似,但是MA(q)是利用历史白噪声的错误项线性组合,而不是历史观测值的线性组合. MA(q)的出发点是利用错误项直接接触到”震荡”,而AR(p)模型是用ACF间接接触震荡.MA(q)数学公式如下

x t = w t + β t w t - 1 + . . . + β p w t - p = w t + \sum i = 1 P β t w t - i

$x_t = w_t+\beta _t w_{t-1} + ... + \beta _p w_{t-p} = w_t + \sum_{i=1}^{P}\beta _t w_{t-i}$

w w $w$ 是白噪声项,期望为0 and variance of sigma squared. 下面的代码利用statsmodel的arma_generate_sample()函数模拟MA(1),令beta=0.6,并设置AR(p)的
alpha = 0

# Simulate an MA(1) process

n = int(1000)

# set the AR(p) alphas equal to 0
alphas = np.array([0.])
betas = np.array([0.6])

# add zero-lag and negate alphas
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]

ma1 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n) 
_ = tsplot(ma1, lags=30)

ma(1)_0.6

译注: 给定一个时间序列,如何获得white noise序列构造MA(q)模型???

ACF中lag=1很大显示MA(1)可能比较适合目标序列.但此时PACF中lag=2,3,4的幅值很大的原因不详.不论如何,我们可以先尝试用MA(1)对目标序列建模.用选定的order值调用statsmodels的ARMA()函数,利用fit()返回模型输出

# Fit the MA(1) model to our simulated time series
# Specify ARMA model with order (p, q)

max_lag = 30
mdl = smt.ARMA(ma1, order=(0, 1)).fit(
    maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())

ma(1)_model_summary

模型成功预测lag系数为0.58,而真实值是0.6. Also notice that our 95% confidence interval does contain the true value. 下面利用ARMA函数拟合三阶MA模型,验证是否能够复原出真实的系数,0.6,0.4和0.2

# Simulate MA(3) process with betas 0.6, 0.4, 0.2

n = int(1000)
alphas = np.array([0.])
betas = np.array([0.6, 0.4, 0.2])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]

ma3 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n)
_ = tsplot(ma3, lags=30)

ma(3)_0.6_0.4_0.2

# Fit MA(3) model to simulated time series

max_lag = 30
mdl = smt.ARMA(ma3, order=(0, 3)).fit(
    maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())

ma(3)_model_summary

模型成功预测真实的参数.Our 95% confidence intervals also contain the true parameter values of 0.6, 0.4 和 0.3 (译注:最后一个真值不是0.2?)

Autoregressive Moving Average Models-ARMA(p,q)

ARMA就是AR和MA模型的融合,从量化金融的角度重新解读这些模型:
* AR(p)模型目标在于解释交易市场中的的趋势(均值和势头)
* MA(q)模型目标是解释白噪声中出现的震荡.这些震荡可以被理解为一些影响观测值的意外事件,比如意外利润,恐怖袭击等

"For a set of products in a grocery store, the number of active coupon campaigns introduced at different times would constitute multiple 'shocks' that affect the prices of the products in question."
AM207: Pavlos Protopapas, Harvard University

ARMA’s weakness is that it ignores the volatility clustering effects found in most financial time series.

模型公式如下:

x_{t} = a_{1} x_{t - 1} + a_{2} x_{t - 2} + . . . + w_{t} + β_{1} w_{t - 1} + β_{2} w_{t - 2} + . . . + β_{q} w_{t - q} = \sum_{i = 1}^{P} a_{i} x_{t - i} + w_{t} + \sum_{i = 1}^{Q} β_{i} w_{t - i}

$x_t = a_1x_{t-1}+a_2x_{t-2}+...+w_t+\beta _1 w_{t-1} + \beta _2 w_{t-2} + ... + \beta _q w_{t-q} \\= \sum^{P}_{i=1}a_i x_{t-i} + w_t + \sum^Q_{i=1}\beta _i w_{t-i}$

利用给定的参数模拟ARMA(2,2)过程,然后利用ARMA(2,2)拟合测试序列,验证模型是否能够复现参数. 令 $a = \{0.5, -0.25\}, \beta = \{0.5,-0.3\}$

# Simulate an ARMA(2, 2) model with alphas=[0.5,-0.25] and betas=[0.5,-0.3]
max_lag = 30

n = int(5000) # lots of samples to help estimates
burn = int(n/10) # number of samples to discard before fit

alphas = np.array([0.5, -0.25])
betas = np.array([0.5, -0.3])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]

arma22 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n, burnin=burn)
_ = tsplot(arma22, lags=max_lag)

mdl = smt.ARMA(arma22, order=(2, 2)).fit(
    maxlag=max_lag, method='mle', trend='nc', burnin=burn)
p(mdl.summary())

arma(2,2)_processing

arma(2,2)_model_summary

模型成功复现了参数,and our true parameters are contained within the 96% confidence interval.

下面我们模拟测试ARMA(3,2)模型,在目标序列上测试各种p,q组合之后,我们选择对应Akaike Information Criterion(AIC)最小的一个组合作为最终结果

# Simulate an ARMA(3, 2) model with alphas=[0.5,-0.25,0.4] and betas=[0.5,-0.3]

max_lag = 30

n = int(5000)
burn = 2000

alphas = np.array([0.5, -0.25, 0.4])
betas = np.array([0.5, -0.3])

ar = np.r_[1, -alphas]
ma = np.r_[1, betas]

arma32 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n, burnin=burn)
_ = tsplot(arma32, lags=max_lag)

# pick best order by aic 
# smallest aic value wins
best_aic = np.inf 
best_order = None
best_mdl = None

rng = range(5)
for i in rng:
    for j in rng:
        try:
            tmp_mdl = smt.ARMA(arma32, order=(i, j)).fit(method='mle', trend='nc')
            tmp_aic = tmp_mdl.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_order = (i, j)
                best_mdl = tmp_mdl
        except: continue


p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))

# aic: 14108.27213 | order: (3, 2)

序列对应的阶数被正确复现.

Autoregressive Integrated Moving Average Models - ARIMA(p,d,q)

ARIMA是对ARMA的推广, 如前文所述,很多TS都是非稳态的,但是通过差分可以把他们转换成稳态的.我们对高斯random walk序列取一阶差分,并证明结果是white noise序列.
避免深入复杂的数学公式,只需要知道这里的”d”是指我们对序列差分的次数.在python中,如果要做多次差分,需要使用np.diff()函数.pandas中的DataFrame.diff()/Series.diff()只支持一次差分.

Autoregressive Conditionally Heteroskedastic Models - ARCH(p)

ARCH(p)模型可以认为是作用在时间序列的方差上的AR(p)模型, 或另一个角度的理解是时间序列当前的方差依赖于过去一段时间的方差(译注:方差是乘性的)

V a r (y t | y t - 1) = σ 2 t = a 0 + a 1 y 2 t - 1

$Var(y_t | y_{t-1}) = \sigma _t^2=a_0 + a_1 y^2_{t-1}$
假设序列均值为0,模型可以描述如下

y t = σ t ϵ t, w i t h σ t = a 0 + a 1 y 2 t - 1 - - - - - - - - - \sqrt, a n d σ t i i d (0, 1)

$y_t = \sigma _t \epsilon _t, \ with \ \sigma _t = \sqrt{a_0+a_1y^2_{t-1}}, \ and \ \sigma _t ~ iid(0,1)$

# Simulate ARCH(1) series
# Var(yt) = a_0 + a_1*y{t-1}**2
# if a_1 is between 0 and 1 then yt is white noise

np.random.seed(13)

a0 = 2
a1 = .5

y = w = np.random.normal(size=1000)
Y = np.empty_like(y)

for t in range(len(y)):
    Y[t] = w[t] * np.sqrt((a0 + a1*y[t-1]**2))

# simulated ARCH(1) series, looks like white noise
tsplot(Y, lags=30)

arch(1)_processing

arch(1)**2_processing

Generalized Autoregressive Conditionally Heteroskedastic Models - GARCH(p,q)

ARMA模型作用到时间序列的variance上就是GARCH(p,q), The AR(p) models the variance of the residuals (squared errors) or simply our time series squared. The MA(q) portion models the variance of the process. The basic GARCH(1, 1) formula is:

ϵ t = σ t w t σ 2 t = a 0 + a 1 ϵ 2 t - 1 + β 1 σ 2 t - 1

$\epsilon_t=\sigma_tw_t \\ \sigma_t^2=a_0 + a_1 \epsilon^2_{t-1} + \beta_1 \sigma^2_{t-1}$

w w $w$ 是白噪声,

a

$a$ 和

β β $\beta$ 是模型参数,而且

α1+β1 α 1 + β 1 $\alpha_1 + \beta_1$ 必须小于1,否则模型将是不稳定的(译注:unstable). 如下的代码模拟GARCH(1,1)过程

# Simulating a GARCH(1, 1) process

np.random.seed(2)

a0 = 0.2
a1 = 0.5
b1 = 0.3

n = 10000
w = np.random.normal(size=n)
eps = np.zeros_like(w)
sigsq = np.zeros_like(w)

for i in range(1, n):
    sigsq[i] = a0 + a1*(eps[i-1]**2) + b1*sigsq[i-1]
    eps[i] = w[i] * np.sqrt(sigsq[i])

_ = tsplot(eps, lags=30)

garch(1,1)_processing

序列很接近白噪声,但是平方一下再看看

garch(1,1)2_processing

译注:这个序列本身很接近白噪声(没有信息可供提取),但是平方后表现出很强的相关性

ACF和PACF中同时显现处明显的自相关说明同时需要AR和MA两个组建.我们利用ARCH包里的arch_model()函数恢复GARCH(1,1)的参数.

# Fit a GARCH(1, 1) model to our simulated EPS series
# We use the arch_model function from the ARCH package

am = arch_model(eps)
res = am.fit(update_freq=5)
p(res.summary())

garch_model_summary