Gaussian Process, GPSS Summer School Notes, Part 3 (English Version): Kernel Design

This post is about kernel functions for Gaussian processes. It introduces the definition of a kernel, noting that a kernel corresponds to the covariance of a GP and is a symmetric positive semi-definite function. It discusses how to choose an appropriate kernel, taking stationarity, differentiability and other properties of the function into account, and shows how to build new kernels from old ones, via sums, products and composition with a function, illustrated on the CO₂ example. It closes with periodicity detection.


What is a kernel?

Theorem (Loève):

$k$ corresponds to the covariance of a GP if and only if $k$ is a symmetric positive semi-definite function.

When $k$ is a function of $x - y$, the kernel is called stationary; $\sigma^2$ is called the variance and $\theta$ the lengthscale. It is quite important to look at the lengthscale after the optimization step: if its value is very small, the model gains no information from the surrounding observations, and therefore gives poor predictions at the test points.
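As a small illustration (my own sketch, not from the lecture), the snippet below evaluates a squared-exponential kernel, the stationary kernel used later in the CO₂ example, and shows how the lengthscale $\theta$ controls how much an observation one unit away informs a prediction:

```python
import numpy as np

def k_se(x, y, sigma2=1.0, theta=1.0):
    """Squared-exponential kernel k(x, y) = sigma^2 exp(-(x - y)^2 / theta^2):
    sigma^2 is the variance, theta the lengthscale."""
    return sigma2 * np.exp(-(x - y) ** 2 / theta ** 2)

# Covariance between a prediction point and an observation one unit away:
print(k_se(0.0, 1.0, theta=5.0))    # ~0.96: large lengthscale, the observation is informative
print(k_se(0.0, 1.0, theta=0.05))   # ~0.0 : tiny lengthscale, the observation carries no information
```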

Choosing an appropriate kernel

In order to choose a kernel, one should gather all possible
information about the function to approximate:

  • Is it stationary?
  • Is it differentiable? What is its regularity?
  • Do we expect particular trends?
  • Do we expect particular patterns (periodicity, cycles, additivity)?

Kernels often include rescaling parameters: $\theta$ for the x axis (the lengthscale) and $\sigma$ for the y axis ($\sigma^2$ often corresponds to the GP variance). They can be tuned by

  • maximizing the likelihood
  • minimizing the prediction error
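For the first option, here is a hedged sketch using scikit-learn, whose GaussianProcessRegressor maximizes the log marginal likelihood over the kernel hyperparameters during fitting; the data are synthetic placeholders and the kernel choice is just an example:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Synthetic 1-D data set (placeholder for real observations).
X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(30)

# sigma^2 (ConstantKernel) and theta (RBF length_scale) are tuned by
# maximizing the log marginal likelihood inside fit().
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2).fit(X, y)

print(gp.kernel_)                         # optimized variance and lengthscale
print(gp.log_marginal_likelihood_value_)  # value of the maximized likelihood
```

Inspecting `gp.kernel_` here is exactly the sanity check mentioned earlier: a suspiciously small fitted lengthscale means the model is ignoring neighbouring observations.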

It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values:

  • On a test set
  • Using leave-one-out
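A sketch of the leave-one-out option, again with synthetic data and two arbitrary candidate kernels (RBF and Matérn) standing in for the kernels one would actually want to compare:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(30)

# Compare candidate kernels by their leave-one-out prediction error.
for kernel in [RBF(length_scale=1.0), Matern(length_scale=1.0, nu=1.5)]:
    gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2)
    scores = cross_val_score(gp, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print(kernel, -scores.mean())  # smaller mean squared error is better
```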

Furthermore, it is often interesting to try an input remapping such as $x \rightarrow \log(x)$ or $x \rightarrow \exp(x)$ to make the data set stationary, and then use a stationary kernel.
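For instance (a hypothetical signal of my own, just to illustrate the idea), oscillations that stretch out as $x$ grows become stationary after a log remapping:

```python
import numpy as np

# Oscillations whose period grows with x: non-stationary in x,
# but constant-period (hence stationary-looking) in log(x).
x = np.linspace(1.0, 100.0, 300)
y = np.sin(10.0 * np.log(x))

x_remapped = np.log(x)  # fit the GP on (x_remapped, y) with a stationary kernel
```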

Making new from old

  • Summed together
    • On the same space k(x, y) = k1(x, y) + k2(x, y)
    • On the tensor space k(x, y) = k1(x1, y1) + k2(x2, y2)
  • Multiplied together
    • On the same space k(x, y) = k1(x, y) × k2(x, y)
    • On the tensor space k(x, y) = k1(x1, y1) × k2(x2, y2)
  • Composed with a function
    • k(x, y) = k1(f (x), f (y))
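A minimal numpy sketch of the same-space constructions above (the kernel building blocks and names below are my own choices, not prescribed by the lecture):

```python
import numpy as np

def k_rbf(x, y, sigma2=1.0, theta=1.0):
    """Squared-exponential building block."""
    return sigma2 * np.exp(-(x - y) ** 2 / theta ** 2)

def k_per(x, y, sigma2=1.0, theta=1.0, period=1.0):
    """Standard periodic kernel, used here as a second building block."""
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * (x - y) / period) ** 2 / theta ** 2)

# New kernels from old; all of them remain symmetric positive semi-definite.
k_sum  = lambda x, y: k_rbf(x, y) + k_per(x, y)      # sum on the same space
k_prod = lambda x, y: k_rbf(x, y) * k_per(x, y)      # product on the same space
k_comp = lambda x, y: k_rbf(np.log(x), np.log(y))    # composition with f = log
```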

Example: CO₂

[Figure: the CO₂ concentration data]

  • First, we consider a squared-exponential kernel:
    $$k(x, y)=\sigma^{2} \exp\left(-\frac{(x-y)^{2}}{\theta^{2}}\right)$$

[Figure: GP fits with a small lengthscale (left) and a large lengthscale (right)]

A first idea is to focus on the high-frequency variation we observe in the data and choose a very small lengthscale; the result is shown in the left picture. The problem is that with such a small lengthscale, a prediction made in, say, 2020 gains no information from the data set and is hardly influenced by the past values.

The second choice is to focus on the low-frequency trend, which leads to a very large lengthscale, as shown in the right picture. But the resulting confidence intervals are overconfident.

  • Second, we sum both kernels together.
    $$k(x, y)=k_{rbf1}(x, y)+k_{rbf2}(x, y)$$
    [Figure: GP model with the sum of two RBF kernels]

One thing to notice is that although the second (combined) choice has more parameters, its optimization is actually easier than for the first choice. For each of the first two models the likelihood ends up very small, because the data clearly contain both behaviours (short-scale variation and a long-term trend) and a single RBF kernel can only account for one of them.

  • Then, we add a periodic term to the kernel.
    $$k(x, y)=\sigma_{0}^{2} x^{2} y^{2}+k_{rbf1}(x, y)+k_{rbf2}(x, y)+k_{per}(x, y)$$

[Figure: GP model with the full composite kernel]
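For reference, here is a rough scikit-learn sketch of a composite kernel in the same spirit; the hyperparameter values are illustrative guesses, `DotProduct(...) ** 2` only approximately plays the role of the $\sigma_0^2 x^2 y^2$ term, and `X_co2`, `y_co2` are hypothetical names for the data arrays:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, DotProduct, ConstantKernel)

k_rbf1 = ConstantKernel(1.0) * RBF(length_scale=50.0)   # long-term trend
k_rbf2 = ConstantKernel(1.0) * RBF(length_scale=1.0)    # short-scale variation
k_per  = ConstantKernel(1.0) * ExpSineSquared(length_scale=1.0, periodicity=1.0)  # yearly cycle
k_quad = DotProduct(sigma_0=1e-3) ** 2                  # roughly the sigma_0^2 x^2 y^2 term

kernel = k_quad + k_rbf1 + k_rbf2 + k_per
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2)
# gp.fit(X_co2, y_co2) would then tune all hyperparameters by maximum likelihood.
```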

Sum of kernels over tensor space

Property:
$$k(\mathbf{x}, \mathbf{y})=k_{1}\left(x_{1}, y_{1}\right)+k_{2}\left(x_{2}, y_{2}\right)$$
is a valid covariance structure.


Tensor Additive kernels are very useful for:

  • Approximating additive functions
  • Building models over high dimensional input space

Remark:

  1. From a GP point of view, $k$ is the kernel of $Z(\mathbf{x}) = Z_1(x_1) + Z_2(x_2)$.
  2. It is straightforward to show that the mean predictor is additive.

$$\begin{aligned} m(\mathbf{x}) &=\left(k_{1}(x, X)+k_{2}(x, X)\right) k(X, X)^{-1} F \\ &=\underbrace{k_{1}\left(x_{1}, X_{1}\right) k(X, X)^{-1} F}_{m_{1}\left(x_{1}\right)}+\underbrace{k_{2}\left(x_{2}, X_{2}\right) k(X, X)^{-1} F}_{m_{2}\left(x_{2}\right)} \end{aligned}$$

  3. The prediction variance has interesting features.

    [Figure: prediction variance with a standard kernel (left) and with an additive kernel (right)]

The right panel comes from an additive kernel: even in regions far away from the observation points, the variance does not get too high. The reason is that the prior (the kernel) is additive, so three observations located at three corners of a rectangle essentially determine the value at the fourth corner, and the predictive variance there is small. This structure in the prior is retrieved in the posterior.

This property can be used to construct a design of experiments that covers the space, especially for high-dimensional input spaces, with only $cst \times d$ points.
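A small numpy sketch of the rectangle argument above (synthetic numbers, my own illustration): with an additive kernel over the tensor space, observing three corners of a rectangle pins down the fourth, so its predictive variance is essentially zero:

```python
import numpy as np

def rbf(a, b, theta=1.0, sigma2=1.0):
    """1-D squared-exponential kernel between two vectors of inputs."""
    return sigma2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / theta ** 2)

def k_add(X, Y):
    """Additive kernel over the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)."""
    return rbf(X[:, 0], Y[:, 0]) + rbf(X[:, 1], Y[:, 1])

# Three corners of a rectangle are observed; we predict at the fourth one.
X = np.array([[0., 0.], [0., 1.], [1., 0.]])
F = np.array([1.0, 2.0, 3.0])
x_new = np.array([[1., 1.]])

K_inv = np.linalg.inv(k_add(X, X) + 1e-8 * np.eye(3))
k_x = k_add(x_new, X)
mean = k_x @ K_inv @ F                              # = F[1] + F[2] - F[0] = 4.0
var = k_add(x_new, x_new) - k_x @ K_inv @ k_x.T     # ~0 (exactly 0 up to the jitter)
print(mean, var)
```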


Product over the same space

Property:
$$k(x, y)=k_{1}(x, y) \times k_{2}(x, y)$$
is a valid covariance structure.


Product over the tensor space

$$k(\mathbf{x}, \mathbf{y})=k_{1}\left(x_{1}, y_{1}\right) \times k_{2}\left(x_{2}, y_{2}\right)$$
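For instance, the product of two 1-D squared-exponential kernels over separate coordinates is exactly the 2-D squared-exponential kernel with one lengthscale per dimension, as this small numpy check (my own sketch) illustrates:

```python
import numpy as np

def rbf1d(a, b, theta):
    """1-D squared-exponential kernel."""
    return np.exp(-(a - b) ** 2 / theta ** 2)

x = np.array([0.3, 0.7])       # x = (x1, x2)
y = np.array([0.9, 0.1])       # y = (y1, y2)
theta = np.array([0.5, 2.0])   # one lengthscale per dimension

k_tensor_product = rbf1d(x[0], y[0], theta[0]) * rbf1d(x[1], y[1], theta[1])
k_rbf_2d = np.exp(-np.sum((x - y) ** 2 / theta ** 2))

print(np.isclose(k_tensor_product, k_rbf_2d))  # True: the two constructions coincide
```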


Composition with a function

$$k(x, y)=k_{1}(f(x), f(y))$$
Proof:
$$\sum_i \sum_j a_{i} a_{j} k\left(x_{i}, x_{j}\right)=\sum_i \sum_j a_{i} a_{j} k_{1}\left(\underbrace{f\left(x_{i}\right)}_{y_{i}}, \underbrace{f\left(x_{j}\right)}_{y_{j}}\right) \geq 0$$

This can be seen as a nonlinear rescaling of the input space.
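Echoing the proof, a quick numerical check (my own sketch) that a kernel composed with $f = \log$ still yields a positive semi-definite Gram matrix:

```python
import numpy as np

def k1(u, v, theta=1.0):
    """Base squared-exponential kernel, evaluated on already-remapped inputs."""
    return np.exp(-(u[:, None] - v[None, :]) ** 2 / theta ** 2)

f = np.log                    # the nonlinear rescaling of the input space
x = np.linspace(1.0, 100.0, 8)

K = k1(f(x), f(x))            # Gram matrix of k(x, y) = k1(f(x), f(y))
print(np.allclose(K, K.T))                      # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # eigenvalues non-negative: still PSD
```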

Periodicity detection

Given a few observations, can we extract the periodic part of a signal?

As previously, we build a decomposition of the process into two independent GPs:
$$Z=Z_{p}+Z_{a}$$
where $Z_{p}$ is a GP in the span of the Fourier basis
$$B(t)=(\sin(t), \cos(t), \ldots, \sin(nt), \cos(nt))^{t}$$
Note that aperiodic here means that the projection of $Z_a$ onto the $\sin$/$\cos$ space is zero.

And it can be proved that
$$\begin{aligned} k_{p}(x, y)&=B(x)^{t} G^{-1} B(y) \\ k_{a}(x, y)&=k(x, y)-k_{p}(x, y) \end{aligned}$$
where $G$ is the Gram matrix associated to $B$ in the RKHS.

As previously, a decomposition of the model comes with a decomposition of the kernel:
$$\begin{aligned} m(x) &=\left(k_{p}(x, X)+k_{a}(x, X)\right) k(X, X)^{-1} F \\ &= k_{p}(x, X)\, k(X, X)^{-1} F+\underbrace{k_{a}(x, X)\, k(X, X)^{-1} F}_{\text{aperiodic sub-model } m_{a}} \end{aligned}$$
and we can associate a prediction variance to the sub-models:
$$\begin{aligned} v_{p}(x) &=k_{p}(x, x)-k_{p}(x, X)\, k(X, X)^{-1} k_{p}(X, x) \\ v_{a}(x) &=k_{a}(x, x)-k_{a}(x, X)\, k(X, X)^{-1} k_{a}(X, x) \end{aligned}$$
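A sketch of how these sub-model formulas could be evaluated numerically, assuming the kernels $k$, $k_p$, $k_a$ are available as vectorised Python functions (the helper below is hypothetical, not part of the lecture material):

```python
import numpy as np

def sub_model(x, X, F, k, k_sub, jitter=1e-8):
    """Posterior mean and variance of one sub-model (periodic or aperiodic):
        m_sub(x) = k_sub(x, X) k(X, X)^{-1} F
        v_sub(x) = k_sub(x, x) - k_sub(x, X) k(X, X)^{-1} k_sub(X, x)
    Kernels are assumed vectorised: k(A, B) returns the matrix of k(A_i, B_j)."""
    K_inv = np.linalg.inv(k(X, X) + jitter * np.eye(len(X)))
    k_x = k_sub(x, X)
    mean = k_x @ K_inv @ F
    var = np.diag(k_sub(x, x)) - np.einsum("ij,jk,ik->i", k_x, K_inv, k_x)
    return mean, var

# Usage sketch:
#   m_p, v_p = sub_model(x_test, X, F, k, k_p)   # periodic part
#   m_a, v_a = sub_model(x_test, X, F, k, k_a)   # aperiodic part
```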
