Task4-tppca的理论推导-1
(该博客是自己在完成老师任务时经过查阅资料自行推导并整理的笔记,在变量的字母符号设置上前后有所差异,但不影响阅读,如有错误,请联系我,此笔记仅用于学习,如需转载,请注明来源,谢谢!)
1 tppca模型
1.1 预备知识
tppca模型是ppca模型在t分布下的推广,因此,要定义tppca模型,必先了解多元t分布 t ( μ , Σ , ν ) t(\mu,\Sigma,\nu) t(μ,Σ,ν)的概率密度:
p ( t ∣ μ , Σ , v ) = Γ ( ( v + d ) / 2 ) ∣ Σ ∣ − 1 / 2 Γ ( v / 2 ) ( v π ) d / 2 × [ 1 + ( t − μ ) T Σ − 1 ( t − μ ) / v ] − ( v + d ) / 2 \begin{aligned} p(\mathbf{t} \mid \boldsymbol{\mu}, \mathbf{\Sigma}, v)=& \frac{\Gamma((v+d) / 2)|\mathbf{\Sigma}|^{-1 / 2}}{\Gamma(v / 2)(v \pi)^{d / 2}} \\ & \times\left[1+(\mathbf{t}-\boldsymbol{\mu})^{\mathrm{T}} \mathbf{\Sigma}^{-1}(\mathbf{t}-\boldsymbol{\mu}) / v\right]^{-(v+d) / 2} \end{aligned} p(t∣μ,Σ,v)=Γ(v/2)(vπ)d/2Γ((v+d)/2)∣Σ∣−1/2×[1+(t−μ)TΣ−1(t−μ)/v]−(v+d)/2
其中, Σ \mathbf{\Sigma} Σ是对称和正定的,如果 ν > 1 \nu>1 ν>1时 μ \boldsymbol{\mu} μ是均值向量,如果 ν > 2 \nu>2 ν>2时, Σ ν ν − 2 \frac{\mathbf{\Sigma} \nu}{\nu-2} ν−2Σν是协方差矩阵;多元正态分布 N ( μ , Σ ) \mathscr{N}(\boldsymbol{\mu}, \mathbf{\Sigma}) N(μ,Σ)是 t ( μ , Σ , ∞ ) t(\boldsymbol{\mu}, \mathbf{\Sigma}, \infty) t(μ,Σ,∞)的极限。
通常直接对t分布求解参数是困难的,因此常常把多元t分布看作是一个特殊的高斯混合。例如:要从多元t分布 t ( μ , Σ , ν ) t(\mu,\Sigma,\nu) t(μ,Σ,ν)中抽取一个t向量,我们引入一个潜变量 τ \tau τ,且 τ ∼ G a m m a ( v / 2 , v / 2 ) \tau \sim Gamma(v / 2, v / 2) τ∼Gamma(v/2,v/2),概率密度函数如下:
p ( τ ) = p ( τ ; ν 2 , ν 2 ) = ν 2 ν 2 τ ν 2 − 1 Γ ( ν 2 ) exp { − ν 2 τ } p(\tau)=p(\tau ; \frac{\nu}{2}, \frac{\nu}{2})=\frac{\frac{\nu}{2}^{\frac{\nu}{2}} \tau^{\frac{\nu}{2}-1}}{\Gamma(\frac{\nu}{2})} \exp \{-\frac{\nu}{2} \tau\} p(τ)=p(τ;2ν,2ν)=Γ(2ν)2ν2ντ2ν−1exp{−2ντ}
当 τ \tau τ给定时,向量 t t t是从多元正态分布 N ( μ , Σ / τ ) \mathscr{N}(\boldsymbol{\mu}, \mathbf{\Sigma} / \tau) N(μ,Σ/τ)中抽取的,则容易验证向量 t t t的边际分布就是 t ( μ , Σ , ν ) t(\mu,\Sigma,\nu) t(μ,Σ,ν)。
具体验证过程如下:
p ( t ) = ∫ 0 + ∞ p ( t ∣ τ ) p ( τ ) d τ = ∫ 0 + ∞ ( 2 π ) − d 2 ∣ Σ τ ∣ − 1 2 exp { − 1 2 ( t − μ ) ⊤ ( Σ τ ) − 1 ( t − μ ) } ⋅ ( ν 2 ) ν 2 Γ ( ν 2 ) ⋅ τ ν 2 − 1 exp { − ν 2 τ } d τ = ( 2 π ) − d 2 ∣ Σ ∣ − 1 2 ( ν 2 ) ν 2 Γ ( ν 2 ) ∫ 0 + ∞ τ d + v 2 − 1 exp { − ( t − μ ) T Σ − 1 ( t − μ ) 2 τ } d τ = ( 2 π ) d 2 ∣ Σ ∣ 1 2 ( ν 2 ) ν 2 Γ ( ν 2 ) Γ ( d + v 2 ) [ ( t − μ ) T Σ − 1 ( t − μ ) 2 + ν ] d + v 2 ⋅ ∫ 0 + ∞ [ ( t − μ ) T Σ − 1 ( t − μ ) 2 ] d + v 2 Γ ( d + v 2 ) τ d + v 2 − 1 exp { − ( t − μ ) T Σ − 1 ( t − μ ) 2 τ } d τ = Γ ( ( v + d ) / 2 ) ∣ Σ ∣ − 1 / 2 Γ ( v / 2 ) ( v π ) d / 2 × [ 1 + ( t − μ ) T Σ − 1 ( t − μ ) / v ] − ( v + d ) / 2 \begin{aligned} p(t) &=\int_{0}^{+\infty} p(t \mid \tau) p(\tau) d \tau \\ &=\int_{0}^{+\infty}(2 \pi)^{-\frac{d}{2}}\left|\frac{\Sigma}{\tau}\right|^{-\frac{1}{2}} \exp \left\{-\frac{1}{2}(t-\mu)^{\top}\left(\frac{\Sigma}{\tau}\right)^{-1}(t-\mu)\right\}\\ &\cdot \frac{\left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}}{\Gamma\left(\frac{\nu}{2}\right)} \cdot \tau^{\frac{\nu}{2}-1} \exp \left\{-\frac{\nu}{2} \tau\right\} d \tau \\ &=(2 \pi)^{-\frac{d}{2}} \mid \Sigma \mid^{-\frac{1}{2}} \frac{\left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}}{\Gamma(\frac{\nu}{2})}\int_{0}^{+\infty} \tau^{\frac{d+v}{2}-1} \exp\{-\frac{(t-\mu)^T \Sigma^{-1} (t-\mu)}{2} \tau\} d \tau \\ &=(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}} \frac{\left(\frac{\nu}{2}\right)^{\frac{\nu}{2}}}{\Gamma(\frac{\nu}{2})} \frac{\Gamma(\frac{d+v}{2})}{[\frac{(t-\mu)^T \Sigma^{-1} (t-\mu)}{2}+\nu]^{\frac{d+v}{2}}} \\ &\cdot \int_{0}^{+\infty} \frac{[\frac{(t-\mu)^T \Sigma^{-1} (t-\mu)}{2}]^{\frac{d+v}{2}}}{\Gamma(\frac{d+v}{2})} \tau^{\frac{d+v}{2}-1} \exp\{-\frac{(t-\mu)^T \Sigma^{-1} (t-\mu)}{2} \tau\} d \tau \\ &= \frac{\Gamma((v+d) / 2)|\mathbf{\Sigma}|^{-1 / 2}}{\Gamma(v / 2)(v \pi)^{d / 2}} \\ & \times\left[1+(\mathbf{t}-\boldsymbol{\mu})^{\mathrm{T}} \mathbf{\Sigma}^{-1}(\mathbf{t}-\boldsymbol{\mu}) / v\right]^{-(v+d) / 2} \end{aligned} p(t)=∫0+∞p(t∣τ)p(τ)dτ=∫0+∞(2π)−2d τΣ −21exp{−21(t−μ)⊤(τΣ)−1(t−μ)}⋅Γ(2ν)(2ν)2ν⋅τ2ν−1exp{−2ντ}dτ=(2π)−2d∣Σ∣−21Γ(2ν)(2ν)2ν∫0+∞τ2d+v−1exp{−2(t−μ)TΣ−1(t−μ)τ}dτ=(2π)2d∣Σ∣21Γ(2ν)(2ν)2ν[2(t−μ)TΣ−1(t−μ)+ν]2d+vΓ(2d+v)⋅∫0+∞Γ(2d+v)[2(t−μ)TΣ−1(t−μ)]2d+vτ2d+v−1exp{−2(t−μ)TΣ−1(t−μ)τ}dτ=Γ(v/2)(vπ)d/2Γ((v+d)/2)∣Σ∣−1/2×[1+(t−μ)TΣ−1(t−μ)/v]−(v+d)/2
也就是说t的边际分布就是对 p ( t , τ ) p(t,\tau) p(t,τ)的联合分布关于 τ \tau τ积分所得,这个分布便是 t ( μ , Σ , ν ) t(\mu,\Sigma,\nu) t(μ,Σ,ν)。
1.2 模型形式
t
n
∣
τ
n
,
x
n
=
W
x
n
+
μ
n
+
ε
n
x
n
∣
τ
n
∼
N
(
0
,
I
/
τ
n
)
,
ε
n
∣
τ
n
∼
N
(
0
,
σ
2
I
/
τ
n
)
τ
n
∼
G
a
(
v
/
2
,
v
/
2
)
\begin{array}{l} \mathbf{t}_n \mid \tau_n, \mathbf{x}_n=\mathbf{W} \mathbf{x}_n+\boldsymbol{\mu_n}+\boldsymbol{\varepsilon_n} \\ \mathbf{x}_n|\tau_n \sim \mathscr{N}(\mathbf{0}, \mathbf{I} / \tau_n), \quad \boldsymbol{\varepsilon_n}| \tau_n \sim \mathscr{N}\left(\mathbf{0}, \sigma^{2} \mathbf{I} / \tau_n \right) \\ \tau_n \sim Ga(v / 2, v / 2) \end{array}
tn∣τn,xn=Wxn+μn+εnxn∣τn∼N(0,I/τn),εn∣τn∼N(0,σ2I/τn)τn∼Ga(v/2,v/2)
其中,
t
n
\mathbf{t}_n
tn表示第
n
n
n个观测数据,维度为
d
×
1
d \times 1
d×1,
x
n
\mathbf{x}_n
xn代表潜在因子,维度为
q
×
1
q \times 1
q×1,噪声
ε
n
\varepsilon_n
εn假设是同方差的,且因子与噪声之间是互不相关的。
该模型中待估参数是: σ 2 , v , μ , W \sigma^{2} ,v ,\boldsymbol{\mu},\mathbf{W} σ2,v,μ,W
2 推导要用的分布
2.1 ( t n ∣ x n , τ n ) (t_{n} \mid x_{n}, \tau_{n}) (tn∣xn,τn)的分布
当
x
n
和
τ
n
x_{n}和 \tau_{n}
xn和τn给定时,两者可以看作是固定的,所以模型公式中
t
n
∣
τ
n
,
x
n
=
W
x
n
+
μ
+
ε
n
\mathbf{t}_n \mid \tau_n, \mathbf{x}_n=\mathbf{W} \mathbf{x}_n+\boldsymbol{\mu}+\boldsymbol{\varepsilon_n}
tn∣τn,xn=Wxn+μ+εn只有
ε
n
\boldsymbol{\varepsilon_n}
εn是随机的,且是正态分布,因此
t
n
t_n
tn也应该服从正态分布。
E
[
t
n
]
=
E
[
w
x
n
+
μ
+
ϵ
n
]
=
x
n
+
μ
+
E
(
ε
n
)
=
w
x
n
+
μ
Cov
(
t
n
)
=
Cov
(
ε
n
)
=
σ
2
I
/
τ
n
\begin{array}{l} E\left[t_{n}\right]=E\left[w x_{n}+\mu+\epsilon_{n}\right]= x_{n}+\mu+E\left(\varepsilon_{n}\right)=w x_{n}+\mu \\ \operatorname{Cov}\left(t_{n}\right)=\operatorname{Cov}\left(\varepsilon_{n}\right)=\sigma^{2} I / \tau_n \end{array}
E[tn]=E[wxn+μ+ϵn]=xn+μ+E(εn)=wxn+μCov(tn)=Cov(εn)=σ2I/τn
故:
t
n
∣
x
n
,
τ
n
∼
N
(
μ
+
w
x
n
,
σ
2
I
/
τ
n
)
\operatorname{t_n} \mid x_{n}, \tau _n \sim N\left(\mu+w x_{n}, \sigma^{2} I / \tau_n\right)
tn∣xn,τn∼N(μ+wxn,σ2I/τn)
2.2 ( t n ∣ τ n ) (t_{n} \mid \tau_{n}) (tn∣τn)的分布
只给定 τ n \tau_n τn时, x n x_n xn和 ε n \varepsilon_n εn都是随机变量,且都服从正态分布,故 ( t n ∣ τ n ) (t_{n} \mid \tau_{n}) (tn∣τn)也服从正态分布。
有 x n ∣ τ n ∼ N ( 0 , I / τ n ) , ε n ∣ τ n ∼ N ( 0 , σ 2 I / τ n ) x_{n}\left|\tau_{n} \sim N\left(0, I / \tau_{n}\right), \quad \varepsilon_{n}\right| \tau_{n} \sim N\left(0, \sigma^{2} I / \tau_{n}\right) xn∣τn∼N(0,I/τn),εn∣τn∼N(0,σ2I/τn)
E [ t n ∣ τ n ] = E [ W x n + μ + ϵ n ] = W E x n + μ + E ( ε n ) = μ \begin{aligned} E\left[t_{n} \mid \tau_{n}\right]=E\left[W x_{n}+\mu+\epsilon_{n}\right] &=W E x_{n}+\mu+E\left(\varepsilon_{n}\right) \\ &=\mu \end{aligned} E[tn∣τn]=E[Wxn+μ+ϵn]=WExn+μ+E(εn)=μ
c o v ( t n ∣ τ n ) = c o v ( W x n + μ + ε n ) = W W T + σ 2 I τ n cov(t_n \mid \tau_n) = cov(W x_n +\mu +\varepsilon_n) = \frac{WW^T+\sigma^2 I}{\tau_n} cov(tn∣τn)=cov(Wxn+μ+εn)=τnWWT+σ2I
故: ( t n ∣ τ n ) ∼ N ( μ , Σ / τ n ) (t_{n} \mid \tau_{n}) \sim N(\mu,\Sigma/\tau_n) (tn∣τn)∼N(μ,Σ/τn)
2.3 相关分布
由预备知识可知, τ n ∼ G a ( ν 2 , ν 2 ) \tau_n \sim Ga(\frac{\nu}{2},\frac{\nu}{2}) τn∼Ga(2ν,2ν), t n ∣ τ n ∼ N ( μ , Σ / τ n ) t_n | \tau_n \sim N(\mu,\Sigma/\tau_n) tn∣τn∼N(μ,Σ/τn), x n ∣ τ n ∼ N ( 0 , I / τ n ) x_n | \tau_n \sim N(0,I/\tau_n) xn∣τn∼N(0,I/τn), ε n ∣ τ n ∼ N ( 0 , σ 2 I / τ n ) \varepsilon_n | \tau_n \sim N(0,\sigma^2 I/\tau_n) εn∣τn∼N(0,σ2I/τn),由以上分布及t分布的性质,可知 t n ∼ t ( μ , Σ , ν ) t_n \sim t(\mu,\Sigma,\nu) tn∼t(μ,Σ,ν), x n ∼ t ( 0 , I , ν ) x_n \sim t(0,I,\nu) xn∼t(0,I,ν), ε n ∼ t ( 0 , σ 2 I , ν ) \varepsilon_n \sim t(0,\sigma^2 I,\nu) εn∼t(0,σ2I,ν)。
由此,我们可以看到在模型公式中 t n = W x n + μ n + ε n \mathbf{t}_n=\mathbf{W} \mathbf{x}_n+\boldsymbol{\mu_n}+\boldsymbol{\varepsilon_n} tn=Wxn+μn+εn,潜变量 x n x_n xn,噪声 ε n \varepsilon_n εn以及观测数据 t n t_n tn均服从t分布,因此ppca模型在t分布下的推广便是今天所要讨论的tppca模型。
3 tppca模型的参数估计
要计算模型的参数估计,主要的方法是极大似然估计。由上文可知,观测数据的分布为
t
n
∼
t
(
μ
,
Σ
,
ν
)
t_n \sim t(\mu,\Sigma,\nu)
tn∼t(μ,Σ,ν),因此对数似然函数
L
(
Θ
)
=
∑
n
=
1
N
log
(
p
(
t
n
∣
μ
,
Σ
,
ν
)
)
∝
∑
n
=
1
N
{
log
(
ν
+
d
2
)
−
1
2
log
(
∣
Σ
∣
)
+
ν
2
log
(
ν
)
−
log
(
Γ
(
ν
2
)
)
−
ν
+
d
2
log
(
ν
+
(
t
n
−
μ
)
T
Σ
−
1
(
t
n
−
μ
)
)
}
\begin{aligned} \textrm{L}(\Theta) &= \sum_{n=1}^{N} \log(p(t_n \mid \mu,\Sigma,\nu)) \\ &\propto \sum_{n=1}^{N} \{ \log(\frac{\nu+d}{2}) - \frac{1}{2} \log(|\Sigma|) + \frac{\nu}{2}\log(\nu) - \log(\Gamma(\frac{\nu}{2})) -\frac{\nu+d}{2}\log(\nu+(t_n-\mu)^{T}\Sigma^{-1}(t_n-\mu))\} \end{aligned}
L(Θ)=n=1∑Nlog(p(tn∣μ,Σ,ν))∝n=1∑N{log(2ν+d)−21log(∣Σ∣)+2νlog(ν)−log(Γ(2ν))−2ν+dlog(ν+(tn−μ)TΣ−1(tn−μ))}
上式中,由于其中
Σ
=
W
W
T
+
σ
2
I
\Sigma = WW^T+\sigma^2 I
Σ=WWT+σ2I,故参数空间
Θ
=
(
μ
,
W
,
σ
2
,
ν
)
\Theta = (\mu,W,\sigma^2,\nu)
Θ=(μ,W,σ2,ν),要得到上述参数的MLE,必须分别对每个参数求导,真实似然很难求导得到想要的参数估计,因此我们利用t分布的性质,引入潜在变量
τ
\tau
τ,利用EM类型的算法进行求解。
3.1 完全数据的对数似然函数
观测数据为
t
n
t_n
tn,潜在因子
x
n
x_n
xn,新引入的潜变量为
τ
n
\tau_n
τn,因此完全数据为
(
t
n
,
x
n
,
τ
n
)
(t_n,x_n,\tau_n)
(tn,xn,τn),参数向量为
Θ
=
(
μ
,
W
,
σ
2
,
ν
)
\Theta = (\mu,W,\sigma^2,\nu)
Θ=(μ,W,σ2,ν),因此完全数据的对数似然函数为:
L
c
(
Θ
)
=
∑
n
=
1
N
log
(
p
(
t
n
,
x
n
,
τ
n
)
)
=
∑
n
=
1
N
log
[
p
(
t
n
∣
x
n
,
τ
n
)
p
(
x
n
∣
τ
n
)
p
(
τ
n
)
]
=
∑
n
=
1
N
{
−
d
2
log
(
2
π
)
−
d
2
log
(
σ
2
)
+
d
2
log
(
τ
n
)
−
τ
n
2
σ
2
(
t
n
−
W
x
n
−
μ
)
T
(
t
n
−
W
x
n
−
μ
)
−
q
2
log
(
2
π
)
+
d
2
log
(
τ
n
)
−
t
n
2
x
n
T
x
n
+
ν
2
log
(
ν
2
)
−
log
(
Γ
(
ν
2
)
)
+
ν
2
(
log
(
τ
n
)
−
τ
n
)
−
log
(
t
n
)
}
(
去掉与参数无关项
)
∝
−
N
d
2
log
(
σ
2
)
−
1
2
σ
2
∑
n
=
1
N
{
τ
n
∥
t
n
−
μ
∥
2
−
2
τ
n
x
n
T
W
T
(
t
n
−
μ
)
+
t
r
[
W
τ
n
x
n
x
n
T
W
T
]
}
+
N
ν
2
log
(
ν
2
)
−
N
log
(
Γ
(
ν
2
)
)
+
ν
2
∑
n
=
1
N
(
log
τ
n
−
τ
n
)
=
L
(
θ
)
+
L
(
ν
)
\begin{aligned} L_c(\Theta) &= \sum_{n=1}^{N} \log(p(t_n,x_n,\tau_n)) \\ &= \sum_{n=1}^{N} \log[p(t_n|x_n,\tau_n)p(x_n|\tau_n)p(\tau_n)] \\ &=\sum_{n=1}^{N} \{-\frac{d}{2} \log(2\pi) - \frac{d}{2} \log(\sigma^2) + \frac{d}{2} \log(\tau_n) - \frac{\tau_n}{2\sigma^2}(t_n - W x_n - \mu)^{T}(t_n - W x_n - \mu) \\ &-\frac{q}{2} \log(2\pi) + \frac{d}{2} \log(\tau_n) - \frac{t_n} {2}x_n^{T}x_n \\ &+\frac{\nu}{2} \log(\frac{\nu}{2}) - \log(\Gamma(\frac{\nu}{2})) + \frac{\nu}{2}(\log(\tau_n)-\tau_n)-\log(t_n)\}(去掉与参数无关项)\\ &\propto-\frac{Nd}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\{\tau_n\|t_n-\mu\|^2 -2\tau_n x_n^{T}W^{T}(t_n-\mu) + tr[W\tau_n x_n x_n^{T}W^T] \}\\ &+\frac{N\nu}{2}\log(\frac{\nu}{2}) - N\log(\Gamma(\frac{\nu}{2})) + \frac{\nu}{2} \sum_{n=1}^{N}(\log\tau_n - \tau_n)\\ &=L(\theta) +L(\nu) \end{aligned}
Lc(Θ)=n=1∑Nlog(p(tn,xn,τn))=n=1∑Nlog[p(tn∣xn,τn)p(xn∣τn)p(τn)]=n=1∑N{−2dlog(2π)−2dlog(σ2)+2dlog(τn)−2σ2τn(tn−Wxn−μ)T(tn−Wxn−μ)−2qlog(2π)+2dlog(τn)−2tnxnTxn+2νlog(2ν)−log(Γ(2ν))+2ν(log(τn)−τn)−log(tn)}(去掉与参数无关项)∝−2Ndlog(σ2)−2σ21n=1∑N{τn∥tn−μ∥2−2τnxnTWT(tn−μ)+tr[WτnxnxnTWT]}+2Nνlog(2ν)−Nlog(Γ(2ν))+2νn=1∑N(logτn−τn)=L(θ)+L(ν)
根据以上完全数据的似然函数,要想使用EM算法,在观测数据
t
n
t_n
tn下,
τ
n
\tau_n
τn和
x
n
x_n
xn的后验分布。
3.2 给定观测样本,两个潜变量的后验分布
3.2.1 更新 τ n \tau_n τn的分布
p ( τ n ∣ t n ) ∝ p ( t n ∣ τ n ) p ( τ n ) = ( 2 π ) − d 2 ∣ Σ ∣ − 1 2 τ d + ν 2 − 1 exp { − ( t n − μ ) T Σ − 1 ( t n − μ ) + ν 2 τ n } ∝ τ d + ν 2 − 1 exp { − ( t n − μ ) T Σ − 1 ( t n − μ ) + ν 2 τ n } = G a ( d + ν 2 , ( t n − μ ) T Σ − 1 ( t n − μ ) + ν 2 ) \begin{aligned} p(\tau_n|t_n) &\propto p(t_n|\tau_n)p(\tau_n) \\ &=(2\pi)^{-\frac{d}{2}}|\Sigma|^{-\frac{1}{2}}\tau^{\frac{d+\nu}{2}-1} \exp\{-\frac{(t_n-\mu)^{T}\Sigma^{-1}(t_n-\mu)+\nu}{2} \tau_n\} \\ &\propto \tau^{\frac{d+\nu}{2}-1} \exp\{-\frac{(t_n-\mu)^{T}\Sigma^{-1}(t_n-\mu)+\nu}{2} \tau_n\} \\ &=Ga(\frac{d+\nu}{2},\frac{(t_n-\mu)^{T}\Sigma^{-1}(t_n-\mu)+\nu}{2}) \end{aligned} p(τn∣tn)∝p(tn∣τn)p(τn)=(2π)−2d∣Σ∣−21τ2d+ν−1exp{−2(tn−μ)TΣ−1(tn−μ)+ντn}∝τ2d+ν−1exp{−2(tn−μ)TΣ−1(tn−μ)+ντn}=Ga(2d+ν,2(tn−μ)TΣ−1(tn−μ)+ν)
3.2.2 更新潜在因子 x n x_n xn的分布
p
(
x
n
∣
t
n
,
τ
n
)
∝
p
(
t
n
∣
x
n
,
τ
n
)
p
(
x
n
∣
τ
n
)
∝
exp
{
−
1
2
[
τ
n
x
n
T
(
W
t
W
+
σ
2
I
)
x
n
σ
2
−
2
τ
n
x
n
T
W
T
(
t
n
−
μ
)
σ
2
]
}
\begin{aligned} p(x_n|t_n,\tau_n) &\propto p(t_n|x_n,\tau_n)p(x_n|\tau_n) \\ &\propto \exp\{-\frac{1}{2}[\frac{\tau_n x_n^{T}(W^{t}W + \sigma^2 I)x_n}{\sigma^2}-\frac{2\tau_n x_n^{T}W^{T}(t_n-\mu)}{\sigma^2}]\} \end{aligned}
p(xn∣tn,τn)∝p(tn∣xn,τn)p(xn∣τn)∝exp{−21[σ2τnxnT(WtW+σ2I)xn−σ22τnxnTWT(tn−μ)]}
由于
t
n
∣
x
n
,
τ
n
t_n|x_n,\tau_n
tn∣xn,τn是正态分布,
x
n
∣
τ
n
x_n|\tau_n
xn∣τn也是正态分布,因此后验分布
x
n
∣
t
n
,
τ
n
x_n|t_n,\tau_n
xn∣tn,τn必然也是正态分布,假设
x
n
∣
t
n
,
τ
n
∼
N
(
m
,
Ψ
)
x_n|t_n,\tau_n \sim N(m,\Psi)
xn∣tn,τn∼N(m,Ψ),则有其概率密度函数的形式为:
p
(
x
n
∣
t
n
,
τ
n
)
∝
exp
{
−
1
2
(
x
n
T
Ψ
−
1
x
n
−
2
x
n
T
Ψ
−
1
m
)
}
\begin{aligned} p(x_n|t_n,\tau_n) &\propto \exp\{-\frac{1}{2}(x_n^{T}\Psi^{-1}x_n - 2 x_n^{T}\Psi^{-1}m)\} \end{aligned}
p(xn∣tn,τn)∝exp{−21(xnTΨ−1xn−2xnTΨ−1m)}
故由上式可得:
Ψ
−
1
=
τ
n
W
T
W
+
σ
2
I
σ
2
Ψ
=
τ
n
−
1
σ
2
M
−
1
=
τ
n
−
1
P
=
τ
n
−
1
[
I
−
W
T
Σ
−
1
W
]
m
=
W
T
Σ
−
1
(
t
n
−
μ
)
\begin{aligned} \Psi^{-1} = \tau_n \frac{W^{T}W + \sigma^2 I}{\sigma^2} \\ \Psi = \tau_n^{-1} \sigma^2M^{-1} = \tau_n^{-1} P = \tau_n^{-1} [I-W^T \Sigma^{-1}W] \\ m = W^T \Sigma^{-1}(t_n-\mu) \end{aligned}
Ψ−1=τnσ2WTW+σ2IΨ=τn−1σ2M−1=τn−1P=τn−1[I−WTΣ−1W]m=WTΣ−1(tn−μ)
3.3 Q函数及其最大化
Q函数是完全数据的对数似然关于两个潜变量 x n x_n xn和 τ n \tau_n τn的后验期望。
假设 Z ∼ G a ( α , β ) Z \sim Ga(\alpha,\beta) Z∼Ga(α,β),故 E ( Z ) = α / β E(Z) = \alpha/\beta E(Z)=α/β,因此, E ( τ n ∣ t n ) = d + ν ( t n − μ ) T Σ − 1 ( t n − μ ) + ν E(\tau_n|t_n)=\frac{d+\nu}{(t_n-\mu)^{T}\Sigma^{-1}(t_n-\mu)+\nu} E(τn∣tn)=(tn−μ)TΣ−1(tn−μ)+νd+ν
E ( log τ n ) = ψ ( d + ν 2 ) − log [ ( t n − μ ) T Σ − 1 ( t n − μ ) + ν 2 ] E(\log\tau_n) = \psi(\frac{d+\nu}{2}) - \log [\frac{(t_n-\mu)^{T}\Sigma^{-1}(t_n-\mu) + \nu}{2}] E(logτn)=ψ(2d+ν)−log[2(tn−μ)TΣ−1(tn−μ)+ν]
τ n \tau_n τn服从的是伽马分布,伽马分布属于指数族分布,而 log τ n \log\tau_n logτn是伽马分布的一个充分统计量,因此该公式的计算使用了指数组分布的充分统计量的期望,即 E ( T ) = b ( θ ) ˙ E(T)=\dot{b(\theta)} E(T)=b(θ)˙,利用此公式即可轻松计算 E ( log τ n ) E(\log\tau_n) E(logτn):
E ( x n ∣ t n , τ n ) = W T Σ − 1 ( t n − μ ) E(x_n|t_n,\tau_n) = W^{T}\Sigma^{-1}(t_n-\mu) E(xn∣tn,τn)=WTΣ−1(tn−μ)
E ( x n x n T ∣ t n , τ n ) = τ n − 1 σ 2 ( σ 2 I + W T W ) − 1 + E ( x n ∣ t n , τ n ) E ( x n T ∣ t n , τ n ) E(x_n x_n^{T}|t_n,\tau_n) = \tau_n^{-1} \sigma^2 (\sigma^2I + W^{T}W)^{-1} + E(x_n|t_n,\tau_n)E(x_n^{T}|t_n,\tau_n) E(xnxnT∣tn,τn)=τn−1σ2(σ2I+WTW)−1+E(xn∣tn,τn)E(xnT∣tn,τn)
E ( τ n x n T ∣ t n ) = E ( τ n E ( x n T ∣ t n ) ∣ t n ) = E ( τ n ∣ t n ) E ( x n T ∣ t n ) ( 因为 E ( x n T ∣ t n ) 是常数 ) E(\tau_n x_n^{T}|t_n) = E(\tau_n E(x_n^{T}|t_n)|t_n)=E(\tau_n|t_n)E(x_n^{T}|t_n) (因为E(x_n^{T}|t_n)是常数) E(τnxnT∣tn)=E(τnE(xnT∣tn)∣tn)=E(τn∣tn)E(xnT∣tn)(因为E(xnT∣tn)是常数)
E ( τ n x n x n T ∣ t n ) = E ( τ n E ( x n x n T ∣ t n , τ n ) ∣ t n ) = σ 2 M − 1 + E ( τ n ∣ t n ) E ( x n ∣ t n , τ n ) E ( x n T ∣ t n , τ n ) E(\tau_n x_n x_n^{T}|t_n) = E(\tau_n E(x_n x_n^{T}|t_n,\tau_n)|t_n)=\sigma^2 M^{-1} + E(\tau_n|t_n)E(x_n|t_n,\tau_n)E(x_n^{T}|t_n,\tau_n) E(τnxnxnT∣tn)=E(τnE(xnxnT∣tn,τn)∣tn)=σ2M−1+E(τn∣tn)E(xn∣tn,τn)E(xnT∣tn,τn)
其中, M = σ 2 I + W T W M = \sigma^{2} I +W^{T}W M=σ2I+WTW
根据上面的后验期望公式,完全数据的对数似然函数转化为Q函数为:
Q 1 ( Θ ) = Q 1 ( μ , W , σ 2 ) = − N d 2 log σ 2 − 1 2 σ 2 ∑ n = 1 N { E ( τ n ∣ t n ) ∥ t n − μ ∥ 2 − 2 E ( τ n x n T ∣ t n ) W T ( t n − μ ) + t r [ W E ( τ n x n x n T ) W T ] } \begin{aligned} Q_1(\Theta) &= Q_1(\mu,W,\sigma^2) \\ &=-\frac{Nd}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\{E(\tau_n|t_n)\|t_n-\mu\|^{2} - 2E(\tau_n x_n^{T}|t_n)W^{T}(t_n-\mu) + tr[WE(\tau_n x_n x_n^{T})W^{T}]\} \end{aligned} Q1(Θ)=Q1(μ,W,σ2)=−2Ndlogσ2−2σ21n=1∑N{E(τn∣tn)∥tn−μ∥2−2E(τnxnT∣tn)WT(tn−μ)+tr[WE(τnxnxnT)WT]}
Q 2 ( ν ) = N ν 2 log ν 2 − N log Γ ( ν 2 ) + ν 2 ∑ n = 1 N E ( log τ n − τ n ) \begin{aligned} Q_2(\nu) = \frac{N\nu}{2}\log\frac{\nu}{2} - N\log\Gamma(\frac{\nu}{2}) + \frac{\nu}{2} \sum_{n=1}^{N}E(\log\tau_n - \tau_n) \end{aligned} Q2(ν)=2Nνlog2ν−NlogΓ(2ν)+2νn=1∑NE(logτn−τn)
对上述的 Q 1 Q_1 Q1和 Q 2 Q_2 Q2函数分别关于 ( μ , W , σ 2 ) (\mu,W,\sigma^2) (μ,W,σ2)和 ν \nu ν求导数:
∂ Q 1 ∂ μ = 1 2 σ 2 ∑ n = 1 N { 2 E ( τ n ∣ t n ) ( t n − μ ) − 2 W E ( τ n x n ∣ t n ) } = 0 \frac{\partial Q_1}{\partial \mu} = \frac{1}{2\sigma^2} \sum_{n=1}^{N}\{2E(\tau_n|t_n)(t_n-\mu) - 2WE(\tau_n x_n|t_n)\} = 0 ∂μ∂Q1=2σ21n=1∑N{2E(τn∣tn)(tn−μ)−2WE(τnxn∣tn)}=0
∑ n = 1 N E ( τ n ∣ t n ) ( t n − μ ) = W ∑ n = 1 N E ( τ n ∣ t n ) E ( x n ∣ τ n ) \sum_{n=1}^{N}E(\tau_n|t_n)(t_n-\mu) =W \sum_{n=1}^{N} E(\tau_n|t_n) E(x_n|\tau_n) n=1∑NE(τn∣tn)(tn−μ)=Wn=1∑NE(τn∣tn)E(xn∣τn)
μ ( t + 1 ) = ∑ n = 1 N E ( τ n ∣ t n ) [ t n − W ( t + 1 ) E ( x n ∣ t n ) ] ∑ n = 1 N E ( τ n ∣ t n ) \mu^{(t+1)} = \frac{\sum_{n=1}^{N} E(\tau_n|t_n) [t_n - W^{(t+1)} E(x_n|t_n)]}{\sum_{n=1}^{N} E(\tau_n|t_n)} μ(t+1)=∑n=1NE(τn∣tn)∑n=1NE(τn∣tn)[tn−W(t+1)E(xn∣tn)]
令 ρ = 1 N ∑ n = 1 N E ( τ n ∣ t n ) \rho = \frac{1}{N} \sum_{n=1}^{N}E(\tau_n|t_n) ρ=N1∑n=1NE(τn∣tn), μ ∗ = 1 N ρ ∑ n = 1 N E ( τ n ∣ t n ) t n \mu^{*} = \frac{1}{N \rho} \sum_{n=1}^{N} E(\tau_n|t_n)t_n μ∗=Nρ1∑n=1NE(τn∣tn)tn,则有 μ \mu μ的更新迭代公式为:
μ ( t + 1 ) = μ ( t ) + σ 2 ( t ) [ Σ ( t ) ] − 1 ( μ ∗ ( t ) − μ ( t ) ) \mu^{(t+1)} = \mu^{(t)} + \sigma^{2(t)}[\Sigma^{(t)}]^{-1} (\mu^{*(t)} - \mu^{(t)}) μ(t+1)=μ(t)+σ2(t)[Σ(t)]−1(μ∗(t)−μ(t))
求解W:
d
Q
1
∣
W
=
1
2
σ
2
∑
n
=
1
N
{
2
t
r
[
(
t
n
−
μ
)
E
(
τ
n
x
n
T
∣
t
n
)
d
W
T
]
−
2
t
r
[
W
E
(
τ
n
x
n
x
n
T
d
W
T
)
]
}
=
1
σ
2
t
r
{
∑
n
=
1
N
[
(
t
n
−
μ
)
E
(
τ
n
x
n
T
)
−
W
E
(
τ
n
x
n
x
n
T
)
]
d
W
T
}
\begin{aligned} dQ1_{|W} &= \frac{1}{2\sigma^2} \sum_{n=1}^{N} \{2tr[(t_n-\mu) E(\tau_n x_n^{T}|t_n) dW^{T}] -2tr[W E(\tau_n x_n x_n^{T} dW^{T})]\} \\ &=\frac{1}{\sigma^2} tr\{\sum_{n=1}^{N}[(t_n-\mu)E(\tau_n x_n^{T}) - W E(\tau_n x_n x_n^{T})] dW^T \} \end{aligned}
dQ1∣W=2σ21n=1∑N{2tr[(tn−μ)E(τnxnT∣tn)dWT]−2tr[WE(τnxnxnTdWT)]}=σ21tr{n=1∑N[(tn−μ)E(τnxnT)−WE(τnxnxnT)]dWT}
∂ Q 1 ∂ W = 1 σ 2 ∑ n = 1 N { ( t n − μ ) E ( τ n x n T ) − W E ( τ n x n x n T ) } = 0 \frac{\partial Q_1}{\partial W} = \frac{1}{\sigma^2} \sum_{n=1}^{N} \{(t_n-\mu)E(\tau_n x_n^{T}) - W E(\tau_n x_n x_n^{T}) \}=0 ∂W∂Q1=σ21n=1∑N{(tn−μ)E(τnxnT)−WE(τnxnxnT)}=0
W ( t + 1 ) = [ ∑ n = 1 N ( t n − μ ( t + 1 ) ) E ( τ n x n T ∣ t n ) ] × [ ∑ n = 1 N E ( τ n x n x n T ∣ t n ) ] − 1 W^{(t+1)} = [\sum_{n=1}^{N} (t_n - \mu^{(t+1)}) E(\tau_n x_n^{T}|t_n)] \times [\sum_{n=1}^{N} E(\tau_n x_n x_n^{T}|t_n)]^{-1} W(t+1)=[n=1∑N(tn−μ(t+1))E(τnxnT∣tn)]×[n=1∑NE(τnxnxnT∣tn)]−1
求解
σ
2
\sigma^2
σ2:
∂
Q
1
∂
σ
2
=
−
N
d
2
σ
2
+
1
σ
4
∑
n
=
1
N
{
E
(
τ
n
∣
t
n
)
∥
t
n
−
μ
∥
2
−
2
E
(
τ
n
x
n
T
∣
t
n
)
W
T
(
t
n
−
μ
)
+
t
r
[
W
E
(
τ
n
x
n
x
n
T
)
W
T
]
}
=
0
\frac{\partial Q_1}{\partial \sigma^2} = -\frac{Nd}{2\sigma^2} + \frac{1}{\sigma^4} \sum_{n=1}^{N} \{ E(\tau_n|t_n)\|t_n-\mu\|^{2} - 2E(\tau_n x_n^{T}|t_n)W^{T}(t_n-\mu) + tr[WE(\tau_n x_n x_n^{T})W^{T}] \} = 0
∂σ2∂Q1=−2σ2Nd+σ41n=1∑N{E(τn∣tn)∥tn−μ∥2−2E(τnxnT∣tn)WT(tn−μ)+tr[WE(τnxnxnT)WT]}=0
σ 2 ( t + 1 ) = 1 N d ∑ n = 1 N { E ( τ n ∣ t n ) ∥ t n − μ ( t + 1 ) ∥ 2 − 2 E ( τ n x n T ∣ t n ) W ( t + 1 ) T ( t n − μ ( t + 1 ) ) + t r [ W ( t + 1 ) E ( τ n x n x n T ) W ( t + 1 ) T ] } \sigma^{2(t+1)} = \frac{1}{Nd} \sum_{n=1}^{N} \{ E(\tau_n|t_n)\|t_n-\mu^{(t+1)} \|^{2} - 2E(\tau_n x_n^{T}|t_n)W^{(t+1)T}(t_n-\mu^{(t+1)}) + tr[W^{(t+1)} E(\tau_n x_n x_n^{T})W^{(t+1)T}] \} σ2(t+1)=Nd1n=1∑N{E(τn∣tn)∥tn−μ(t+1)∥2−2E(τnxnT∣tn)W(t+1)T(tn−μ(t+1))+tr[W(t+1)E(τnxnxnT)W(t+1)T]}
求解 ν \nu ν:
d Q 2 ( ν ) d ν = 1 2 ∑ n = 1 N [ 1 + log ν 2 − ψ ( ν 2 ) + E ( log τ n ∣ t n ) − E ( τ n ∣ t n ) ] = 0 \frac{dQ_2(\nu)}{d \nu} = \frac{1}{2} \sum_{n=1}^{N}[1 + \log\frac{\nu}{2} - \psi(\frac{\nu}{2}) + E(\log\tau_n|t_n) - E(\tau_n|t_n)]=0 dνdQ2(ν)=21n=1∑N[1+log2ν−ψ(2ν)+E(logτn∣tn)−E(τn∣tn)]=0
上述式子不能直接通过反解得到 ν \nu ν的数学表达式,因此需要对该式通过一元搜索得到最佳的零解,在此使用tppca论文中的近似解表达式:
y = − 1 N ∑ n = 1 N [ E ( log ( τ n ∣ t n ) − E ( τ n ∣ t n ) ] y = - \frac{1}{N} \sum_{n=1}^{N} [E(\log(\tau_n|t_n) - E(\tau_n|t_n)] y=−N1n=1∑N[E(log(τn∣tn)−E(τn∣tn)]
ν ( t + 1 ) = 2 y + log y − 1 + 0.0416 × [ 1 + erf ( 0.6594 log ( 2.1971 y + log y − 1 ) ) ] \nu^{(t+1)} = \frac{2}{y+\log y -1} + 0.0416 \times [ 1 + \textrm{erf}(0.6594\log(\frac{2.1971}{y + \log y -1 }))] ν(t+1)=y+logy−12+0.0416×[1+erf(0.6594log(y+logy−12.1971))]
其中, erf ( x ) = ( 2 π ) ∫ 0 x exp ( − t 2 ) d t \textrm{erf}(x) = (\frac{2}{\sqrt{\pi}})\int_{0}^{x} \exp(-t^2)\textrm{d}t erf(x)=(π2)∫0xexp(−t2)dt。