The Q Function
$$
\begin{aligned}
Q\left(\theta, \theta^{(i)}\right) &= E_Z\left[\log P(Y, Z \mid \theta) \mid Y, \theta^{(i)}\right] \\
&= \sum_Z \log P(Y, Z \mid \theta) \cdot P\left(Z \mid Y, \theta^{(i)}\right)
\end{aligned}
$$
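The Q function is a central object in the EM algorithm. Its full name is the "expected complete-data log-likelihood". The E-step computes it as the expectation of the complete-data log-likelihood, taken over the latent variable $Z$ given the observations $Y$ and the current parameters $\theta^{(i)}$, so that the M-step can maximize it to obtain the next parameter estimate.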
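An earlier article (Derivation of the Parameters of the Three-Coin Model with the EM Algorithm) walked through the parameter derivation for the three-coin model in Professor Li Hang's 《统计学习方法》 (Statistical Learning Methods). The EM algorithm used there was expanded directly from a Q function, and for reasons of space that article did not show where the Q function comes from. This article supplements both the earlier article and the derivation on page 179, Chapter 9 of the book, by deriving the Q function in detail.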
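To make the definition of Q concrete, here is a minimal sketch that evaluates it for the three-coin model. It assumes the usual parameterization $\theta = (\pi, p, q)$ (coin A shows heads with probability $\pi$; coin B is then tossed with heads probability $p$, otherwise coin C with heads probability $q$), independent observations so that Q is a sum over tosses, and an illustrative data set. The function names `joint`, `posterior` and `q_function` are hypothetical helpers and do not come from the original article.

```python
import math

def joint(y, z, theta):
    """P(y, z | theta) for one toss; z = 1 means coin A showed heads (assumed encoding)."""
    pi, p, q = theta
    if z == 1:
        return pi * p**y * (1 - p)**(1 - y)
    return (1 - pi) * q**y * (1 - q)**(1 - y)

def posterior(y, z, theta):
    """P(z | y, theta), obtained from the joint by Bayes' rule."""
    return joint(y, z, theta) / (joint(y, 0, theta) + joint(y, 1, theta))

def q_function(theta, theta_i, ys):
    """Q(theta, theta_i) = sum_j sum_z P(z | y_j, theta_i) * log P(y_j, z | theta)."""
    return sum(
        posterior(y, z, theta_i) * math.log(joint(y, z, theta))
        for y in ys
        for z in (0, 1)
    )

ys = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]   # illustrative observations (assumed)
theta_i = (0.4, 0.6, 0.7)             # current estimate (pi, p, q), assumed
print(q_function((0.5, 0.5, 0.5), theta_i, ys))
```

With $\theta = (0.5, 0.5, 0.5)$ every joint probability equals $0.25$, so the posterior weights sum to one per toss and Q reduces to $10 \log 0.25$, which is a convenient sanity check.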
Derivation of the Q Function
The log-likelihood function of the parameter $\theta$ is

$$
\begin{aligned}
L(\theta) &= \log P(Y \mid \theta) \\
&= \log \frac{P(Y, \theta)}{P(\theta)} \\
&= \log \frac{\sum_Z P(Y, \theta, Z)}{P(\theta)} \\
&= \log \sum_Z \frac{P(Y, \theta, Z)}{P(\theta)} = \log \sum_Z P(Y, Z \mid \theta) \\
&= \log \sum_Z \frac{P(Y, Z, \theta)}{P(Z, \theta)} \cdot \frac{P(Z, \theta)}{P(\theta)}
\end{aligned}
$$
that is,

$$
L(\theta) = \log \sum_Z P(Y \mid Z, \theta) \cdot P(Z \mid \theta)
$$
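As a worked instance of this formula, consider a single toss $y \in \{0, 1\}$ of the three-coin model. The symbols $\pi$, $p$, $q$ below are the usual parameters of that model and are an assumption here, since this article does not define them: $Z = 1$ means coin A showed heads (probability $\pi$), after which coin B is tossed with heads probability $p$; otherwise coin C is tossed with heads probability $q$.

$$
\begin{aligned}
P(Z = 1 \mid \theta) &= \pi, & P(y \mid Z = 1, \theta) &= p^{y}(1-p)^{1-y}, \\
P(Z = 0 \mid \theta) &= 1 - \pi, & P(y \mid Z = 0, \theta) &= q^{y}(1-q)^{1-y},
\end{aligned}
$$

$$
L(\theta) = \log \left[ \pi\, p^{y}(1-p)^{1-y} + (1-\pi)\, q^{y}(1-q)^{1-y} \right]
$$

The sum over $Z$ sits inside the logarithm, which is what makes direct maximization awkward and motivates the lower bound constructed in the rest of the derivation.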
Suppose the parameter value at iteration $i$ is $\theta^{(i)}$. We want the update to increase the log-likelihood, i.e. $L(\theta) > L\left(\theta^{(i)}\right)$,
so we consider the difference

$$
L(\theta) - L\left(\theta^{(i)}\right) = \log \sum_Z P(Y \mid Z, \theta) \cdot P(Z \mid \theta) - \log P\left(Y \mid \theta^{(i)}\right)
$$
Multiplying and dividing each summand of the first term by $P\left(Z \mid Y, \theta^{(i)}\right)$ gives

$$
L(\theta) - L\left(\theta^{(i)}\right) = \log \left( \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)} \right) - \log P\left(Y \mid \theta^{(i)}\right)
$$
Because $\sum_Z P\left(Z \mid Y, \theta^{(i)}\right) = 1$, the second term can be multiplied by this sum without changing its value:

$$
L(\theta) - L\left(\theta^{(i)}\right) = \log \left( \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)} \right) - \log P\left(Y \mid \theta^{(i)}\right) \cdot \sum_Z P\left(Z \mid Y, \theta^{(i)}\right)
$$
Now apply Jensen's inequality,

$$
\log \sum_j \lambda_j \cdot y_j \geqslant \sum_j \lambda_j \cdot \log y_j, \qquad \text{where } \lambda_j \geqslant 0,\ \sum_j \lambda_j = 1
$$
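The inequality can be checked numerically. The sketch below uses a hypothetical helper `jensen_gap` and arbitrary illustrative weights and values; it computes the left-hand side minus the right-hand side, which should never be negative and should vanish when all $y_j$ are equal.

```python
import math

def jensen_gap(weights, values):
    """Return log(sum_j w_j * y_j) - sum_j w_j * log(y_j); non-negative by Jensen's inequality."""
    lhs = math.log(sum(w * y for w, y in zip(weights, values)))
    rhs = sum(w * math.log(y) for w, y in zip(weights, values))
    return lhs - rhs

weights = [0.2, 0.3, 0.5]   # lambda_j >= 0, summing to 1 (illustrative)
values = [1.0, 4.0, 9.0]    # y_j > 0 (illustrative)
print(jensen_gap(weights, values))            # positive: the y_j differ
print(jensen_gap(weights, [2.0, 2.0, 2.0]))   # zero up to rounding: all y_j equal
```

In the derivation, the weights $\lambda_j$ are played by $P\left(Z \mid Y, \theta^{(i)}\right)$ and the values $y_j$ by the fraction inside the sum.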
with $\lambda_Z = P\left(Z \mid Y, \theta^{(i)}\right)$, which yields

$$
\begin{aligned}
L(\theta) - L\left(\theta^{(i)}\right) &\geqslant \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)} - \log P\left(Y \mid \theta^{(i)}\right) \cdot \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \\
&= \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \left[ \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right)} - \log P\left(Y \mid \theta^{(i)}\right) \right] \\
&= \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}
\end{aligned}
$$
So at this point we have

$$
L(\theta) - L\left(\theta^{(i)}\right) \geqslant \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}
$$
that is,

$$
L(\theta) \geqslant L\left(\theta^{(i)}\right) + \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}
$$
Define

$$
B\left(\theta, \theta^{(i)}\right) = L\left(\theta^{(i)}\right) + \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)}
$$
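Then $B\left(\theta, \theta^{(i)}\right)$ is a lower bound on $L(\theta)$, and the bound is tight at the current estimate: $B\left(\theta^{(i)}, \theta^{(i)}\right) = L\left(\theta^{(i)}\right)$. Consequently any $\theta$ that increases $B\left(\theta, \theta^{(i)}\right)$ above this value also increases $L(\theta)$, and to raise $L(\theta)$ as much as the bound allows we choose the $\theta$ that maximizes $B\left(\theta, \theta^{(i)}\right)$.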
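The lower-bound property can be verified numerically. The sketch below evaluates $B$ and $L$ for the three-coin model under the assumptions of independent tosses (so both quantities are sums over observations), the usual parameterization $\theta = (\pi, p, q)$, and illustrative data. The helper names `joint`, `log_lik` and `lower_bound` are hypothetical.

```python
import math

def joint(y, z, theta):
    """P(y, z | theta) for one toss; z = 1 means coin A showed heads (assumed encoding)."""
    pi, p, q = theta
    if z == 1:
        return pi * p**y * (1 - p)**(1 - y)
    return (1 - pi) * q**y * (1 - q)**(1 - y)

def log_lik(theta, ys):
    """L(theta) = sum_j log sum_z P(y_j, z | theta)."""
    return sum(math.log(joint(y, 0, theta) + joint(y, 1, theta)) for y in ys)

def lower_bound(theta, theta_i, ys):
    """B(theta, theta_i), accumulated over independent observations."""
    total = log_lik(theta_i, ys)
    for y in ys:
        evidence = joint(y, 0, theta_i) + joint(y, 1, theta_i)   # P(y | theta_i)
        for z in (0, 1):
            post = joint(y, z, theta_i) / evidence               # P(z | y, theta_i)
            total += post * math.log(joint(y, z, theta) / (post * evidence))
    return total

ys = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]   # illustrative observations (assumed)
theta_i = (0.4, 0.6, 0.7)             # current estimate, assumed
for theta in [(0.3, 0.5, 0.6), (0.5, 0.6, 0.6), (0.7, 0.2, 0.9), theta_i]:
    # B never exceeds L, and the two coincide when theta == theta_i
    print(theta, lower_bound(theta, theta_i, ys), log_lik(theta, ys))
```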
Our goal is therefore

$$
\theta^{(i+1)} = \underset{\theta}{\operatorname{argmax}}\ B\left(\theta, \theta^{(i)}\right)
$$
or, written out,

$$
\theta^{(i+1)} = \underset{\theta}{\operatorname{argmax}} \left[ L\left(\theta^{(i)}\right) + \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \frac{P(Y \mid Z, \theta) \cdot P(Z \mid \theta)}{P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)} \right]
$$
The three quantities $L\left(\theta^{(i)}\right)$, $P\left(Z \mid Y, \theta^{(i)}\right)$ and $P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right)$ do not depend on $\theta$, so they can be treated as constants in the maximization.
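Moreover, $P\left(Z \mid Y, \theta^{(i)}\right) \cdot P\left(Y \mid \theta^{(i)}\right) > 0$, so removing this factor from the denominator inside the logarithm only shifts the objective by a term that is constant in $\theta$ and does not affect the maximizer. Note that the factor $P\left(Z \mid Y, \theta^{(i)}\right)$ in front of the logarithm cannot be dropped, because it is the coefficient of each term in the sum $\sum_Z$.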
Hence

$$
\theta^{(i+1)} = \underset{\theta}{\operatorname{argmax}} \left[ \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \big( P(Y \mid Z, \theta) \cdot P(Z \mid \theta) \big) \right]
$$
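A quick numerical check that the dropped terms really are constant in $\theta$: the sketch below computes both the full bound $B$ and the simplified objective for several candidate parameters of the three-coin model and compares them. It assumes independent tosses, the usual parameterization $\theta = (\pi, p, q)$, illustrative data, and hypothetical helper names (`joint`, `posterior`, `simplified_objective`, `lower_bound`).

```python
import math

def joint(y, z, theta):
    """P(y, z | theta) = P(y | z, theta) * P(z | theta) for one toss of the three-coin model."""
    pi, p, q = theta
    if z == 1:
        return pi * p**y * (1 - p)**(1 - y)
    return (1 - pi) * q**y * (1 - q)**(1 - y)

def posterior(y, z, theta):
    """P(z | y, theta) by Bayes' rule."""
    return joint(y, z, theta) / (joint(y, 0, theta) + joint(y, 1, theta))

def simplified_objective(theta, theta_i, ys):
    """sum_j sum_z P(z | y_j, theta_i) * log(P(y_j | z, theta) * P(z | theta))."""
    return sum(posterior(y, z, theta_i) * math.log(joint(y, z, theta))
               for y in ys for z in (0, 1))

def lower_bound(theta, theta_i, ys):
    """Full bound B(theta, theta_i), including the terms that do not depend on theta."""
    total = sum(math.log(joint(y, 0, theta_i) + joint(y, 1, theta_i)) for y in ys)
    for y in ys:
        for z in (0, 1):
            # P(z | y, theta_i) * P(y | theta_i) equals the joint P(y, z | theta_i)
            total += posterior(y, z, theta_i) * math.log(
                joint(y, z, theta) / joint(y, z, theta_i))
    return total

ys = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]   # illustrative observations (assumed)
theta_i = (0.4, 0.6, 0.7)             # current estimate, assumed
candidates = [(0.3, 0.5, 0.6), (0.5, 0.6, 0.6), (0.4, 0.6, 0.7), (0.7, 0.2, 0.9)]

# The difference B - (simplified objective) is the same for every candidate theta.
offsets = [lower_bound(t, theta_i, ys) - simplified_objective(t, theta_i, ys)
           for t in candidates]
print(offsets)

best_by_bound = max(candidates, key=lambda t: lower_bound(t, theta_i, ys))
best_by_simplified = max(candidates, key=lambda t: simplified_objective(t, theta_i, ys))
print(best_by_bound == best_by_simplified)
```

Because the two objectives differ by a constant, they rank every candidate identically, so maximizing either one selects the same $\theta$.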
Let

$$
Q\left(\theta, \theta^{(i)}\right) = \sum_Z P\left(Z \mid Y, \theta^{(i)}\right) \cdot \log \big( P(Y \mid Z, \theta) \cdot P(Z \mid \theta) \big)
$$
so that

$$
\theta^{(i+1)} = \underset{\theta}{\operatorname{argmax}}\ Q\left(\theta, \theta^{(i)}\right)
$$
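Here $Q\left(\theta, \theta^{(i)}\right)$ is the so-called Q function. Since $P(Y \mid Z, \theta) \cdot P(Z \mid \theta) = P(Y, Z \mid \theta)$, it agrees with the definition given at the beginning of this article.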
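The resulting iteration can be sketched end to end for the three-coin model. The closed-form M-step updates used below are the standard ones for this model and are not derived in this article, so treat them as an assumption here; the data, the starting point and the helper names `em_step` and `log_lik` are likewise illustrative.

```python
import math

def log_lik(theta, ys):
    """L(theta) = sum_j log P(y_j | theta) for the three-coin model."""
    pi, p, q = theta
    return sum(math.log(pi * p**y * (1 - p)**(1 - y)
                        + (1 - pi) * q**y * (1 - q)**(1 - y)) for y in ys)

def em_step(theta, ys):
    """One EM iteration: E-step posteriors, then closed-form maximization of Q."""
    pi, p, q = theta
    # E-step: mu[j] = P(z_j = 1 | y_j, theta)
    mu = []
    for y in ys:
        a = pi * p**y * (1 - p)**(1 - y)
        b = (1 - pi) * q**y * (1 - q)**(1 - y)
        mu.append(a / (a + b))
    # M-step: standard closed-form maximizers of Q for this model (assumed, not derived here)
    new_pi = sum(mu) / len(ys)
    new_p = sum(m * y for m, y in zip(mu, ys)) / sum(mu)
    new_q = sum((1 - m) * y for m, y in zip(mu, ys)) / sum(1 - m for m in mu)
    return (new_pi, new_p, new_q)

ys = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]   # illustrative observations (assumed)
theta = (0.4, 0.6, 0.7)               # illustrative starting point (assumed)
history = [log_lik(theta, ys)]
for _ in range(10):
    theta = em_step(theta, ys)
    history.append(log_lik(theta, ys))

print(theta)
print(history)   # the log-likelihood should never decrease from one iteration to the next
```

The monotone behaviour of `history` is exactly what the lower-bound argument guarantees: each step maximizes a bound that touches $L$ at the current estimate.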
References
[1] EM算法求解三硬币模型参数推导 (Derivation of the Parameters of the Three-Coin Model with the EM Algorithm)