The EM algorithm (Expectation-Maximization) estimates model parameters iteratively via maximum likelihood (MLE). As the name suggests, it alternates two steps: the E-step computes an expectation over the latent variables, and the M-step finds the parameters that maximize that expectation. The ultimate objective, written as a formula, is:
$$
L(\theta)=\sum_{i}\log p(x_i;\theta) \\
\theta = \arg\max_{\theta}L(\theta)
$$
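As a quick numeric sketch of this objective (the coin-flip data and the grid search are illustrative assumptions, not from the text): maximizing $\sum_i \log p(x_i;\theta)$ for Bernoulli observations recovers the sample mean.

```python
import numpy as np

# Illustrative Bernoulli data; theta is the probability of observing 1.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def log_likelihood(theta, x):
    # L(theta) = sum_i log p(x_i; theta) under a Bernoulli model
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Grid search over theta; the maximizer is the MLE
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([log_likelihood(t, x) for t in grid])]
print(theta_hat)  # the sample mean, 6/8 = 0.75
```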
With an unobserved latent variable $z$ in the model, this becomes
$$
L(\theta)=\sum_{i}\log\sum_z p(x_i,z;\theta)
$$
The inner sum is really an expectation over $z$. Suppose $z$ follows some distribution with probability $Q_i(z)$, and let $g(z)=\frac{p(x_i,z;\theta)}{Q_i(z)}$ be the quantity whose expectation we take. Then $L(\theta)$ can be rewritten as:
$$
L(\theta)=\sum_{i}\log E[g(z)] \\
L(\theta)=\sum_{i}\log\sum_z Q_i(z)\,g(z) \\
L(\theta)=\sum_{i}\log\sum_z Q_i(z)\frac{p(x_i,z;\theta)}{Q_i(z)}
$$
By Jensen's inequality, a concave function $f$ (and $\log$ is concave) satisfies $f(E[z]) \ge E[f(z)]$, with equality exactly when $z$ is a constant, so that $E[z]=z$. This lets us move the $\log$ inside the sum:
$$
L(\theta) \ge \sum_{i}\sum_z Q_i(z)\log\frac{p(x_i,z;\theta)}{Q_i(z)}=J(z,\theta)
$$
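This inequality is easy to verify numerically. A minimal check, with an arbitrary made-up distribution $Q$ and positive values $g(z)$ (both chosen only for illustration): the log of the expectation dominates the expectation of the log, and the two coincide when $g$ is constant.

```python
import numpy as np

# Arbitrary illustrative distribution Q over z and positive values g(z)
Q = np.array([0.2, 0.5, 0.3])   # sums to 1
g = np.array([0.4, 1.5, 2.0])   # plays the role of p(x_i, z; theta) / Q_i(z)

lhs = np.log(np.sum(Q * g))      # log of the expectation: log E[g(z)]
rhs = np.sum(Q * np.log(g))      # expectation of the log:  E[log g(z)]
print(lhs >= rhs)                # True: Jensen's inequality for concave log

# Equality case: g constant means E[g(z)] = g, so the bound is tight
g_const = np.full(3, 1.3)
print(np.isclose(np.log(np.sum(Q * g_const)), np.sum(Q * np.log(g_const))))
```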
So the log-likelihood $L(\theta)$ has a lower bound $J(z,\theta)$. Each iteration raises this lower bound, which in turn pushes $L(\theta)$ up, so $L(\theta)$ is non-decreasing. Since $L(\theta)$ is bounded above, a non-decreasing sequence of its values must converge; this is why the algorithm converges.
One question remains: how should $Q_i(z)$ be chosen? The equality condition of Jensen's inequality requires
$$
\frac{p(x_i,z;\theta)}{Q_i(z)}=c
$$
where $c$ is a constant independent of $z$. Since $\sum_{z}Q_i(z)=1$, summing both sides over $z$ gives $\sum_z p(x_i,z;\theta)=c$, and this marginal is just
$$
\sum_z p(x_i,z;\theta)=p(x_i;\theta)
$$
Then
$$
Q_i(z)=\frac{p(x_i,z;\theta)}{\sum_z p(x_i,z;\theta)} \\
Q_i(z)=\frac{p(x_i,z;\theta)}{p(x_i;\theta)} \\
Q_i(z)=p(z|x_i;\theta)
$$
This settles how to choose $Q_i(z)$: it is exactly the posterior probability of $z$ given the data and the current parameters.
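This posterior can be computed by normalizing the joint probabilities over $z$. A minimal sketch for a two-component Gaussian mixture (the parameter values and data points are assumptions for illustration):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed current parameters theta: mixing weights, means, stds
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.0])

x = np.array([-1.8, 0.5, 2.9])  # illustrative data points

# Joint p(x_i, z; theta) = pi_z * N(x_i; mu_z, sigma_z^2)
joint = pi[None, :] * normal_pdf(x[:, None], mu[None, :], sigma[None, :])
# Posterior Q_i(z) = joint / sum_z joint = p(z | x_i; theta)
Q = joint / joint.sum(axis=1, keepdims=True)
print(Q)  # each row is a distribution over z and sums to 1
```

Each row of `Q` gives the responsibilities of the two components for one data point; points near a component's mean get almost all of their posterior mass from that component.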
So, written as formulas, the EM algorithm is:
- E-step: compute the expectation over the latent variable, whose distribution is $Q_i(z)=p(z|x_i;\theta)$,
  which gives the lower bound on the expectation
  $$
  J(z,\theta)=\sum_{i}\sum_z Q_i(z)\log\frac{p(x_i,z;\theta)}{Q_i(z)}
  $$
- M-step: find the $\theta$ that maximizes the latent-variable expectation and use it for the next iteration; maximizing the lower bound maximizes the expectation:
  $$
  \theta = \arg\max_{\theta}J(z,\theta)
  $$
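Putting the two steps together, here is a minimal end-to-end sketch for a two-component Gaussian mixture (the synthetic data, initialization, and iteration count are all assumptions; for this model the M-step has the closed form used below). Tracking $L(\theta)$ across iterations also confirms the monotonicity argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two Gaussian clusters (illustrative, not from the text)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Initial guesses for theta = (pi, mu, sigma)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

log_liks = []
for _ in range(50):
    # E-step: Q_i(z) = p(z | x_i; theta), the posterior responsibilities
    joint = pi[None, :] * normal_pdf(x[:, None], mu[None, :], sigma[None, :])
    Q = joint / joint.sum(axis=1, keepdims=True)

    # L(theta) = sum_i log sum_z p(x_i, z; theta): should never decrease
    log_liks.append(np.log(joint.sum(axis=1)).sum())

    # M-step: theta = argmax J(z, theta); closed form for a Gaussian mixture
    Nk = Q.sum(axis=0)
    pi = Nk / len(x)
    mu = (Q * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((Q * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk)

print(mu)  # approaches the true means -2 and 3
```

Note that the log-likelihood is evaluated with the parameters that produced `joint`, before the M-step update, so `log_liks` records $L(\theta_t)$ for each iterate and should be non-decreasing.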