Introduction to the EM Algorithm
A probabilistic model sometimes contains both observed variables and hidden (latent) variables. In that case the parameters cannot be estimated directly by maximum likelihood. The EM algorithm is a maximum likelihood estimation method for models with latent variables.
The EM Algorithm
By convention, $Y$ denotes the observed random variable data and $Z$ the latent random variable data; $Y$ and $Z$ together are called the complete data, and $Y$ alone the incomplete data. Suppose the observed data $Y$ has probability distribution $P(Y|\theta)$, where $\theta$ is the parameter. Then the likelihood function of the incomplete data $Y$ is $P(Y|\theta)$, with log-likelihood $L(\theta)=\log P(Y|\theta)$. If the joint probability distribution of $Y$ and $Z$ is $P(Y,Z|\theta)$, the log-likelihood of the complete data is $\log P(Y,Z|\theta)$.
The basic idea of the EM algorithm is to alternate two steps: first compute an expectation (the E step), then maximize it (the M step), thereby increasing the likelihood function.
Algorithm:
Input: observed variable data $Y$, latent variable data $Z$, joint distribution $P(Y,Z|\theta)$, conditional distribution $P(Z|Y,\theta)$
Output: model parameter $\theta$
(1) Choose an initial parameter value $\theta^{(0)}$ and start iterating.
(2) E step: let $\theta^{(i)}$ be the parameter estimate at the $i$-th iteration; at the $(i+1)$-th iteration, the E step computes
$$Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z|\theta)\,|\,Y,\theta^{(i)}]=\sum_{Z}\log P(Y,Z|\theta)\,P(Z|Y,\theta^{(i)})$$
(3) M step: find the $\theta$ that maximizes $Q(\theta,\theta^{(i)})$, which becomes the parameter estimate at the $(i+1)$-th iteration:
$$\theta^{(i+1)}=\arg\max_{\theta}Q(\theta,\theta^{(i)})$$
(4) Repeat (2) and (3) until convergence.
Note the definition $Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z|\theta)\,|\,Y,\theta^{(i)}]$; this is called the $Q$ function.
The iteration stops when
$$\|\theta^{(i+1)}-\theta^{(i)}\| <\epsilon_1 \quad\text{or}\quad \|Q(\theta^{(i+1)},\theta^{(i)})-Q(\theta^{(i)},\theta^{(i)})\|<\epsilon_2.$$
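As a minimal sketch of this iteration structure, the loop below assumes hypothetical `e_step`, `m_step`, and `q` callables supplied by the concrete model (they are not defined in the text); it only illustrates the E/M alternation and the stopping rule above.

```python
import numpy as np

def em(theta0, e_step, m_step, q, eps1=1e-6, eps2=1e-6, max_iter=1000):
    """Generic EM loop. `e_step(theta)` returns the expectations needed
    by the M step, `m_step(stats)` returns the next parameter vector,
    and `q(theta, theta_i)` evaluates the Q function; all three are
    supplied by the concrete model."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta)       # E step: expectations under P(Z|Y, theta)
        theta_new = m_step(stats)   # M step: maximize Q(., theta)
        # Stopping rule from the text: small parameter change or small Q change.
        if (np.linalg.norm(theta_new - theta) < eps1
                or abs(q(theta_new, theta) - q(theta, theta)) < eps2):
            return theta_new
        theta = theta_new
    return theta
```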
Derivation of the EM Algorithm
The log-likelihood is
$$L(\theta)=\log P(Y|\theta)=\log \sum_Z P(Y,Z|\theta)=\log\Big(\sum_Z P(Y|Z,\theta)P(Z|\theta)\Big)$$
We want each new value to satisfy $L(\theta)>L(\theta^{(i)})$, so consider the difference
$$L(\theta)-L(\theta^{(i)})=\log \Big(\sum_Z P(Y|Z,\theta)P(Z|\theta)\Big)-\log P(Y|\theta^{(i)})$$
Using Jensen's inequality,
$$L(\theta)-L(\theta^{(i)})=\log\Big(\sum_Z P(Z|Y,\theta^{(i)})\frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}\Big)-\log P(Y|\theta^{(i)})$$
$$\ge \sum_Z P(Z|Y,\theta^{(i)})\log \frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})}-\log P(Y|\theta^{(i)})$$
$$=\sum_Z P(Z|Y,\theta^{(i)})\log \frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}$$
The last step uses $\sum_Z P(Z|Y,\theta^{(i)})=1$ to move $\log P(Y|\theta^{(i)})$ inside the sum.
Let
$$B(\theta,\theta^{(i)})=L(\theta^{(i)})+\sum_Z P(Z|Y,\theta^{(i)})\log \frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}$$
Then
$$L(\theta)\ge B(\theta,\theta^{(i)})$$
Moreover, taking $\theta=\theta^{(i)}$ gives
$$L(\theta^{(i)})= B(\theta^{(i)},\theta^{(i)})$$
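This equality is worth verifying once. By Bayes' rule $P(Y|Z,\theta^{(i)})P(Z|\theta^{(i)})=P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})$, so at $\theta=\theta^{(i)}$ the fraction inside the logarithm equals $1$ and the sum vanishes:
$$\sum_Z P(Z|Y,\theta^{(i)})\log \frac{P(Y|Z,\theta^{(i)})P(Z|\theta^{(i)})}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}=\sum_Z P(Z|Y,\theta^{(i)})\log 1=0.$$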
Hence $B(\theta,\theta^{(i)})$ is a lower bound of $L(\theta)$ that touches it at $\theta^{(i)}$; any $\theta$ that increases $B(\theta,\theta^{(i)})$ also increases $L(\theta)$. We therefore choose
$$\theta^{(i+1)}=\arg\max_{\theta}B(\theta,\theta^{(i)})$$
$$\theta^{(i+1)}=\arg\max_{\theta}\Big(L(\theta^{(i)})+\sum_Z P(Z|Y,\theta^{(i)})\log \frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big)$$
$$=\arg\max_{\theta}\Big(\sum_Z P(Z|Y,\theta^{(i)})\log \frac{P(Y|Z,\theta)P(Z|\theta)}{P(Z|Y,\theta^{(i)})P(Y|\theta^{(i)})}\Big)$$
$$=\arg\max_{\theta}\Big(\sum_Z P(Z|Y,\theta^{(i)})\log P(Y,Z|\theta)\Big)$$
$$=\arg\max_{\theta}Q(\theta,\theta^{(i)})$$
where terms not depending on $\theta$ are constants and can be dropped.
In other words, the EM algorithm approximates maximum likelihood estimation by successively maximizing this lower bound.
EM in Unsupervised Learning
For training data $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ in which the labels are unobserved, we can regard $x_i$ as the observed variable and $y_i$ as the latent variable; the EM algorithm can then be used to estimate the model parameters.
Convergence of the EM Algorithm
Let $P(Y|\theta)$ be the likelihood function of the observed data, and let $\theta^{(i)}\ (i=1,2,\dots)$ be the parameter estimates produced by the EM algorithm, with corresponding likelihood sequence $P(Y|\theta^{(i)})\ (i=1,2,\dots)$. Then $P(Y|\theta^{(i)})$ is monotonically increasing, i.e.
$$P(Y|\theta^{(i+1)})\ge P(Y|\theta^{(i)})$$
Proof. Since
$$P(Y|\theta)= \frac{P(Y,Z|\theta)}{P(Z|Y,\theta)}$$
taking logarithms gives
$$\log P(Y|\theta)=\log P(Y,Z|\theta)-\log P(Z|Y,\theta)$$
Define
$$Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z|\theta)\,|\,Y,\theta^{(i)}]$$
$$H(\theta,\theta^{(i)})=\sum_Z \log P(Z|Y,\theta)\,P(Z|Y,\theta^{(i)})$$
Taking the expectation of the previous identity with respect to $P(Z|Y,\theta^{(i)})$ (under which $\log P(Y|\theta)$ is constant), the log-likelihood can be written as
$$\log P(Y|\theta)=Q(\theta,\theta^{(i)})-H(\theta,\theta^{(i)})$$
Consider the difference at successive iterates:
$$\log P(Y|\theta^{(i+1)})-\log P(Y|\theta^{(i)})=[Q(\theta^{(i+1)},\theta^{(i)})-Q(\theta^{(i)},\theta^{(i)})]-[H(\theta^{(i+1)},\theta^{(i)})-H(\theta^{(i)},\theta^{(i)})]$$
By the definition of $\theta^{(i+1)}$ as a maximizer of $Q(\theta,\theta^{(i)})$,
$$Q(\theta^{(i+1)},\theta^{(i)})-Q(\theta^{(i)},\theta^{(i)})\ge 0$$
For the second term, by Jensen's inequality,
$$H(\theta^{(i+1)},\theta^{(i)})-H(\theta^{(i)},\theta^{(i)})=\sum_Z\Big(\log \frac{P(Z|Y,\theta^{(i+1)})}{P(Z|Y,\theta^{(i)})}\Big)P(Z|Y,\theta^{(i)})$$
$$\le \log\Big(\sum_Z \frac{P(Z|Y,\theta^{(i+1)})}{P(Z|Y,\theta^{(i)})}P(Z|Y,\theta^{(i)})\Big)=\log\sum_Z P(Z|Y,\theta^{(i+1)})=\log 1=0$$
This completes the proof.
Let $L(\theta)=\log P(Y|\theta)$ be the log-likelihood of the observed data, $\theta^{(i)},\ i=1,2,\dots$ the parameter sequence produced by the EM algorithm, and $L(\theta^{(i)})$ the corresponding log-likelihood sequence. Then:
(1) If $P(Y|\theta)$ is bounded above, $L(\theta^{(i)})$ converges to some value $L^*$.
(2) Under certain conditions on the functions $Q$ and $L$, the limit point $\theta^*$ of the EM sequence is a stationary point of $L(\theta)$.
EM for Learning Gaussian Mixture Models
Gaussian Mixture Models
- Definition: a Gaussian mixture model is a probability distribution of the form
$$P(y|\theta)=\sum_{k=1}^K a_k\,\phi(y|\theta_k)$$
where the $a_k\ge0$ are mixing coefficients with $\sum_{k=1}^K a_k=1$, and $\phi(y|\theta_k)$ is the Gaussian density with $\theta_k=(\mu_k,\sigma_k^2)$:
$$\phi(y|\theta_k)=\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y-\mu_k)^2}{2\sigma_k^2}\Big)$$
called the $k$-th component model.
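As a quick sketch (the function names are mine, not from the text), these two functions evaluate the component density $\phi(y|\theta_k)$ and the mixture density $P(y|\theta)$:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Density phi(y | theta_k) of N(mu, sigma^2), evaluated elementwise."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_pdf(y, a, mu, sigma):
    """Mixture density P(y | theta) = sum_k a_k * phi(y | theta_k).
    a, mu, sigma are length-K arrays; y is a scalar or 1-D array."""
    y = np.atleast_1d(y)[:, None]                          # shape (N, 1)
    comps = gaussian_pdf(y, mu[None, :], sigma[None, :])   # shape (N, K)
    return comps @ a                                       # shape (N,)

# Example: a two-component mixture evaluated at a few points
a = np.array([0.3, 0.7]); mu = np.array([-2.0, 1.5]); sigma = np.array([1.0, 0.5])
print(mixture_pdf(np.array([-2.0, 0.0, 1.5]), a, mu, sigma))
```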
The EM Algorithm for Gaussian Mixture Parameter Estimation
- Deriving the algorithm
- Specify the latent variables and write down the complete-data log-likelihood:
$$\gamma_{jk}=\begin{cases} 1 & \text{observation } j \text{ comes from component } k\\ 0 & \text{otherwise} \end{cases}$$
$$j=1,2,\dots,N;\quad k=1,2,\dots,K$$
Then the complete-data likelihood is
$$P(y,\gamma|\theta)=\prod_{k=1}^K\prod_{j=1}^N\big[a_k\,\phi(y_j|\theta_k)\big]^{\gamma_{jk}}=\prod_{k=1}^K a_k^{n_k}\prod_{j=1}^N\Big[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}}$$
where
$$n_k=\sum_{j=1}^N\gamma_{jk},\qquad \sum_{k=1}^K n_k=N$$
so the complete-data log-likelihood is
$$\log P(y,\gamma|\theta)=\sum_{k=1}^K\Big\{n_k\log a_k +\sum_{j=1}^N\gamma_{jk}\Big[\log \frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{1}{2\sigma_k^2}(y_j-\mu_k)^2\Big] \Big\}$$
- E step of EM: compute $Q$
$$Q(\theta,\theta^{(i)})=E[\log P(y,\gamma|\theta)\,|\,y,\theta^{(i)}]$$
$$=E\Big\{\sum_{k=1}^K\Big\{n_k\log a_k +\sum_{j=1}^N\gamma_{jk}\Big[\log \frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{1}{2\sigma_k^2}(y_j-\mu_k)^2\Big] \Big\}\Big\}$$
$$=\sum_{k=1}^K\Big\{\sum_{j=1}^N E(\gamma_{jk})\log a_k +\sum_{j=1}^N E(\gamma_{jk})\Big[\log \frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{1}{2\sigma_k^2}(y_j-\mu_k)^2\Big] \Big\}$$
The expected responsibility is computed via Bayes' rule:
$$E(\gamma_{jk}|y,\theta)=P(\gamma_{jk}=1|y,\theta)$$
$$=\frac{P(\gamma_{jk}=1,y_j|\theta)}{\sum_{k=1}^K P(\gamma_{jk}=1,y_j|\theta)}=\frac{P(y_j|\gamma_{jk}=1,\theta)P(\gamma_{jk}=1|\theta)}{\sum_{k=1}^K P(y_j|\gamma_{jk}=1,\theta)P(\gamma_{jk}=1|\theta)}$$
$$=\frac{a_k\,\phi(y_j|\theta_k)}{\sum_{k=1}^K a_k\,\phi(y_j|\theta_k)}$$
Writing $\hat{\gamma}_{jk}=E(\gamma_{jk}|y,\theta)$ and $n_k=\sum_{j=1}^N\hat\gamma_{jk}$, we obtain
$$Q(\theta,\theta^{(i)})=\sum_{k=1}^K\Big\{n_k\log a_k +\sum_{j=1}^N\hat\gamma_{jk}\Big[\log \frac{1}{\sqrt{2\pi}}-\log \sigma_k-\frac{1}{2\sigma_k^2}(y_j-\mu_k)^2\Big] \Big\}$$
- M step of EM: solve the maximization $\theta^{(i+1)}=\arg\max_{\theta}Q(\theta,\theta^{(i)})$
Setting the derivative with respect to each parameter to zero (with the constraint $\sum_k a_k=1$ for the $a_k$) gives
$$\hat\mu_k=\frac{\sum_{j=1}^N\hat\gamma_{jk}\,y_j}{\sum_{j=1}^N\hat\gamma_{jk}},\qquad k=1,2,\dots,K$$
$$\hat\sigma_k^2=\frac{\sum_{j=1}^N\hat\gamma_{jk}\,(y_j-\mu_k)^2}{\sum_{j=1}^N\hat\gamma_{jk}},\qquad k=1,2,\dots,K$$
$$\hat a_k=\frac{n_k}{N}=\frac{\sum_{j=1}^N\hat\gamma_{jk}}{N},\qquad k=1,2,\dots,K$$
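For instance, the update for $\hat\mu_k$ follows from a single differentiation of $Q$:
$$\frac{\partial Q}{\partial \mu_k}=\sum_{j=1}^N\hat\gamma_{jk}\,\frac{y_j-\mu_k}{\sigma_k^2}=0 \;\Longrightarrow\; \hat\mu_k=\frac{\sum_{j=1}^N\hat\gamma_{jk}\,y_j}{\sum_{j=1}^N\hat\gamma_{jk}}.$$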
Algorithm
Input: observed data $y_1,y_2,\dots,y_N$; the Gaussian mixture model
Output: the Gaussian mixture model parameters
(1) Choose initial parameter values and start iterating.
(2) E step: using the current model parameters, compute the responsibility of each component model for each observation:
$$\hat\gamma_{jk}=\frac{a_k\,\phi(y_j|\theta_k)}{\sum_{k=1}^K a_k\,\phi(y_j|\theta_k)},\qquad j=1,2,\dots,N;\ k=1,2,\dots,K$$
(3) M step: update the parameters:
$$\hat\mu_k=\frac{\sum_{j=1}^N\hat\gamma_{jk}\,y_j}{\sum_{j=1}^N\hat\gamma_{jk}},\qquad k=1,2,\dots,K$$
$$\hat\sigma_k^2=\frac{\sum_{j=1}^N\hat\gamma_{jk}\,(y_j-\mu_k)^2}{\sum_{j=1}^N\hat\gamma_{jk}},\qquad k=1,2,\dots,K$$
$$\hat a_k=\frac{n_k}{N}=\frac{\sum_{j=1}^N\hat\gamma_{jk}}{N},\qquad k=1,2,\dots,K$$
(4) Repeat steps (2) and (3) until convergence.
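Putting the E and M steps together, here is a compact sketch of this algorithm for a one-dimensional mixture (the initialization scheme and the convergence test on the observed-data log-likelihood are my choices, not from the text):

```python
import numpy as np

def gmm_em(y, K, n_iter=200, tol=1e-8, seed=0):
    """EM for a 1-D Gaussian mixture: returns (a, mu, sigma)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    # Crude initialization: uniform weights, random means, common scale.
    a = np.full(K, 1.0 / K)
    mu = rng.choice(y, size=K, replace=False)
    sigma = np.full(K, np.std(y))
    ll_old = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma_hat[j,k] = a_k phi(y_j) / sum_k a_k phi(y_j)
        dens = np.exp(-(y[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / (np.sqrt(2 * np.pi) * sigma)               # shape (N, K)
        weighted = a * dens
        gamma_hat = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: the closed-form updates derived above
        n_k = gamma_hat.sum(axis=0)
        mu = gamma_hat.T @ y / n_k
        sigma = np.sqrt((gamma_hat * (y[:, None] - mu) ** 2).sum(axis=0) / n_k)
        a = n_k / N
        # Stop when the observed-data log-likelihood stops improving.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - ll_old < tol:
            break
        ll_old = ll
    return a, mu, sigma

# Example: recover the parameters of a synthetic two-component mixture
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(1.5, 0.5, 700)])
print(gmm_em(y, K=2))
```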
Extensions of the EM Algorithm
The Maximization-Maximization Algorithm for the F Function
- Definition: suppose the latent data $Z$ has probability distribution $\hat P(Z)$. Define the function $F(\hat P,\theta)$ of the distribution $\hat P$ and the parameter $\theta$ as
$$F(\hat P,\theta)=E_{\hat P}[\log P(Y,Z|\theta)]+H(\hat P)$$
called the $F$ function, where $H(\hat P)=-E_{\hat P}[\log \hat P(Z)]$ is the entropy of the distribution $\hat P(Z)$.
For fixed $\theta$ there is a unique distribution $\hat P_\theta$ that maximizes $F(\hat P,\theta)$, given by
$$\hat P_\theta(Z)=P(Z|Y,\theta)$$
and $\hat P_\theta$ varies continuously with $\theta$.
Proof. The Lagrangian is
$$L=E_{\hat P}[\log P(Y,Z|\theta)]-E_{\hat P}[\log \hat P(Z)]+\lambda\Big(1-\sum_Z \hat P(Z)\Big)$$
Setting the derivative with respect to $\hat P(Z)$ to zero,
$$\frac{\partial L}{\partial \hat P(Z)}=\log P(Y,Z|\theta)-\log \hat P(Z)-1-\lambda=0$$
which gives
$$\lambda=\log P(Y,Z|\theta)-\log \hat P_\theta(Z)-1$$
so $\hat P_\theta(Z)\propto P(Y,Z|\theta)$ with the same constant $e^{1+\lambda}$ for every $Z$; the normalization constraint $\sum_Z \hat P(Z)=1$ then forces
$$\hat P_\theta(Z)=P(Z|Y,\theta)$$
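Substituting $\hat P_\theta(Z)=P(Z|Y,\theta)$ back into $F$ shows why maximizing $F$ maximizes the likelihood:
$$F(\hat P_\theta,\theta)=\sum_Z P(Z|Y,\theta)\log P(Y,Z|\theta)-\sum_Z P(Z|Y,\theta)\log P(Z|Y,\theta)=\sum_Z P(Z|Y,\theta)\log \frac{P(Y,Z|\theta)}{P(Z|Y,\theta)}=\log P(Y|\theta).$$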
Let $L(\theta)=\log P(Y|\theta)$ be the log-likelihood of the observed data and $\theta^{(i)}$ the parameter estimates produced by the EM algorithm. If $F(\hat P,\theta)$ has a local maximum (respectively, a global maximum) at $\theta^*$, then $L(\theta)$ also has a local maximum (respectively, a global maximum) at $\theta^*$.
One iteration of the EM algorithm can be realized by the maximization-maximization algorithm on the $F$ function:
(1) For fixed $\theta^{(i)}$, find the $\hat P^{(i+1)}$ that maximizes $F(\hat P,\theta^{(i)})$.
(2) For fixed $\hat P^{(i+1)}$, find the $\theta^{(i+1)}$ that maximizes $F(\hat P^{(i+1)},\theta)$.
The GEM Algorithm
Algorithm 1
Input: observed data; the $F$ function
Output: model parameters
(1) Initialize the parameter $\theta^{(0)}$ and start iterating.
(2) With $\theta^{(i)}$ fixed, maximize $F$ over $\hat P$ to obtain $\hat P^{(i+1)}$.
(3) With $\hat P^{(i+1)}$ fixed, maximize $F$ over $\theta$ to obtain $\theta^{(i+1)}$.
(4) Repeat (2) and (3) until convergence.
Algorithm 2
Input: observed data; the $Q$ function
Output: model parameters
(1) Initialize the parameter $\theta^{(0)}$ and start iterating.
(2) Compute
$$Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z|\theta)\,|\,Y,\theta^{(i)}]=\sum_Z P(Z|Y,\theta^{(i)})\log P(Y,Z|\theta)$$
(3) Find a $\theta^{(i+1)}$ such that
$$Q(\theta^{(i+1)},\theta^{(i)})>Q(\theta^{(i)},\theta^{(i)})$$
(it need not be the maximizer; any improvement of $Q$ suffices).
(4) Repeat (2) and (3) until convergence.
Algorithm 3
Input: observed data; the $Q$ function
Output: model parameters
(1) Initialize the parameter $\theta^{(0)}=(\theta_1^{(0)},\dots,\theta_d^{(0)})$ and start iterating.
(2) Compute
$$Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z|\theta)\,|\,Y,\theta^{(i)}]=\sum_Z P(Z|Y,\theta^{(i)})\log P(Y,Z|\theta)$$
(3) Find $\theta^{(i+1)}$ by $d$ conditional maximizations: for $j=1,\dots,d$, maximize $Q$ over the single component $\theta_j$ while keeping the other components fixed (a sketch follows this list). This still ensures
$$Q(\theta^{(i+1)},\theta^{(i)})>Q(\theta^{(i)},\theta^{(i)})$$
(4) Repeat (2) and (3) until convergence.
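A minimal sketch of the coordinate-wise M step in Algorithm 3, assuming a hypothetical Q evaluator `q(theta, theta_i)` and box bounds for each component (neither comes from the text):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gem_m_step(q, theta_i, bounds):
    """One GEM M step: d one-dimensional maximizations of Q, each over a
    single component theta[j] with the other components held fixed."""
    theta = np.array(theta_i, dtype=float)
    for j, (lo, hi) in enumerate(bounds):
        def neg_q(x, j=j):
            trial = theta.copy()
            trial[j] = x
            return -q(trial, theta_i)   # minimize -Q == maximize Q
        res = minimize_scalar(neg_q, bounds=(lo, hi), method="bounded")
        theta[j] = res.x                # accept the coordinate update
    return theta
```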
The EM algorithm is mainly used for parameter estimation in probabilistic models with latent variables. This article has covered the motivation for EM, its derivation, its applications to unsupervised learning and Gaussian mixture models, and its extensions via the maximization-maximization algorithm on the F function and the GEM algorithm. By iterating the E and M steps, EM steadily increases the model's likelihood until convergence.