EM Algorithm Explained

The EM Algorithm

The EM algorithm is a maximum-likelihood estimation method for the parameters of probabilistic models that contain latent variables.

Let $Y$ denote the observed-variable data, $Z$ the latent-variable data, and $\theta$ the parameters to be estimated. $Y$ and $Z$ together are called the complete data, while the observed data $Y$ alone are called the incomplete data. If the distribution of $Y$ is $P(Y\mid\theta)$, the incomplete-data log-likelihood is $\log P(Y\mid\theta)$; if the joint distribution of $Y$ and $Z$ is $P(Y,Z\mid\theta)$, the complete-data log-likelihood is $\log P(Y,Z\mid\theta)$.

For a probabilistic model with latent variables, the goal is to maximize the log-likelihood of the observed data $Y$ with respect to the model parameters $\theta$, i.e. to maximize
$$L(\theta)=\log P(Y\mid\theta)=\log\left(\sum_Z P(Y\mid Z,\theta)P(Z\mid\theta)\right)$$
The difficulty is that this expression involves unobserved data and a logarithm of a sum. The EM algorithm maximizes $L(\theta)$ iteratively. Suppose the estimate of $\theta$ after the $i$-th iteration is $\theta^{(i)}$. We want the new estimate $\theta$ to increase the likelihood, i.e. $L(\theta)>L(\theta^{(i)})$, and to approach the maximum step by step. To this end, consider the difference
$$L(\theta)-L(\theta^{(i)})=\log\left(\sum_Z P(Y\mid Z,\theta)P(Z\mid\theta)\right)-\log P(Y\mid\theta^{(i)})$$
Applying Jensen's inequality ($\log\sum_j\lambda_j y_j\geq\sum_j\lambda_j\log y_j$ for $\lambda_j\geq0$, $\sum_j\lambda_j=1$), and noting for the final step that $\sum_Z P(Z\mid Y,\theta^{(i)})=1$:
$$\begin{aligned} L(\theta)-L(\theta^{(i)}) &=\log\left(\sum_Z P(Z\mid Y,\theta^{(i)})\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}\right)-\log P(Y\mid\theta^{(i)})\\ &\geq\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}-\log P(Y\mid\theta^{(i)})\\ &=\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}-\sum_Z P(Z\mid Y,\theta^{(i)})\log P(Y\mid\theta^{(i)})\\ &=\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})} \end{aligned}$$

Define
$$B(\theta,\theta^{(i)})=L(\theta^{(i)})+\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}$$
Then
$$L(\theta)\geq B(\theta,\theta^{(i)})$$
with equality at $\theta=\theta^{(i)}$, i.e. $L(\theta^{(i)})=B(\theta^{(i)},\theta^{(i)})$. Hence any $\theta$ that increases $B(\theta,\theta^{(i)})$ also increases $L(\theta)$. To make the increase in $L(\theta)$ as large as possible, we choose $\theta^{(i+1)}$ to maximize $B(\theta,\theta^{(i)})$:
$$\begin{aligned} \theta^{(i+1)} &=\arg\max_\theta B(\theta,\theta^{(i)})\\ &=\arg\max_\theta\left(L(\theta^{(i)})+\sum_Z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}\right)\\ &=\arg\max_\theta\sum_Z P(Z\mid Y,\theta^{(i)})\log\bigl(P(Y\mid Z,\theta)P(Z\mid\theta)\bigr)\\ &=\arg\max_\theta\sum_Z P(Z\mid Y,\theta^{(i)})\log P(Y,Z\mid\theta)\\ &=\arg\max_\theta E_Z[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}] \end{aligned}$$
In the third line, $L(\theta^{(i)})$ and the denominator have been dropped because they do not depend on $\theta$.

Define the $Q$ function
$$Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}]$$
that is, the expectation of the complete-data log-likelihood $\log P(Y,Z\mid\theta)$ with respect to the unobserved data $Z$, given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$. Therefore
$$\theta^{(i+1)}=\arg\max_\theta Q(\theta,\theta^{(i)})$$
The EM algorithm (procedure)

Input: observed data $Y$, latent-variable data $Z$, joint distribution $P(Y,Z\mid\theta)$, conditional distribution $P(Z\mid Y,\theta)$

Output: model parameters $\theta$

(1) Choose initial parameter values $\theta^{(0)}$ and begin iterating.

(2) E-step: compute $Q(\theta,\theta^{(i)})=\sum_Z P(Z\mid Y,\theta^{(i)})\log P(Y,Z\mid\theta)$

(3) M-step: compute $\theta^{(i+1)}=\arg\max_\theta Q(\theta,\theta^{(i)})$

(4) Repeat steps (2) and (3) until convergence.
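As a minimal sketch of this loop in Python (the callables `e_step`, `m_step`, and `log_likelihood` and their interfaces are placeholders introduced here for illustration, not part of the algorithm statement above), the driver alternates the two steps and stops when the observed-data log-likelihood stops improving:

```python
from typing import Any, Callable

def em(theta0: Any,
       e_step: Callable[[Any], Any],
       m_step: Callable[[Any], Any],
       log_likelihood: Callable[[Any], float],
       tol: float = 1e-8,
       max_iter: int = 1000) -> Any:
    """Generic EM driver.

    e_step(theta) -> sufficient statistics of P(Z | Y, theta) (what Q needs);
    m_step(stats) -> argmax_theta Q(theta, theta_i).
    """
    theta = theta0
    ll_old = log_likelihood(theta)
    for _ in range(max_iter):
        stats = e_step(theta)           # (2) E-step
        theta = m_step(stats)           # (3) M-step
        ll_new = log_likelihood(theta)
        if abs(ll_new - ll_old) < tol:  # (4) declare convergence
            break
        ll_old = ll_new
    return theta
```

Each iteration is guaranteed not to decrease $\log P(Y\mid\theta)$, which is what makes the likelihood-change test a reasonable stopping criterion.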

Gaussian Mixture Models

A Gaussian mixture distribution is a probability distribution of the form
$$P(x\mid\theta)=\sum_{k=1}^K\alpha_k\phi(x\mid\theta_k)$$
where the $\alpha_k$ are mixing coefficients with $\sum_{k=1}^K\alpha_k=1$ and $\alpha_k\geq0$, and $\phi(x\mid\theta_k)$ is the Gaussian density with $\theta_k=(\mu_k,\sigma_k^2)$:
$$\phi(x\mid\theta_k)=\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)$$
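As a quick sketch (the helper names are my own), this density can be evaluated directly from the definition:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """phi(x | theta_k) for theta_k = (mu_k, sigma_k^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def mixture_pdf(x, alphas, mus, sigma2s):
    """P(x | theta) = sum_k alpha_k * phi(x | theta_k)."""
    return sum(a * gaussian_pdf(x, m, s2)
               for a, m, s2 in zip(alphas, mus, sigma2s))

# e.g. a two-component mixture evaluated at x = 0
print(mixture_pdf(0.0, [0.4, 0.6], [-1.0, 2.0], [1.0, 0.5]))
```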
Assume each sample is generated by the Gaussian mixture distribution as follows: first a mixture component is selected according to the probabilities $\alpha_1,\alpha_2,\dots,\alpha_K$, then an observation is drawn from the selected component. Here the observations are known, but which Gaussian each observation came from is not; we represent this with the latent variable $\gamma_{jk}$, defined by
$$\gamma_{jk}=\begin{cases}1,&\text{observation }j\text{ comes from component }k\\0,&\text{otherwise}\end{cases}\qquad j=1,2,\dots,N;\;k=1,2,\dots,K$$
The complete data are then
$$(x_j,\gamma_{j1},\gamma_{j2},\dots,\gamma_{jK}),\quad j=1,2,\dots,N$$
so the complete-data likelihood is
$$\begin{aligned} P(\gamma,x\mid\theta) &=\prod_{j=1}^{N}P(x_j,\gamma_{j1},\gamma_{j2},\dots,\gamma_{jK}\mid\theta)\\ &=\prod_{j=1}^{N}\prod_{k=1}^{K}\left[\alpha_k\phi(x_j\mid\theta_k)\right]^{\gamma_{jk}}\\ &=\prod_{k=1}^{K}\alpha_k^{n_k}\prod_{j=1}^{N}\left[\phi(x_j\mid\theta_k)\right]^{\gamma_{jk}}\\ &=\prod_{k=1}^{K}\alpha_k^{n_k}\prod_{j=1}^{N}\left[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\left(-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\right)\right]^{\gamma_{jk}} \end{aligned}$$
where $n_k=\sum_{j=1}^{N}\gamma_{jk}$ and $\sum_{k=1}^{K}n_k=N$. The complete-data log-likelihood is therefore
$$\log P(\gamma,x\mid\theta)=\sum_{k=1}^{K}n_k\log\alpha_k+\sum_{k=1}^{K}\sum_{j=1}^{N}\gamma_{jk}\left(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\right)$$
The $Q$ function to be maximized is
$$\begin{aligned} Q(\theta,\theta^{(i)}) &=E_\gamma[\log P(\gamma,x\mid\theta)\mid x,\theta^{(i)}]\\ &=E_\gamma\left\{\sum_{k=1}^{K}n_k\log\alpha_k+\sum_{k=1}^{K}\sum_{j=1}^{N}\gamma_{jk}\left(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\right)\right\}\\ &=\sum_{k=1}^{K}\left\{\sum_{j=1}^{N}(E\gamma_{jk})\log\alpha_k+\sum_{j=1}^{N}(E\gamma_{jk})\left(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\right)\right\} \end{aligned}$$
Here we need to compute $E\gamma_{jk}=E(\gamma_{jk}\mid x,\theta)$:
$$\begin{aligned} E(\gamma_{jk}\mid x,\theta) &=P(\gamma_{jk}=1\mid x,\theta)\\ &=\frac{P(\gamma_{jk}=1,x_j\mid\theta)}{P(x_j\mid\theta)}\\ &=\frac{P(\gamma_{jk}=1,x_j\mid\theta)}{\sum_{k'=1}^{K}P(\gamma_{jk'}=1,x_j\mid\theta)}\\ &=\frac{P(x_j\mid\gamma_{jk}=1,\theta)P(\gamma_{jk}=1\mid\theta)}{\sum_{k'=1}^{K}P(x_j\mid\gamma_{jk'}=1,\theta)P(\gamma_{jk'}=1\mid\theta)}\\ &=\frac{\alpha_k\phi(x_j\mid\theta_k)}{\sum_{k'=1}^{K}\alpha_{k'}\phi(x_j\mid\theta_{k'})} \end{aligned}$$
$E(\gamma_{jk}\mid x,\theta)$ is the probability, under the current parameters, that the $j$-th observation comes from the $k$-th mixture component; we denote it $\hat{\gamma}_{jk}$.
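In NumPy this E-step is a single vectorized computation. The sketch below reuses the `gaussian_pdf` helper from the earlier snippet and assumes (my convention) data of shape `(N,)` and per-component parameter arrays of shape `(K,)`:

```python
def e_step(x, alphas, mus, sigma2s):
    """Responsibilities: gamma_hat[j, k] = P(observation j came from component k)."""
    # weighted[j, k] = alpha_k * phi(x_j | theta_k), shape (N, K) via broadcasting
    weighted = alphas[None, :] * gaussian_pdf(x[:, None], mus[None, :], sigma2s[None, :])
    # Normalize each row by sum_k' alpha_k' * phi(x_j | theta_k')
    return weighted / weighted.sum(axis=1, keepdims=True)
```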
Substituting $\hat{\gamma}_{jk}$ and writing $n_k=\sum_{j=1}^{N}\hat{\gamma}_{jk}$, the $Q$ function becomes
$$Q(\theta,\theta^{(i)})=\sum_{k=1}^{K}\left\{n_k\log\alpha_k+\sum_{j=1}^{N}\hat{\gamma}_{jk}\left(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(x_j-\mu_k)^2}{2\sigma_k^2}\right)\right\}$$
To maximize $Q(\theta,\theta^{(i)})$, we take its partial derivative with respect to each parameter and set it to zero.

Partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\mu_k$:
$$\frac{\partial Q(\theta,\theta^{(i)})}{\partial\mu_k}=\sum_{j=1}^{N}\frac{\hat{\gamma}_{jk}(x_j-\mu_k)}{\sigma_k^2}$$
Setting this to zero gives
$$\hat{\mu}_k=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}x_j}{\sum_{j=1}^{N}\hat{\gamma}_{jk}}$$
Partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\sigma_k^2$:
$$\frac{\partial Q(\theta,\theta^{(i)})}{\partial\sigma_k^2}=-\frac{1}{2}\sum_{j=1}^{N}\hat{\gamma}_{jk}\left(\frac{1}{\sigma_k^2}-\frac{(x_j-\mu_k)^2}{\sigma_k^4}\right)$$
Setting this to zero gives
$$\hat{\sigma}_k^2=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}(x_j-\mu_k)^2}{\sum_{j=1}^{N}\hat{\gamma}_{jk}}$$
Partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\alpha_k$: because of the constraint $\sum_{k=1}^{K}\alpha_k=1$, we work with the Lagrangian of $Q(\theta,\theta^{(i)})$,
$$L(\theta,\theta^{(i)})=Q(\theta,\theta^{(i)})+\lambda\left(\sum_{k=1}^{K}\alpha_k-1\right)$$
Taking the partial derivative,
$$\frac{\partial L(\theta,\theta^{(i)})}{\partial\alpha_k}=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}}{\alpha_k}+\lambda$$
Setting it to zero and multiplying through by $\alpha_k$,
$$\sum_{j=1}^{N}\hat{\gamma}_{jk}+\lambda\alpha_k=0$$
To solve for $\lambda$, sum this over all components:
$$\sum_{k=1}^{K}\sum_{j=1}^{N}\hat{\gamma}_{jk}+\lambda\sum_{k=1}^{K}\alpha_k=0$$
Since $\sum_{k=1}^{K}\sum_{j=1}^{N}\hat{\gamma}_{jk}=N$ and $\sum_{k=1}^{K}\alpha_k=1$, this gives $\lambda=-N$, so
$$\hat{\alpha}_k=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}}{N}$$
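Collecting the three closed-form updates gives the M-step; the sketch below uses the same array conventions as the `e_step` snippet above:

```python
def m_step(x, gamma_hat):
    """Closed-form M-step updates derived above."""
    n_k = gamma_hat.sum(axis=0)                       # n_k = sum_j gamma_hat[j, k]
    mus = (gamma_hat * x[:, None]).sum(axis=0) / n_k  # mu_hat_k
    sigma2s = (gamma_hat * (x[:, None] - mus[None, :]) ** 2).sum(axis=0) / n_k  # sigma_hat_k^2
    alphas = n_k / x.shape[0]                         # alpha_hat_k = n_k / N
    return alphas, mus, sigma2s
```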

The multivariate Gaussian case is analogous: the update formulas take almost exactly the same form, but the computation is more involved because the derivations require matrix calculus.

The fitted Gaussian mixture model yields a clustering method: assign each sample to a component according to the responsibilities $\hat{\gamma}$. Sample $x_j$ is assigned to cluster
$$\arg\max_k\hat{\gamma}_{jk}$$
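A hypothetical end-to-end run (synthetic data, component count, and initial values are all my own choices) that alternates the two sketched steps and then clusters by $\arg\max_k\hat{\gamma}_{jk}$:

```python
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300),   # samples from component 1
                    rng.normal(3.0, 0.5, 200)])   # samples from component 2

alphas = np.array([0.5, 0.5])   # rough initial guesses
mus = np.array([-1.0, 1.0])
sigma2s = np.array([1.0, 1.0])

for _ in range(100):            # fixed iteration budget for simplicity
    gamma_hat = e_step(x, alphas, mus, sigma2s)
    alphas, mus, sigma2s = m_step(x, gamma_hat)

labels = gamma_hat.argmax(axis=1)   # cluster assignment for each sample
```

Like any EM run, this converges to a local maximum, so in practice one would try several random initializations.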
