The EM Algorithm
The EM algorithm is a maximum likelihood estimation method for probabilistic models that contain latent variables.
Let $Y$ denote the observed data, $Z$ the latent data, and $\theta$ the parameters to be estimated. $Y$ and $Z$ together are called the complete data, while the observed data $Y$ alone are called the incomplete data. If the probability distribution of $Y$ is $P(Y\mid\theta)$, the log-likelihood of the incomplete data is $\log P(Y\mid\theta)$; if the joint distribution of $Y$ and $Z$ is $P(Y,Z\mid\theta)$, the log-likelihood of the complete data is $\log P(Y,Z\mid\theta)$.
For a model with latent variables, the goal is to maximize the log-likelihood of the observed data $Y$ with respect to the model parameters $\theta$, that is, to maximize
$$
L(\theta)=\log P(Y\mid\theta)=\log\Big(\sum_z P(Y\mid Z,\theta)P(Z\mid\theta)\Big)
$$
The difficulty is that this objective involves unobserved data and the logarithm of a sum. The EM algorithm maximizes $L(\theta)$ iteratively. Suppose the estimate of $\theta$ after the $i$th iteration is $\theta^{(i)}$. We want a new estimate $\theta$ that increases $L(\theta)$, i.e. $L(\theta)>L(\theta^{(i)})$, and that gradually approaches the maximum. To this end, consider the difference
$$
L(\theta)-L(\theta^{(i)})=\log\Big(\sum_z P(Y\mid Z,\theta)P(Z\mid\theta)\Big)-\log P(Y\mid\theta^{(i)})
$$
Applying Jensen's inequality (since $\log$ is concave, $\log\sum_z\lambda_z y_z\geq\sum_z\lambda_z\log y_z$ for $\lambda_z\geq0$, $\sum_z\lambda_z=1$):
$$
\begin{aligned}
L(\theta)-L(\theta^{(i)})
&=\log\Big(\sum_z P(Z\mid Y,\theta^{(i)})\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}\Big)-\log P(Y\mid\theta^{(i)})\\
&\geq\sum_z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}-\log P(Y\mid\theta^{(i)})\\
&=\sum_z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})}-\sum_z P(Z\mid Y,\theta^{(i)})\log P(Y\mid\theta^{(i)})\\
&=\sum_z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}
\end{aligned}
$$
Define
$$
B(\theta,\theta^{(i)})=L(\theta^{(i)})+\sum_z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}
$$
Then
$$
L(\theta)\geq B(\theta,\theta^{(i)})
$$
Moreover $L(\theta^{(i)})=B(\theta^{(i)},\theta^{(i)})$: at $\theta=\theta^{(i)}$ every log term in the sum becomes $\log\frac{P(Y,Z\mid\theta^{(i)})}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}=\log 1=0$. Hence any $\theta$ that increases $B(\theta,\theta^{(i)})$ also increases $L(\theta)$, and to make the increase in $L(\theta)$ as large as possible we choose $\theta^{(i+1)}$ to maximize $B(\theta,\theta^{(i)})$:
$$
\begin{aligned}
\theta^{(i+1)}
&=\arg\max_{\theta}B(\theta,\theta^{(i)})\\
&=\arg\max_{\theta}\Big(L(\theta^{(i)})+\sum_z P(Z\mid Y,\theta^{(i)})\log\frac{P(Y\mid Z,\theta)P(Z\mid\theta)}{P(Z\mid Y,\theta^{(i)})P(Y\mid\theta^{(i)})}\Big)\\
&=\arg\max_{\theta}\sum_z P(Z\mid Y,\theta^{(i)})\log\big(P(Y\mid Z,\theta)P(Z\mid\theta)\big)\\
&=\arg\max_{\theta}\sum_z P(Z\mid Y,\theta^{(i)})\log P(Y,Z\mid\theta)\\
&=\arg\max_{\theta}E_Z[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}]
\end{aligned}
$$
where the third line drops all terms that do not depend on $\theta$.
Define
$$
Q(\theta,\theta^{(i)})=E_Z[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}]
$$
That is, $Q(\theta,\theta^{(i)})$ is the expectation of the complete-data log-likelihood $\log P(Y,Z\mid\theta)$ over the unobserved data $Z$, given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$. Therefore
$$
\theta^{(i+1)}=\arg\max\limits_{\theta}Q(\theta,\theta^{(i)})
$$
The EM algorithm:
Input: observed data $Y$, latent data $Z$, joint distribution $P(Y,Z\mid\theta)$, conditional distribution $P(Z\mid Y,\theta)$
Output: model parameters $\theta$
(1) Choose an initial value $\theta^{(0)}$ for the parameters and start iterating
(2) E-step: compute $Q(\theta,\theta^{(i)})=\sum_z P(Z\mid Y,\theta^{(i)})\log P(Y,Z\mid\theta)$
(3) M-step: compute $\theta^{(i+1)}=\arg\max\limits_{\theta}Q(\theta,\theta^{(i)})$
(4) Repeat steps (2) and (3) until convergence
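As a minimal sketch of this loop in code (the `e_step` and `m_step` callables are hypothetical placeholders to be supplied per model; concrete versions for the Gaussian mixture case appear below, and a fixed iteration count stands in for a proper convergence test):

```python
def em(y, theta, e_step, m_step, n_iter=100):
    """Generic EM loop.  e_step(y, theta) returns the expectations needed
    to form Q(theta, theta_i); m_step(y, expectations) returns
    argmax_theta Q(theta, theta_i), i.e. the next estimate theta^(i+1)."""
    for _ in range(n_iter):
        expectations = e_step(y, theta)   # E-step under theta^(i)
        theta = m_step(y, expectations)   # M-step
    return theta
```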
Gaussian Mixture Models
A Gaussian mixture distribution is a probability distribution of the form
$$
P(y\mid\theta)=\sum_{k=1}^K\alpha_k\phi(y\mid\theta_k)
$$
where the $\alpha_k$ are mixing coefficients with $\sum_{k=1}^K\alpha_k=1$ and $\alpha_k\geq0$, and $\phi(y\mid\theta_k)$ is the Gaussian density with parameters $\theta_k=(\mu_k,\sigma_k^2)$:
$$
\phi(y\mid\theta_k)=\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y-\mu_k)^2}{2\sigma_k^2}\Big)
$$
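In code this density is a one-liner; a sketch with NumPy (`scipy.stats.norm.pdf` computes the same thing):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Univariate Gaussian density phi(y | mu, sigma^2); broadcasts over arrays."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
```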
Assume the samples are generated by the Gaussian mixture as follows: first a mixture component is chosen according to $\alpha_1,\alpha_2,\dots,\alpha_K$, and then an observation is generated from the chosen component. The observations are known, but which Gaussian each observation came from is unknown; this is represented by the latent variable $\gamma_{jk}$, defined as follows:
$$
\gamma_{jk}=\begin{cases}
1, & \text{the $j$th observation comes from the $k$th component}\\
0, & \text{otherwise}
\end{cases}
\qquad j=1,2,\dots,N;\; k=1,2,\dots,K
$$
The complete data are then
$$
(y_j,\gamma_{j1},\gamma_{j2},\dots,\gamma_{jK}),\quad j=1,2,\dots,N
$$
and the likelihood of the complete data is
$$
\begin{aligned}
P(\gamma,y\mid\theta)
&=\prod_{j=1}^{N}P(y_j,\gamma_{j1},\gamma_{j2},\dots,\gamma_{jK}\mid\theta)\\
&=\prod_{j=1}^{N}\prod_{k=1}^{K}[\alpha_k\phi(y_j\mid\theta_k)]^{\gamma_{jk}}\\
&=\prod_{k=1}^{K}\alpha_k^{n_k}\prod_{j=1}^{N}[\phi(y_j\mid\theta_k)]^{\gamma_{jk}}\\
&=\prod_{k=1}^{K}\alpha_k^{n_k}\prod_{j=1}^{N}\Big[\frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\Big(-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big]^{\gamma_{jk}}
\end{aligned}
$$
where $n_k=\sum_{j=1}^{N}\gamma_{jk}$ and $\sum_{k=1}^{K}n_k=N$. The log-likelihood of the complete data is then
$$
\log P(\gamma,y\mid\theta)=\sum_{k=1}^{K}n_k\log\alpha_k+\sum_{k=1}^{K}\sum_{j=1}^{N}\gamma_{jk}\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)
$$
The $Q$ function to be maximized is
$$
\begin{aligned}
Q(\theta,\theta^{(i)})
&=E_\gamma[\log P(\gamma,y\mid\theta)\mid y,\theta^{(i)}]\\
&=E_\gamma\Big\{\sum_{k=1}^{K}n_k\log\alpha_k+\sum_{k=1}^{K}\sum_{j=1}^{N}\gamma_{jk}\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big\}\\
&=\sum_{k=1}^{K}\Big\{\sum_{j=1}^{N}(E\gamma_{jk})\log\alpha_k+\sum_{j=1}^{N}(E\gamma_{jk})\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big\}
\end{aligned}
$$
This requires computing $E\gamma_{jk}=E(\gamma_{jk}\mid y,\theta)$:
$$
\begin{aligned}
E(\gamma_{jk}\mid y,\theta)
&=P(\gamma_{jk}=1\mid y,\theta)\\
&=\frac{P(\gamma_{jk}=1,y_j\mid\theta)}{P(y_j\mid\theta)}\\
&=\frac{P(\gamma_{jk}=1,y_j\mid\theta)}{\sum_{k'=1}^{K}P(\gamma_{jk'}=1,y_j\mid\theta)}\\
&=\frac{P(y_j\mid\gamma_{jk}=1,\theta)P(\gamma_{jk}=1\mid\theta)}{\sum_{k'=1}^{K}P(y_j\mid\gamma_{jk'}=1,\theta)P(\gamma_{jk'}=1\mid\theta)}\\
&=\frac{\alpha_k\phi(y_j\mid\theta_k)}{\sum_{k'=1}^{K}\alpha_{k'}\phi(y_j\mid\theta_{k'})}
\end{aligned}
$$
(the dummy index in the denominators is written $k'$ to distinguish it from the fixed $k$)
$E(\gamma_{jk}\mid y,\theta)$ is the probability, under the current parameters, that the $j$th observation comes from the $k$th mixture component; denote it by $\hat{\gamma}_{jk}$.
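In code, the E-step is just this formula evaluated for every $(j,k)$ pair at once. A sketch building on the `gaussian_pdf` helper above (`y` is a length-$N$ array; `alpha`, `mu`, `sigma2` are length-$K$ arrays):

```python
def e_step(y, alpha, mu, sigma2):
    """Responsibility matrix: gamma_hat[j, k] is alpha_k * phi(y_j | theta_k)
    normalized over k, so each row sums to 1 (an N x K matrix)."""
    weighted = alpha * gaussian_pdf(y[:, None], mu, sigma2)   # N x K numerators
    return weighted / weighted.sum(axis=1, keepdims=True)     # row-normalize
```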
Putting everything together,
$$
Q(\theta,\theta^{(i)})=\sum_{k=1}^{K}\Big\{n_k\log\alpha_k+\sum_{j=1}^{N}\hat{\gamma}_{jk}\Big(\log\frac{1}{\sqrt{2\pi}}-\log\sigma_k-\frac{(y_j-\mu_k)^2}{2\sigma_k^2}\Big)\Big\}
$$
where $n_k=\sum_{j=1}^{N}\hat{\gamma}_{jk}$. It remains to maximize $Q(\theta,\theta^{(i)})$, which requires its partial derivative with respect to each parameter.
Partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\mu_k$:
$$
\frac{\partial Q(\theta,\theta^{(i)})}{\partial\mu_k}=\sum_{j=1}^{N}\frac{\hat{\gamma}_{jk}(y_j-\mu_k)}{\sigma^2_k}
$$
Setting this to zero and solving for $\mu_k$ gives
$$
\hat{\mu}_k=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}\,y_j}{\sum_{j=1}^{N}\hat{\gamma}_{jk}}
$$
Partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\sigma_k^2$:
$$
\frac{\partial Q(\theta,\theta^{(i)})}{\partial\sigma_k^2}=-\frac{1}{2}\sum_{j=1}^{N}\hat{\gamma}_{jk}\Big(\frac{1}{\sigma^2_k}-\frac{(y_j-\mu_k)^2}{\sigma^4_k}\Big)
$$
Setting this to zero and solving for $\sigma_k^2$ gives
$$
\hat{\sigma}_k^2=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}(y_j-\mu_k)^2}{\sum_{j=1}^{N}\hat{\gamma}_{jk}}
$$
Partial derivative of $Q(\theta,\theta^{(i)})$ with respect to $\alpha_k$: because of the constraint $\sum_{k=1}^{K}\alpha_k=1$, consider the Lagrangian of $Q(\theta,\theta^{(i)})$
$$
L(\theta,\theta^{(i)})=Q(\theta,\theta^{(i)})+\lambda\Big(\sum_{k=1}^K\alpha_k-1\Big)
$$
and take the partial derivative:
$$
\frac{\partial L(\theta,\theta^{(i)})}{\partial\alpha_k}=\frac{\sum_{j=1}^N\hat{\gamma}_{jk}}{\alpha_k}+\lambda
$$
Setting this to zero and multiplying through by $\alpha_k$ gives
$$
\sum_{j=1}^N\hat{\gamma}_{jk}+\lambda\alpha_k=0
$$
To solve for $\lambda$, sum this equation over all $K$ components:
$$
\sum_{k=1}^K\sum_{j=1}^N\hat{\gamma}_{jk}+\lambda\sum_{k=1}^K\alpha_k=0
$$
解得λ=−N\lambda=-Nλ=−N,所以
$$
\hat{\alpha}_k=\frac{\sum_{j=1}^{N}\hat{\gamma}_{jk}}{N}
$$
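These three closed-form updates constitute the entire M-step. A sketch in the same notation as the hypothetical `e_step` above (`gamma_hat` is the $N\times K$ responsibility matrix):

```python
def m_step(y, gamma_hat):
    """Closed-form M-step: responsibility-weighted means and variances,
    plus the mixing coefficients."""
    n_k = gamma_hat.sum(axis=0)                         # n_k = sum_j gamma_hat[j, k]
    mu = (gamma_hat * y[:, None]).sum(axis=0) / n_k     # mu_k update
    sigma2 = (gamma_hat * (y[:, None] - mu) ** 2).sum(axis=0) / n_k  # sigma_k^2 update
    alpha = n_k / len(y)                                # alpha_k = n_k / N
    return alpha, mu, sigma2
```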
The multivariate Gaussian case is analogous: the update formulas are nearly identical in form, but the computation is more involved, requiring more complicated derivatives.
The fitted Gaussian mixture model yields a clustering method: assign each sample to a class according to the values of $\hat{\gamma}$. For a sample $y_j$, its class is
$$
\arg\max\limits_{k}\hat{\gamma}_{jk}
$$
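Tying the sketches together on synthetic data (all function names are the hypothetical helpers defined above; a fixed iteration count again stands in for a convergence check):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data drawn from two Gaussian components.
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial parameters (alpha, mu, sigma2) for K = 2 components.
theta0 = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))

# Fit with the generic EM loop, unpacking the parameter tuple for e_step.
alpha, mu, sigma2 = em(y, theta0, lambda y, th: e_step(y, *th), m_step)

# Cluster assignment: each sample goes to the component with the largest
# responsibility, i.e. argmax_k of gamma_hat[j, k].
labels = e_step(y, alpha, mu, sigma2).argmax(axis=1)
```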