Advanced Machine Learning (6): The EM Algorithm

Prerequisites

  1. Jensen's inequality
    If $\theta_1+\theta_2+\dots+\theta_n=1$ with each $\theta_i\ge 0$, and $f(x)$ is a convex function, then
    $$f(\theta_1 x_1+\dots+\theta_n x_n)\le\theta_1 f(x_1)+\dots+\theta_n f(x_n)$$
    $$\text{if}\quad p(x)\ge 0\ \text{on}\ S\subset\operatorname{dom}f,\ \int_S p(x)\,dx=1,\quad\text{then}\quad f\left(\int_S p(x)\,x\,dx\right)\le\int_S p(x)f(x)\,dx$$
    (A numeric check follows this list.)
  2. The problem EM solves
    The model contains a latent variable that cannot be observed directly, yet this variable influences the values of the other variables.
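As a quick sanity check of Jensen's inequality above, here is a minimal sketch using the convex function $f(x)=x^2$; the weights and points are arbitrary illustrative choices:

```python
import numpy as np

# Convex function for the check; any convex f works.
f = lambda x: x ** 2

# Arbitrary non-negative weights summing to 1, and arbitrary points.
theta = np.array([0.2, 0.5, 0.3])
x = np.array([-1.0, 0.5, 2.0])

lhs = f(np.dot(theta, x))          # f(sum_i theta_i * x_i)
rhs = np.dot(theta, f(x))          # sum_i theta_i * f(x_i)
print(lhs, "<=", rhs, lhs <= rhs)  # 0.4225 <= 1.525 True
```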

GMM: The Gaussian Mixture Model

Background: the random variable $X$ is a mixture of $k$ Gaussian distributions, where the probability of drawing from each Gaussian is $\pi_1,\pi_2,\dots,\pi_k$.
The $i$-th Gaussian has mean $\mu_i$ and covariance $\Sigma_i$.
Suppose we observe a set of samples $x_1,x_2,\dots,x_n$ of $X$. The observed data come with unobserved latent data, namely which component each sample belongs to, so the likelihood is
$$L_{\pi,\mu,\Sigma}(x)=\prod_{i=1}^N p(x_i)=\prod_{i=1}^N\sum_{k=1}^K p(x_i,z_i=k)=\prod_{i=1}^N\left(\sum_{k=1}^K\pi_k\,N(x_i\mid\mu_k,\Sigma_k)\right)$$
The second equality writes the marginal of $x_i$ as $\sum_{k=1}^K p(x_i,z_i=k)$; the third uses the factorization $p(x_i,z_i=k)=\pi_k\,p(x_i\mid z_i=k)$.

The log-likelihood is therefore
$$l_{\pi,\mu,\Sigma}(x)=\sum_{i=1}^N\log\sum_{k=1}^K\pi_k\,N(x_i\mid\mu_k,\Sigma_k)$$

To solve this problem, the procedure is split into two steps:

  1. Estimate which component each data point came from (E-step)
    Estimate the probability that the data were generated by each component: for each sample $x_i$, the probability that it was generated by the $k$-th component is
    $$\gamma(i,k)=\frac{\pi_k\,N(x_i\mid\mu_k,\Sigma_k)}{\sum_{j=1}^K\pi_j\,N(x_i\mid\mu_j,\Sigma_j)}$$
    The $\mu$ and $\Sigma$ in this expression are themselves quantities still to be estimated.
    We therefore iterate: when computing $\gamma(i,k)$, assume $\mu$ and $\Sigma$ are known, i.e., initial values must be supplied up front (the procedure is sensitive to initialization, so some outside knowledge helps). $\gamma(i,k)$ can also be viewed as the contribution component $k$ makes to generating the data point $x_i$.
  2. Estimate each component's parameters (M-step)
    For a sample $x_i$, each component $k$ can be viewed as generating the fraction $\gamma(i,k)x_i$, and these fractions together make up $x_i$; component $k$ itself is an ordinary Gaussian. The updates are (a code sketch follows this list):
    $$\begin{aligned} N_{k}&=\sum_{i=1}^{N}\gamma(i,k) \\ \mu_{k}&=\frac{1}{N_{k}}\sum_{i=1}^{N}\gamma(i,k)\,x_{i} \\ \Sigma_{k}&=\frac{1}{N_{k}}\sum_{i=1}^{N}\gamma(i,k)\left(x_{i}-\mu_{k}\right)\left(x_{i}-\mu_{k}\right)^{T} \\ \pi_{k}&=\frac{N_{k}}{N}=\frac{1}{N}\sum_{i=1}^{N}\gamma(i,k) \end{aligned}$$
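To make the two steps concrete, here is a minimal NumPy/SciPy sketch of EM for a one-dimensional GMM; the synthetic data, the number of components, the initialization, and the iteration count are illustrative assumptions rather than part of the derivation:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters (illustrative assumption).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

K, N = 2, len(x)
pi = np.full(K, 1.0 / K)       # mixture weights pi_k
mu = rng.choice(x, K)          # means mu_k (random init; EM is init-sensitive)
sigma = np.full(K, x.std())    # standard deviations (1-D case of Sigma_k)

for _ in range(100):
    # E-step: gamma(i,k) = pi_k N(x_i|mu_k,sigma_k) / sum_j pi_j N(x_i|mu_j,sigma_j)
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(K)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: the closed-form updates N_k, mu_k, Sigma_k, pi_k from the text.
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / N

print("pi:", pi, "mu:", mu, "sigma:", sigma)
```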

The EM Algorithm

Let $X$ be the observed variable, with density $p_{\theta}(x)$.
Let $Z$ be the missing (latent) variable; introduce for it a distribution function $Q(z)$ with density $q(z)$.
$p_{\theta}(X,Z)$ is the true joint distribution of $(X,Z)$, and $p_{\theta}(Z\mid X)$ is the conditional distribution of $Z$ given $X$.

The log-likelihood function is then:
$$l(\theta)=\sum_i^n \log p_{\theta}(x_i)=\sum_i^n \log \sum_z p_{\theta}(x_i,z)$$
Because $z$ is a latent variable, the parameter estimates cannot conveniently be found directly.
Strategy: compute a lower bound on $l(\theta)$ and maximize that bound; repeat until convergence to a local maximum. In essence, this iterative process keeps raising the lower bound.

Computing the lower bound

Let $q_i$ be some distribution over $Z$ with $q_i\ge 0$. Then:
$$\begin{aligned} l(\theta)=&\sum_{i=1}^{m}\log\sum_{z_i=1}^{k} p(x_i,z_i;\theta)\\ =&\sum_{i=1}^{m}\log\sum_{z_i=1}^{k} q_i(z_i)\,\frac{p(x_i,z_i;\theta)}{q_i(z_i)}\\ \ge&\sum_{i=1}^{m}\sum_{z_i=1}^{k} q_i(z_i)\log\frac{p(x_i,z_i;\theta)}{q_i(z_i)}\qquad\text{(Jensen's inequality)} \end{aligned}$$
The last step applies Jensen's inequality to $\log$, which is concave, so the direction of the inequality is reversed relative to the convex statement above. Equality holds when $\frac{p(x_i,z_i;\theta)}{q_i(z_i)}=c$ for all $i$, i.e., when the ratio is constant.

E-step

Since $\frac{p(x_i,z_i;\theta)}{q_i(z_i)}=c$ and $\sum_{z_i=1}^k q_i(z_i)=1$,
we obtain (this updates $k\times m$ values):
$$\begin{aligned} q_i(z_i)&=\frac{p(x_i,z_i;\theta)}{\sum_{z_i} p(x_i,z_i;\theta)}\\ &=\frac{p(x_i,z_i;\theta)}{p(x_i;\theta)}\\ &=p(z_i\mid x_i;\theta) \end{aligned}$$
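In the discrete case this E-step is just Bayes' rule applied per sample. A minimal sketch, where the joint table `p_xz` with entries $p(x_i,z_i=k;\theta)$ is a hypothetical illustrative input:

```python
import numpy as np

# Hypothetical joint probabilities p(x_i, z_i = k; theta):
# rows index samples i, columns index latent states k.
p_xz = np.array([[0.10, 0.30],
                 [0.25, 0.05],
                 [0.15, 0.15]])

# E-step: q_i(z_i) = p(x_i, z_i; theta) / p(x_i; theta) = p(z_i | x_i; theta).
q = p_xz / p_xz.sum(axis=1, keepdims=True)
print(q)  # each row sums to 1
```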

M-step

Substitute the updated $q_i(z_i)$ into $\sum_{i=1}^{m}\sum_{z_i} q_i(z_i)\log\frac{p(x_i,z_i;\theta)}{q_i(z_i)}$ and update the parameters $\theta$ by maximizing this expression (for a GMM, $\theta$ comprises $\mu_j,\Sigma_j,\pi_j$). Since $q_i(z_i)$ does not depend on $\theta$, the denominator contributes only a constant, and it suffices to maximize
$$\begin{aligned} &\sum_{i=1}^{m}\sum_{z_i=1}^{k} q_i(z_i)\log p(x_i,z_i;\theta)\\ =&\sum_{i=1}^{m}\sum_{z_i=1}^{k} q_i(z_i)\log p(x_i\mid z_i;\theta)\,p(z_i;\theta)\\ =&\sum_{i=1}^{m}\sum_{z_i=1}^{k}\Bigl[q_i(z_i)\log p(x_i\mid z_i;\theta)+q_i(z_i)\log p(z_i;\theta)\Bigr] \end{aligned}$$

The M-step for the GMM

$$\begin{aligned} &\sum_{i=1}^{m}\sum_{z_i} q_i(z_i)\log p(x_i,z_i;\theta)\\ =&\sum_{i=1}^{m}\sum_{z_i} q_i(z_i)\log p(x_i\mid z_i;\theta)+\sum_{i=1}^{m}\sum_{z_i} q_i(z_i)\log p(z_i;\theta) \end{aligned}$$
Note that in the GMM the only parameters in the first term are $\mu_j,\Sigma_j$, while the second term involves only $\pi_j$, so the two can be maximized separately:
$$\begin{aligned} (\mu_j,\Sigma_j)=&\arg\max\sum_{i=1}^{m}\sum_{z_i=1}^{k} q_i(z_i)\log p(x_i\mid z_i;\theta)\\ \pi_j=&\arg\max\sum_{i=1}^{m}\sum_{z_i=1}^{k} q_i(z_i)\log p(z_i;\theta) \end{aligned}$$
Taking partial derivatives of the first expression with respect to $\mu_j$ and $\Sigma_j$ and setting them to zero yields:

$$\begin{aligned} \mu_j=&\frac{\sum_{i=1}^{m} q_i(z_j)\,x_i}{\sum_{i=1}^{m} q_i(z_j)}\\ \Sigma_j=&\frac{\sum_{i=1}^{m} q_i(z_j)\left(x_i-\mu_j\right)\left(x_i-\mu_j\right)^{T}}{\sum_{i=1}^{m} q_i(z_j)} \end{aligned}$$
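For reference, here is the stationarity computation behind the $\mu_j$ update, written out as a brief sketch (the $\Sigma_j$ case is analogous); it uses only $\log N(x_i\mid\mu_j,\Sigma_j)=-\tfrac{1}{2}(x_i-\mu_j)^T\Sigma_j^{-1}(x_i-\mu_j)+\text{const}$:

$$\frac{\partial}{\partial\mu_j}\sum_{i=1}^{m} q_i(z_j)\log N(x_i\mid\mu_j,\Sigma_j)=\sum_{i=1}^{m} q_i(z_j)\,\Sigma_j^{-1}(x_i-\mu_j)=0\;\Rightarrow\;\mu_j\sum_{i=1}^{m} q_i(z_j)=\sum_{i=1}^{m} q_i(z_j)\,x_i$$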
For the second expression, add the constraint that $p(z_i;\theta)=\pi_{z_i}$ sums to one (the $\ge 0$ constraint can be dropped, since the log's domain already forces positivity); use a Lagrange multiplier and differentiate:
$$L=\sum_{i=1}^{m}\sum_{z_i=1}^{k} q_i(z_i)\log\pi_{z_i}+\beta\Bigl(\sum_{z_i=1}^{k}\pi_{z_i}-1\Bigr)$$
This gives
$$\frac{\partial L}{\partial\pi_{z_i}}=\sum_{i=1}^{m}\frac{q_i(z_i)}{\pi_{z_i}}+\beta=0\;\Rightarrow\;\beta\pi_{z_i}=-\sum_{i=1}^{m} q_i(z_i)\;\Rightarrow\;\beta=-m$$
(summing $\beta\pi_{z_i}=-\sum_i q_i(z_i)$ over $z_i$ and using $\sum_{z_i}\pi_{z_i}=1$, $\sum_{z_i} q_i(z_i)=1$ gives $\beta=-m$), and hence
$$\pi_{z_i}=\frac{1}{m}\sum_{i=1}^{m} q_i(z_i)$$

Procedure

The algorithm alternates the two updates, a coordinate-ascent process.
E-step: $q_{new}(Z)=p_{\theta_{old}}(Z\mid X)$
M-step:
$$\theta_{new}=\arg\max_{\theta}\sum q(Z)\log\frac{p_{\theta}(X,Z)}{q(Z)}$$

For multiple samples $(X_1,Z_1),(X_2,Z_2),\dots,(X_n,Z_n)$,
$$\sum_i^n \log p_{\theta}(X_i)=\sum_i^n\left(\sum_j^K q(Z_i)\log\frac{p_{\theta}(X_i,Z_i)}{q(Z_i)}+D_{\mathrm{KL}}\bigl(Q(Z_i)\,\|\,P_{\theta}(Z_i\mid X_i)\bigr)\right)$$

E-step: update $q_{new}(Z_i=j)$ with $p_{\theta_{old}}(Z_i=j\mid X_i)$:
$$q_{new}(Z_i=j)=p_{\theta_{old}}(Z_i=j\mid X_i)=\frac{p_{\theta_{old}}(X_i\mid Z_i=j)\,\pi_j}{\sum_j p_{\theta_{old}}(X_i\mid Z_i=j)\,\pi_j}$$
M-step: update $\theta_j,\pi_j$:
$$\begin{aligned}\theta_{new}&=\arg\max_{\theta}\sum_i\sum_j q_{new}(Z_i=j)\log\frac{p_{\theta}(X_i,Z_i=j)}{q_{new}(Z_i=j)}\\ &=\arg\max_{\theta}\sum_i\sum_j q_{new}(Z_i=j)\log\frac{p_{\theta}(X_i\mid Z_i=j)\,\pi_j}{q_{new}(Z_i=j)}\end{aligned}$$
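Putting the two updates into a short loop is straightforward. As a minimal runnable sketch of the same recipe, deliberately using a two-component Poisson mixture rather than a GMM to show the updates are generic, with data, initial rates, and iteration count as illustrative assumptions:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
# Toy data from a two-component Poisson mixture (illustrative assumption).
x = np.concatenate([rng.poisson(2.0, 250), rng.poisson(9.0, 250)])

K = 2
pi = np.full(K, 1.0 / K)     # pi_j
lam = np.array([1.0, 5.0])   # theta_j: Poisson rates (arbitrary init)

for _ in range(200):
    # E-step: q_new(Z_i=j) = p(X_i|Z_i=j) pi_j / sum_j p(X_i|Z_i=j) pi_j
    w = poisson.pmf(x[:, None], lam[None, :]) * pi[None, :]
    q = w / w.sum(axis=1, keepdims=True)

    # M-step: maximize sum_i sum_j q log(p_theta(X_i|Z_i=j) pi_j / q);
    # for Poisson components the rate update is a q-weighted mean,
    # and pi_j = (1/m) sum_i q_i(j) as derived in the text.
    lam = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
    pi = q.mean(axis=0)

print("pi:", pi, "lambda:", lam)
```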

EM from the KL-Divergence Perspective

As before, let $X$ be the observed variable, with density $p_{\theta}(x)$.
Let $Z$ be the missing (latent) variable; introduce for it a distribution function $Q(z)$ with density $q(z)$.
$p_{\theta}(X,Z)$ is the true joint distribution of $(X,Z)$, and $p_{\theta}(Z\mid X)$ is the conditional distribution of $Z$ given $X$.

The goal of EM is to maximize $E_X(\log p_{\theta}(X))$. Introducing the latent variable $Z$, this can be written as
$$\max_{\theta} E_X\left\{\log p_{\theta}(X)\right\}=\max_{\theta}\left\{\iint p_{\theta_0}(X)\,q(Z)\log p_{\theta}(X)\,dZ\,dX\right\}$$
In the ideal case, the best choice of $\theta$ is $\theta_0$, so $\max_{\theta} E_X\{\log p_{\theta}(X)\}=E_X\{\log p_{\theta_0}(X)\}$, and the right-hand side can be written as
$$\begin{aligned} E_X\left\{\log p_{\theta_0}(X)\right\}&=\int p_{\theta_0}(X)\log p_{\theta_0}(X)\,dX\\ &=\int p_{\theta_0}(X)\left(\int p_{\theta_0}(Z\mid X)\,dZ\right)\log p_{\theta_0}(X)\,dX\\ &=\iint p_{\theta_0}(X)\,p_{\theta_0}(Z\mid X)\log p_{\theta_0}(X)\,dZ\,dX \end{aligned}$$
From this, the best substitution for $q(Z)$ is $p_{\theta_0}(Z\mid X)$.
For the maximization target $E_X(\log p_{\theta}(X))$, introduce a distribution $q(Z\mid X)$ (ideally $p_{\theta_0}(Z\mid X)$):
$$\begin{aligned} &E_X\left\{\log p_{\theta}(X)\right\}\\ =&E_X\left\{\int q(Z\mid X)\log p_{\theta}(X)\,dZ\right\}\quad\left(\int q(Z\mid X)\,dZ=1\right)\\ =&E_X\left\{\int q(Z\mid X)\log\frac{p_{\theta}(X,Z)}{p_{\theta}(Z\mid X)}\,dZ\right\}\quad\left(p_{\theta}(Z\mid X)=\frac{p_{\theta}(X,Z)}{p_{\theta}(X)}\right)\\ =&E_X\left\{\int q(Z\mid X)\log\frac{p_{\theta}(X,Z)/q(Z\mid X)}{p_{\theta}(Z\mid X)/q(Z\mid X)}\,dZ\right\}\\ =&E_X\left\{\int q(Z\mid X)\log\frac{p_{\theta}(X,Z)}{q(Z\mid X)}\,dZ-\int q(Z\mid X)\log\frac{p_{\theta}(Z\mid X)}{q(Z\mid X)}\,dZ\right\}\\ =&E_X\left\{\int q(Z\mid X)\log\frac{p_{\theta}(X,Z)}{q(Z\mid X)}\,dZ+D_{\mathrm{KL}}\bigl(Q(Z\mid X)\,\|\,P_{\theta}(Z\mid X)\bigr)\right\} \end{aligned}$$
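The final line, lower-bound term plus KL term, can be checked numerically on a discrete toy model; the joint vector `p_xz` below is an arbitrary illustrative assumption:

```python
import numpy as np

# Hypothetical joint p_theta(x, z) for one observed x, over z = 0,1,2.
p_xz = np.array([0.12, 0.28, 0.10])
p_x = p_xz.sum()              # marginal p_theta(x)
post = p_xz / p_x             # posterior p_theta(z | x)

q = np.array([0.5, 0.3, 0.2])  # arbitrary q(z | x)

elbo = np.sum(q * np.log(p_xz / q))   # lower-bound term
kl = np.sum(q * np.log(q / post))     # D_KL(q || p(z|x))
print(np.log(p_x), elbo + kl)         # identical: log p(x) = ELBO + KL
print(elbo <= np.log(p_x))            # bound holds; tight iff q = post
```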
Given $\theta_{old}$, what the E-step actually does is substitute $q(Z\mid X)$ for $P_{\theta}(Z\mid X)$: in the KL-divergence sense, it makes $q(Z\mid X)$ approximate $P_{\theta_{old}}(Z\mid X)$.
In the ideal case, the KL term vanishes and the bound is tight.
[Figure omitted: the decomposition of $\log p_{\theta}(X)$ into the lower-bound term and the KL term, with the two terms highlighted.]
EM is an alternating update process: it first updates the KL term (by choosing $q$), then the lower-bound term (by choosing $\theta$). When maximizing the lower-bound term, the denominator inside the log does not contain the parameters being maximized and can be ignored; viewed through KL divergence, as below, the M-step is therefore also a KL-divergence approximation problem.
$$\begin{aligned} &\operatorname*{argmax}_{\theta}\, E_X\left\{\int p_{\theta_{old}}(Z\mid X)\log p_{\theta}(X\mid Z)\,p_{\theta}(Z)\,dZ\right\}\\ =&\operatorname*{argmax}_{\theta}\, E_X\left\{\int q_{new}(Z\mid X)\log\frac{p_{\theta}(X\mid Z)\,p_{\theta}(Z)}{p_{\theta_0}(X)\,q_{new}(Z\mid X)}\,dZ\right\}\\ =&\operatorname*{argmax}_{\theta}\left\{\iint p_{\theta_0}(X)\,q_{new}(Z\mid X)\log\frac{p_{\theta}(X,Z)}{p_{\theta_0}(X)\,q_{new}(Z\mid X)}\,dZ\,dX\right\}\\ =&\operatorname*{argmin}_{\theta}\, D_{\mathrm{KL}}\bigl(Q_{new}(Z\mid X)\,P_{\theta_0}(X)\,\|\,P_{\theta}(X,Z)\bigr) \end{aligned}$$

Summary

  1. The E-step is an unconstrained optimization problem,
    $$Q_{new}(Z\mid X)=\operatorname*{argmin}_{Q(Z\mid X)}\, E_X\left[D_{\mathrm{KL}}\bigl(Q(Z\mid X)\,\|\,P_{\theta_{old}}(Z\mid X)\bigr)\right]$$
    Given $\theta_{old}$, it suffices to set $q(Z\mid X)=P_{\theta_{old}}(Z\mid X)$ directly.

  2. The M-step is a constrained optimization problem.
    $P_{\theta_{new}}(X\mid Z)$ and $P_{\theta_{new}}(Z)$ both belong to distribution families fixed in advance; in a GMM, for example, the former is a normal distribution and the latter a multinomial distribution. So here we optimize within those families, making the product of the two distributions close, in the KL-divergence sense, to the distribution $Q_{new}(Z\mid X)\,P_{\theta_0}(X)$:
    $$P_{\theta_{new}}(X\mid Z)\,P_{\theta_{new}}(Z)=\operatorname*{argmin}_{P_{\theta}(X\mid Z)P_{\theta}(Z)\in\mathcal{F}_{\theta}}\, D_{\mathrm{KL}}\bigl(Q_{new}(Z\mid X)\,P_{\theta_0}(X)\,\|\,P_{\theta}(X\mid Z)\,P_{\theta}(Z)\bigr)$$
