Variational Inference

Variational inference approximates a complex distribution $p$ with a simpler distribution $q$.

That one sentence is the most direct way to understand variational inference. So why would we choose variational inference in the first place?

Because in most cases the posterior distribution is intractable. If the posterior were easy to compute, we could simply use EM. When the posterior is hard to compute, we instead choose a simple distribution $q$ to approximate the complex posterior. In particular, if $q$ is chosen from the exponential family, the required integrals become much easier to evaluate.
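Concretely, the difficulty is the normalizing constant in Bayes' rule: the posterior requires integrating the joint over all latent variables, and this integral is rarely available in closed form:

$$p(Z \mid X) \;=\; \frac{p(X, Z)}{p(X)} \;=\; \frac{p(X, Z)}{\int p(X, Z)\, dZ}$$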

Evidence Lower Bound (ELBO)

From Bayes' rule, $\ln p(X) = \ln p(X,Z) - \ln p(Z\mid X)$. Continuing the derivation:

$$\ln p(X) = \ln p(X,Z) - \ln p(Z\mid X)$$

$$\ln p(X) = \ln p(X,Z) - \ln q(Z) - \big(\ln p(Z\mid X) - \ln q(Z)\big)$$

$$\ln p(X) = \ln\frac{p(X,Z)}{q(Z)} - \ln\frac{p(Z\mid X)}{q(Z)}$$

Since $q(Z)$ is a probability density, integrate both sides against it:

$$\ln p(X) = \int q(Z)\,\ln\frac{p(X,Z)}{q(Z)}\,dZ - \int q(Z)\,\ln\frac{p(Z\mid X)}{q(Z)}\,dZ$$

$$\ln p(X) = \int q(Z)\,\ln p(X,Z)\,dZ - \int q(Z)\,\ln q(Z)\,dZ - \int q(Z)\,\ln\frac{p(Z\mid X)}{q(Z)}\,dZ$$

Here the first part, $\int q(Z)\ln p(X,Z)\,dZ - \int q(Z)\ln q(Z)\,dZ$, is $\mathcal{L}(q)$, the ELBO; the second part, $-\int q(Z)\ln\frac{p(Z\mid X)}{q(Z)}\,dZ = \int q(Z)\ln\frac{q(Z)}{p(Z\mid X)}\,dZ$, is $\mathrm{KL}(q\,\|\,p)$, so $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$.

What we do is keep updating $q(Z)$ so that it gets closer and closer to $p(Z\mid X)$. Since $\ln p(X)$ is fixed with respect to $q$, the smaller the KL divergence between the two distributions, the larger the lower bound $\mathcal{L}(q)$ becomes; maximizing the ELBO is therefore equivalent to minimizing $\mathrm{KL}(q\,\|\,p)$.
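As a quick sanity check of the identity $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$, here is a minimal sketch with a discrete latent variable, where everything can be enumerated exactly; the joint table and the choice of $q$ are made up purely for illustration.

```python
import numpy as np

# Hypothetical joint p(x0, z) over a discrete latent z with 3 states (made-up numbers).
p_xz = np.array([0.10, 0.25, 0.05])        # p(X = x0, Z = z) for z = 0, 1, 2
p_x = p_xz.sum()                            # evidence p(X = x0)
p_z_given_x = p_xz / p_x                    # exact posterior p(Z | X = x0)

q = np.array([0.2, 0.5, 0.3])               # an arbitrary approximating distribution q(Z)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))        # E_q[ln p(x0, Z) - ln q(Z)]
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))   # KL(q || p(Z | x0))

print(np.log(p_x), elbo + kl)                # the two values coincide
```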

The next question: how should we choose $q(Z)$?

Choosing $q(Z)$

Option 1: assume the dimensions of $Z$ are independent (the mean-field assumption), $q(Z)=\prod_{i=1}^{M} q_i(Z_i)$.

$$\mathcal{L}(q) = \int q(Z)\,\ln p(X,Z)\,dZ - \int q(Z)\,\ln q(Z)\,dZ$$

$$= \int \prod_{i=1}^{M} q_i(Z_i)\,\ln p(X,Z)\,dZ - \int \prod_{i=1}^{M} q_i(Z_i)\,\ln \prod_{i=1}^{M} q_i(Z_i)\,dZ$$

Consider the first term:

$$\int \prod_{i=1}^{M} q_i(Z_i)\,\ln p(X,Z)\,dZ$$

$$= \int_{z_1}\int_{z_2}\cdots\int_{z_M} \prod_{i=1}^{M} q_i(Z_i)\,\ln p(X,Z)\,dz_1\,dz_2\cdots dz_M$$

$$= \int_{z_j} q_j(Z_j)\left(\int\cdots\int \prod_{i\neq j}^{M} q_i(Z_i)\,\ln p(X,Z)\,\prod_{i\neq j}^{M} dZ_i\right) dZ_j$$

$$= \int_{z_j} q_j(Z_j)\left(\int\cdots\int \ln p(X,Z)\,\prod_{i\neq j}^{M} q_i(Z_i)\,dZ_i\right) dZ_j$$

$$= \int_{z_j} q_j(Z_j)\,\mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big]\,dZ_j$$

Now consider the second term. Before deriving it, note the following identity (integrating the joint density leaves the marginals):

$$\int_{x_1}\int_{x_2} \big(f(x_1)+f(x_2)\big)\,p(x_1,x_2)\,dx_1\,dx_2$$

$$= \int_{x_1} f(x_1)\,p(x_1)\,dx_1 + \int_{x_2} f(x_2)\,p(x_2)\,dx_2$$

Extending this to $M$ terms gives the second term:

$$\int \prod_{i=1}^{M} q_i(Z_i)\,\ln \prod_{i=1}^{M} q_i(Z_i)\,dZ$$

$$= \int \prod_{i=1}^{M} q_i(Z_i)\,\sum_{i=1}^{M}\ln q_i(Z_i)\,dZ$$

$$= \sum_{i=1}^{M}\int_{z_i} q_i(Z_i)\,\ln q_i(Z_i)\,dZ_i$$

If we only care about the $j$-th factor, the second term can be written as $\int_{Z_j} q_j(Z_j)\,\ln q_j(Z_j)\,dZ_j + \text{const}$.

Putting the two terms together, as a function of $q_j$ only:

$$\mathcal{L}(q_j) = \int_{z_j} q_j(Z_j)\,\mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big]\,dZ_j - \int_{Z_j} q_j(Z_j)\,\ln q_j(Z_j)\,dZ_j + \text{const}$$
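To make the next step explicit, define a distribution $\tilde{p}_j$ by $\ln \tilde{p}_j(Z_j) = \mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big] + \text{const}$ (the constant normalizes it). The expression above is then a negative KL divergence up to a constant:

$$\mathcal{L}(q_j) = \int_{z_j} q_j(Z_j)\,\ln\frac{\tilde{p}_j(Z_j)}{q_j(Z_j)}\,dZ_j + \text{const} = -\,\mathrm{KL}\big(q_j\,\|\,\tilde{p}_j\big) + \text{const}$$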

This is maximized when the KL divergence is zero, i.e. when $q_j = \tilde{p}_j$, so the update formula for each factor is:

$$\ln q_j^{*}(Z_j) = \mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big] + \text{const}$$
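As an illustration of this coordinate-ascent update, here is a minimal sketch (not from the original post) for the classic model of Bishop's PRML, section 10.1.3: data $x_n \sim \mathcal{N}(\mu, \tau^{-1})$ with priors $\mu \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, under the mean-field factorization $q(\mu,\tau)=q(\mu)\,q(\tau)$. The hyperparameter values and synthetic data below are placeholders.

```python
import numpy as np

# CAVI for x_n ~ N(mu, 1/tau) with priors mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0),
# using the mean-field factorization q(mu, tau) = q(mu) q(tau).
# Each update below is the closed form of ln q*_j = E_{i != j}[ln p(X, Z)] + const.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)      # synthetic data (placeholder)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0            # prior hyperparameters (placeholders)
E_tau = a0 / b0                                    # initial guess for E_q[tau]

for _ in range(50):
    # q*(mu) = N(mu_N, 1/lam_N): expectation of ln p taken over q(tau)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q*(tau) = Gamma(a_N, b_N): expectation of ln p taken over q(mu)
    a_N = a0 + (N + 1) / 2.0
    E_sum_sq = np.sum((x - mu_N) ** 2) + N / lam_N          # E_q[sum_n (x_n - mu)^2]
    E_prior_sq = lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)   # E_q[lam0 (mu - mu0)^2]
    b_N = b0 + 0.5 * (E_sum_sq + E_prior_sq)
    E_tau = a_N / b_N

print("E_q[mu] ~", mu_N, "  E_q[tau] ~", E_tau)
```

Both conditionals are conjugate here, which is why each $\ln q_j^{*}$ update has a closed form; with the placeholder data above, $\mathbb{E}_q[\mu]$ converges to roughly the sample mean and $\mathbb{E}_q[\tau]$ to roughly the inverse sample variance.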

Option 2: exponential family distributions

First we need the notion of an exponential family. Its standard form is

$$h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)$$

where $\eta$ is the natural parameter (canonical parameter) of the distribution, $T(x)$ is the sufficient statistic (usually $T(x)=x$), and $A(\eta)$ is the log partition function.
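For example (a standard fact, not from the original post), the Bernoulli distribution $p(x\mid\pi)=\pi^{x}(1-\pi)^{1-x}$, $x\in\{0,1\}$, fits this template with

$$h(x)=1,\qquad T(x)=x,\qquad \eta=\ln\frac{\pi}{1-\pi},\qquad A(\eta)=\ln\big(1+e^{\eta}\big),$$

so that $p(x\mid\eta)=\exp\big(x\,\eta-\ln(1+e^{\eta})\big)$.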

For exponential families, we first state a key identity, $\mathbb{E}_q[T(x)] = \nabla_\eta A(\eta)$, proved as follows:

$$\int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,dx = 1$$

$$\nabla_\eta\left(\int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,dx\right) = 0$$

$$\Rightarrow \int \nabla_\eta\, h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,dx = 0$$

$$\Rightarrow \int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,T(x)\,dx - \int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,\nabla_\eta A(\eta)\,dx = 0$$

In the second term, the factor in front of $\nabla_\eta A(\eta)$ is a probability density that integrates to 1, so

$$\Rightarrow \mathbb{E}_q[T(x)] = \nabla_\eta A(\eta)$$
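A quick numerical check of this identity for the Bernoulli example above (a sketch; the particular value of $\eta$ is arbitrary):

```python
import numpy as np

eta = 0.7                                    # arbitrary natural parameter (placeholder)
A = lambda e: np.log1p(np.exp(e))            # Bernoulli log partition function

# Left-hand side: E_q[T(x)] = E[x], estimated by Monte Carlo sampling.
p = 1.0 / (1.0 + np.exp(-eta))               # mean parameter sigmoid(eta)
samples = np.random.default_rng(0).binomial(1, p, size=200_000)
lhs = samples.mean()

# Right-hand side: numerical gradient of A at eta.
eps = 1e-5
rhs = (A(eta + eps) - A(eta - eps)) / (2 * eps)

print(lhs, rhs)                              # both are approximately sigmoid(0.7) = 0.668
```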

As before, we want to maximize the ELBO. The chosen $q$ still factorizes as $q(\beta, Z) = q(\beta)\,q(Z)$ (write $\lambda$ for the natural parameter of $q(\beta)$ and $\phi$ for that of $q(Z)$). When updating, we fix one factor and optimize the other, as follows.

To compute $\lambda$, keep only the terms involving $\beta$:

$$\mathcal{L}(\lambda,\phi) = \mathbb{E}_{q(Z,\beta)}\big[\log p(x,Z,\beta)\big] - \mathbb{E}_{q(Z,\beta)}\big[\log q(Z,\beta)\big]$$

$$= \mathbb{E}_{q(Z,\beta)}\big[\log p(\beta\mid Z,x) + \log p(Z,x)\big] - \mathbb{E}_{q(Z)}\big[\log q(Z)\big] - \mathbb{E}_{q(\beta)}\big[\log q(\beta)\big]$$

Dropping the terms that are irrelevant to the current update:

$$= \mathbb{E}_{q(Z,\beta)}\big[\log p(\beta\mid Z,x)\big] - \mathbb{E}_{q(\beta)}\big[\log q(\beta)\big]$$

Substituting the exponential family form for both parts (the complete conditional $p(\beta\mid Z,x)$ with natural parameter $\eta_g(Z,x)$ and log partition function $A_g$, and $q(\beta)$ with natural parameter $\lambda$ and log partition function $A_l$):

$$= \mathbb{E}_{q(Z,\beta)}\big[\log h(\beta)\big] + \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\eta_g(Z,x)\big] - \mathbb{E}_{q(Z,\beta)}\big[A_g(\eta_g(Z,x))\big] - \mathbb{E}_{q(Z,\beta)}\big[\log h(\beta)\big] - \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\lambda\big] + \mathbb{E}_{q(Z,\beta)}\big[A_l(\lambda)\big]$$

Removing the terms that do not depend on $\lambda$ gives

$$\mathcal{L}(\lambda) = \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\eta_g(Z,x)\big] - \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\lambda\big] + A_l(\lambda)$$

(the expectation of $A_l(\lambda)$ is just $A_l(\lambda)$ itself, since it does not depend on $Z$ or $\beta$).

EqT(x)=ηA(η) E q T ( x ) = ∇ η A ( η ) 带入,易得
λ=E(qϕ[ηg(x,Z,α)]) λ = E ( q ϕ [ η g ( x , Z , α ) ] ) 即除当前参数以外的所有参数的期望。
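Spelling out the gradient step skipped above: substituting $\mathbb{E}_{q(\beta)}[T(\beta)]=\nabla_\lambda A_l(\lambda)$ turns the objective into $\mathcal{L}(\lambda)=\nabla_\lambda A_l(\lambda)^{\top}\big(\mathbb{E}_{q_\phi}[\eta_g(x,Z,\alpha)]-\lambda\big)+A_l(\lambda)$, whose gradient is

$$\nabla_\lambda \mathcal{L}(\lambda)=\nabla_\lambda^{2}A_l(\lambda)\,\big(\mathbb{E}_{q_\phi}[\eta_g(x,Z,\alpha)]-\lambda\big)=0 \;\Longrightarrow\; \lambda=\mathbb{E}_{q_\phi}\big[\eta_g(x,Z,\alpha)\big],$$

since the Hessian $\nabla_\lambda^{2}A_l(\lambda)$ is positive definite for a minimal exponential family.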

### Gaussian Mixture Models (GMMs): EM Algorithm versus Variational Inference

In machine learning, both the Expectation-Maximization (EM) algorithm and variational inference serve as powerful tools for parameter estimation in probabilistic models such as Gaussian mixture models (GMMs). However, the two methods differ significantly in how they handle uncertainty.

#### The Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative method used primarily when dealing with incomplete data or latent variables. It alternates between two steps until convergence:

- **E-step**: Compute the expected value of the log-likelihood with respect to the unobserved data, given the current parameter estimates.
- **M-step**: Maximize this expectation over the parameters to find new values that increase the probability of the training set[^2].

For GMMs specifically, in each iteration the E-step computes responsibilities indicating how likely each point is to belong to each cluster, while the M-step updates the means, covariances, and mixing coefficients based on those probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X_train = np.random.randn(500, 2)   # placeholder data; substitute your own features
gmm_em = GaussianMixture(n_components=3, covariance_type='full')
gmm_em.fit(X_train)
```

#### Variational Inference Approach

Variational inference takes a different path: it approximates complex posterior distributions through optimization rather than sampling techniques such as Markov chain Monte Carlo (MCMC). The approximation works by constructing a simpler family of densities, often called the "variational distribution", and finding the member of that family closest to the true posterior in Kullback-Leibler divergence[^1].

When applied to GMMs, instead of directly computing exact posteriors, which may be computationally prohibitive for high-dimensional or large datasets, one defines a parametric form q(z|x), where z represents the hidden states and x the observed features, and then optimizes its parameters so that KL[q||p] becomes as small as possible under the chosen constraints. (A runnable scikit-learn sketch appears after the related questions below.)

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

num_clusters, dim, alpha = 3, 2, 1.0   # hypothetical model sizes and Dirichlet concentration

model = tfd.JointDistributionSequential([
    # Prior over the mixing weights: p(pi)
    tfd.Dirichlet(concentration=[alpha] * num_clusters),
    # Prior over the cluster means, one standard normal per cluster
    lambda pi: tfd.Sample(
        tfd.Normal(loc=tf.zeros([dim]), scale=tf.ones([dim])),
        sample_shape=num_clusters,
        name="means",
    ),
])
```

#### Key Differences & Applications

While both approaches aim to infer unknown quantities from noisy observations, they exhibit distinct characteristics that make them suitable for different scenarios:

- **Computational efficiency:** EM generally converges faster but can get stuck in local optima more easily, whereas VI's broader search over a family of distributions can sometimes lead to better solutions at the cost of slower computation.
- **Flexibility:** Because traditional EM implementations rely on specific assumptions about the underlying structure, they are less flexible with respect to changes in the model specification, whereas Bayesian nonparametrics paired with VI offer greater adaptability without sacrificing much performance.
- **Uncertainty quantification:** A significant advantage of VI is that it provides full densities over the learned parameters, enabling richer interpretations than the point estimates typically produced by the maximum-likelihood estimators used inside standard EM procedures.

Related questions:

1. How does the choice between EM and VI impact real-world applications involving massive datasets?
2. Can you provide examples illustrating situations favoring either technique over the other?
3. What modifications could enhance classical EM's robustness against poor initialization issues commonly encountered?
4. Are there hybrid strategies combining the strengths of both methodologies worth exploring further?
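For completeness, here is a runnable sketch (not from the original text) that fits the same kind of mixture two ways with scikit-learn: `GaussianMixture` uses EM, while `BayesianGaussianMixture` uses variational inference with a Dirichlet prior over the mixing weights. The synthetic data is a placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

# Placeholder data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(200, 2)),
               rng.normal(+3, 1, size=(200, 2))])

# EM: point estimates of the mixture parameters.
em = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

# Variational inference: extra components can be pruned via the weight prior.
vi = BayesianGaussianMixture(n_components=3, covariance_type='full',
                             weight_concentration_prior=0.01, random_state=0).fit(X)

print("EM weights:", np.round(em.weights_, 3))
print("VI weights:", np.round(vi.weights_, 3))   # unused components shrink toward zero
```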