Variational Inference

Variational inference approximates a complex distribution $p$ with a simpler distribution $q$.

That one sentence is the most direct way to understand variational inference. So why would we choose variational inference in the first place?

Because in most cases the posterior distribution is intractable. If the posterior were easy to compute, we could simply use EM. When the posterior is hard to compute, we instead choose a simple distribution $q$ to approximate the complex posterior. In particular, if $q$ is chosen from the exponential family, the required integrals become much easier to evaluate.
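Concretely, the difficulty is the normalizing constant in Bayes' rule: the posterior requires integrating the joint over all latent variables, and this integral is rarely available in closed form:

$$p(Z \mid X) \;=\; \frac{p(X, Z)}{p(X)} \;=\; \frac{p(X, Z)}{\int p(X, Z)\, dZ}$$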

Evidence Lower Bound (ELBO)

From Bayes' rule, $\ln p(X) = \ln p(X,Z) - \ln p(Z\mid X)$. Continuing the derivation:

$$\ln p(X) = \ln p(X,Z) - \ln p(Z\mid X)$$

$$\ln p(X) = \ln p(X,Z) - \ln q(Z) - \big(\ln p(Z\mid X) - \ln q(Z)\big)$$

$$\ln p(X) = \ln\frac{p(X,Z)}{q(Z)} - \ln\frac{p(Z\mid X)}{q(Z)}$$

Since $q(Z)$ is a probability density, integrate both sides against it:

$$\ln p(X) = \int q(Z)\,\ln\frac{p(X,Z)}{q(Z)}\,dZ - \int q(Z)\,\ln\frac{p(Z\mid X)}{q(Z)}\,dZ$$

$$\ln p(X) = \int q(Z)\,\ln p(X,Z)\,dZ - \int q(Z)\,\ln q(Z)\,dZ - \int q(Z)\,\ln\frac{p(Z\mid X)}{q(Z)}\,dZ$$

Here the first part, $\int q(Z)\ln p(X,Z)\,dZ - \int q(Z)\ln q(Z)\,dZ$, is $\mathcal{L}(q)$, the ELBO; the second part, $-\int q(Z)\ln\frac{p(Z\mid X)}{q(Z)}\,dZ = \int q(Z)\ln\frac{q(Z)}{p(Z\mid X)}\,dZ$, is $\mathrm{KL}(q\,\|\,p)$, so $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$.

What we do is keep updating $q(Z)$ so that it gets closer and closer to $p(Z\mid X)$. Since $\ln p(X)$ is fixed with respect to $q$, the smaller the KL divergence between the two distributions, the larger the lower bound $\mathcal{L}(q)$ becomes; maximizing the ELBO is therefore equivalent to minimizing $\mathrm{KL}(q\,\|\,p)$.
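As a quick sanity check of the identity $\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p)$, here is a minimal sketch with a discrete latent variable, where everything can be enumerated exactly; the joint table and the choice of $q$ are made up purely for illustration.

```python
import numpy as np

# Hypothetical joint p(x0, z) over a discrete latent z with 3 states (made-up numbers).
p_xz = np.array([0.10, 0.25, 0.05])        # p(X = x0, Z = z) for z = 0, 1, 2
p_x = p_xz.sum()                            # evidence p(X = x0)
p_z_given_x = p_xz / p_x                    # exact posterior p(Z | X = x0)

q = np.array([0.2, 0.5, 0.3])               # an arbitrary approximating distribution q(Z)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))        # E_q[ln p(x0, Z) - ln q(Z)]
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))   # KL(q || p(Z | x0))

print(np.log(p_x), elbo + kl)                # the two values coincide
```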

The next question: how should we choose $q(Z)$?

Choosing $q(Z)$

Option 1: assume the dimensions of $Z$ are independent (the mean-field assumption), $q(Z)=\prod_{i=1}^{M} q_i(Z_i)$.

$$\mathcal{L}(q) = \int q(Z)\,\ln p(X,Z)\,dZ - \int q(Z)\,\ln q(Z)\,dZ$$

$$= \int \prod_{i=1}^{M} q_i(Z_i)\,\ln p(X,Z)\,dZ - \int \prod_{i=1}^{M} q_i(Z_i)\,\ln \prod_{i=1}^{M} q_i(Z_i)\,dZ$$

Consider the first term:

$$\int \prod_{i=1}^{M} q_i(Z_i)\,\ln p(X,Z)\,dZ$$

$$= \int_{z_1}\int_{z_2}\cdots\int_{z_M} \prod_{i=1}^{M} q_i(Z_i)\,\ln p(X,Z)\,dz_1\,dz_2\cdots dz_M$$

$$= \int_{z_j} q_j(Z_j)\left(\int\cdots\int \prod_{i\neq j}^{M} q_i(Z_i)\,\ln p(X,Z)\,\prod_{i\neq j}^{M} dZ_i\right) dZ_j$$

$$= \int_{z_j} q_j(Z_j)\left(\int\cdots\int \ln p(X,Z)\,\prod_{i\neq j}^{M} q_i(Z_i)\,dZ_i\right) dZ_j$$

$$= \int_{z_j} q_j(Z_j)\,\mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big]\,dZ_j$$

Now consider the second term. Before deriving it, note the following identity (integrating the joint density leaves the marginals):

$$\int_{x_1}\int_{x_2} \big(f(x_1)+f(x_2)\big)\,p(x_1,x_2)\,dx_1\,dx_2$$

$$= \int_{x_1} f(x_1)\,p(x_1)\,dx_1 + \int_{x_2} f(x_2)\,p(x_2)\,dx_2$$

Extending this to $M$ terms gives the second term:

$$\int \prod_{i=1}^{M} q_i(Z_i)\,\ln \prod_{i=1}^{M} q_i(Z_i)\,dZ$$

$$= \int \prod_{i=1}^{M} q_i(Z_i)\,\sum_{i=1}^{M}\ln q_i(Z_i)\,dZ$$

$$= \sum_{i=1}^{M}\int_{z_i} q_i(Z_i)\,\ln q_i(Z_i)\,dZ_i$$

If we only care about the $j$-th factor, the second term can be written as $\int_{Z_j} q_j(Z_j)\,\ln q_j(Z_j)\,dZ_j + \text{const}$.

Putting the two terms together, as a function of $q_j$ only:

$$\mathcal{L}(q_j) = \int_{z_j} q_j(Z_j)\,\mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big]\,dZ_j - \int_{Z_j} q_j(Z_j)\,\ln q_j(Z_j)\,dZ_j + \text{const}$$
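To make the next step explicit, define a distribution $\tilde{p}_j$ by $\ln \tilde{p}_j(Z_j) = \mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big] + \text{const}$ (the constant normalizes it). The expression above is then a negative KL divergence up to a constant:

$$\mathcal{L}(q_j) = \int_{z_j} q_j(Z_j)\,\ln\frac{\tilde{p}_j(Z_j)}{q_j(Z_j)}\,dZ_j + \text{const} = -\,\mathrm{KL}\big(q_j\,\|\,\tilde{p}_j\big) + \text{const}$$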

This is maximized when the KL divergence is zero, i.e. when $q_j = \tilde{p}_j$, so the update formula for each factor is:

$$\ln q_j^{*}(Z_j) = \mathbb{E}_{i\neq j}\big[\ln p(X,Z)\big] + \text{const}$$
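As an illustration of this coordinate-ascent update, here is a minimal sketch (not from the original post) for the classic model of Bishop's PRML, section 10.1.3: data $x_n \sim \mathcal{N}(\mu, \tau^{-1})$ with priors $\mu \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, under the mean-field factorization $q(\mu,\tau)=q(\mu)\,q(\tau)$. The hyperparameter values and synthetic data below are placeholders.

```python
import numpy as np

# CAVI for x_n ~ N(mu, 1/tau) with priors mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0),
# using the mean-field factorization q(mu, tau) = q(mu) q(tau).
# Each update below is the closed form of ln q*_j = E_{i != j}[ln p(X, Z)] + const.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)      # synthetic data (placeholder)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0            # prior hyperparameters (placeholders)
E_tau = a0 / b0                                    # initial guess for E_q[tau]

for _ in range(50):
    # q*(mu) = N(mu_N, 1/lam_N): expectation of ln p taken over q(tau)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q*(tau) = Gamma(a_N, b_N): expectation of ln p taken over q(mu)
    a_N = a0 + (N + 1) / 2.0
    E_sum_sq = np.sum((x - mu_N) ** 2) + N / lam_N          # E_q[sum_n (x_n - mu)^2]
    E_prior_sq = lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)   # E_q[lam0 (mu - mu0)^2]
    b_N = b0 + 0.5 * (E_sum_sq + E_prior_sq)
    E_tau = a_N / b_N

print("E_q[mu] ~", mu_N, "  E_q[tau] ~", E_tau)
```

Both conditionals are conjugate here, which is why each $\ln q_j^{*}$ update has a closed form; with the placeholder data above, $\mathbb{E}_q[\mu]$ converges to roughly the sample mean and $\mathbb{E}_q[\tau]$ to roughly the inverse sample variance.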

Option 2: exponential family distributions

First we need the notion of an exponential family. Its standard form is

$$h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)$$

where $\eta$ is the natural parameter (canonical parameter) of the distribution, $T(x)$ is the sufficient statistic (usually $T(x)=x$), and $A(\eta)$ is the log partition function.
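For example (a standard fact, not from the original post), the Bernoulli distribution $p(x\mid\pi)=\pi^{x}(1-\pi)^{1-x}$, $x\in\{0,1\}$, fits this template with

$$h(x)=1,\qquad T(x)=x,\qquad \eta=\ln\frac{\pi}{1-\pi},\qquad A(\eta)=\ln\big(1+e^{\eta}\big),$$

so that $p(x\mid\eta)=\exp\big(x\,\eta-\ln(1+e^{\eta})\big)$.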

For exponential families, we first state a key identity, $\mathbb{E}_q[T(x)] = \nabla_\eta A(\eta)$, proved as follows:

$$\int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,dx = 1$$

$$\nabla_\eta\left(\int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,dx\right) = 0$$

$$\Rightarrow \int \nabla_\eta\, h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,dx = 0$$

$$\Rightarrow \int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,T(x)\,dx - \int h(x)\exp\big(T(x)^{\top}\eta - A(\eta)\big)\,\nabla_\eta A(\eta)\,dx = 0$$

In the second term, the factor in front of $\nabla_\eta A(\eta)$ is a probability density that integrates to 1, so

$$\Rightarrow \mathbb{E}_q[T(x)] = \nabla_\eta A(\eta)$$
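A quick numerical check of this identity for the Bernoulli example above (a sketch; the particular value of $\eta$ is arbitrary):

```python
import numpy as np

eta = 0.7                                    # arbitrary natural parameter (placeholder)
A = lambda e: np.log1p(np.exp(e))            # Bernoulli log partition function

# Left-hand side: E_q[T(x)] = E[x], estimated by Monte Carlo sampling.
p = 1.0 / (1.0 + np.exp(-eta))               # mean parameter sigmoid(eta)
samples = np.random.default_rng(0).binomial(1, p, size=200_000)
lhs = samples.mean()

# Right-hand side: numerical gradient of A at eta.
eps = 1e-5
rhs = (A(eta + eps) - A(eta - eps)) / (2 * eps)

print(lhs, rhs)                              # both are approximately sigmoid(0.7) = 0.668
```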

As before, we want to maximize the ELBO. The chosen $q$ still factorizes as $q(\beta, Z) = q(\beta)\,q(Z)$ (write $\lambda$ for the natural parameter of $q(\beta)$ and $\phi$ for that of $q(Z)$). When updating, we fix one factor and optimize the other, as follows.

To compute $\lambda$, keep only the terms involving $\beta$:

$$\mathcal{L}(\lambda,\phi) = \mathbb{E}_{q(Z,\beta)}\big[\log p(x,Z,\beta)\big] - \mathbb{E}_{q(Z,\beta)}\big[\log q(Z,\beta)\big]$$

$$= \mathbb{E}_{q(Z,\beta)}\big[\log p(\beta\mid Z,x) + \log p(Z,x)\big] - \mathbb{E}_{q(Z)}\big[\log q(Z)\big] - \mathbb{E}_{q(\beta)}\big[\log q(\beta)\big]$$

Dropping the terms that are irrelevant to the current update:

$$= \mathbb{E}_{q(Z,\beta)}\big[\log p(\beta\mid Z,x)\big] - \mathbb{E}_{q(\beta)}\big[\log q(\beta)\big]$$

Substituting the exponential family form for both parts (the complete conditional $p(\beta\mid Z,x)$ with natural parameter $\eta_g(Z,x)$ and log partition function $A_g$, and $q(\beta)$ with natural parameter $\lambda$ and log partition function $A_l$):

$$= \mathbb{E}_{q(Z,\beta)}\big[\log h(\beta)\big] + \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\eta_g(Z,x)\big] - \mathbb{E}_{q(Z,\beta)}\big[A_g(\eta_g(Z,x))\big] - \mathbb{E}_{q(Z,\beta)}\big[\log h(\beta)\big] - \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\lambda\big] + \mathbb{E}_{q(Z,\beta)}\big[A_l(\lambda)\big]$$

Removing the terms that do not depend on $\lambda$ gives

$$\mathcal{L}(\lambda) = \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\eta_g(Z,x)\big] - \mathbb{E}_{q(Z,\beta)}\big[T(\beta)^{\top}\lambda\big] + A_l(\lambda)$$

(the expectation of $A_l(\lambda)$ is just $A_l(\lambda)$ itself, since it does not depend on $Z$ or $\beta$).

EqT(x)=ηA(η) E q T ( x ) = ∇ η A ( η ) 带入,易得
λ=E(qϕ[ηg(x,Z,α)]) λ = E ( q ϕ [ η g ( x , Z , α ) ] ) 即除当前参数以外的所有参数的期望。
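Spelling out the gradient step skipped above: substituting $\mathbb{E}_{q(\beta)}[T(\beta)]=\nabla_\lambda A_l(\lambda)$ turns the objective into $\mathcal{L}(\lambda)=\nabla_\lambda A_l(\lambda)^{\top}\big(\mathbb{E}_{q_\phi}[\eta_g(x,Z,\alpha)]-\lambda\big)+A_l(\lambda)$, whose gradient is

$$\nabla_\lambda \mathcal{L}(\lambda)=\nabla_\lambda^{2}A_l(\lambda)\,\big(\mathbb{E}_{q_\phi}[\eta_g(x,Z,\alpha)]-\lambda\big)=0 \;\Longrightarrow\; \lambda=\mathbb{E}_{q_\phi}\big[\eta_g(x,Z,\alpha)\big],$$

since the Hessian $\nabla_\lambda^{2}A_l(\lambda)$ is positive definite for a minimal exponential family.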

### Gaussian Mixture Models (GMMs): EM Algorithm versus Variational Inference

In machine learning, both the Expectation-Maximization (EM) algorithm and variational inference serve as powerful tools for parameter estimation in probabilistic models such as Gaussian mixture models (GMMs). However, the two methods differ significantly in how they handle uncertainty.

#### The Expectation-Maximization (EM) Algorithm

The EM algorithm is an iterative method used primarily when dealing with incomplete data or latent variables. It alternates between two steps until convergence:

- **E-step**: Compute the expected value of the log-likelihood with respect to the unobserved data, given the current parameter estimates.
- **M-step**: Maximize this expectation over the parameters to find new values that increase the probability of the training set[^2].

For GMMs specifically, in each iteration the E-step computes responsibilities indicating how likely each point is to belong to each cluster, while the M-step updates the means, covariances, and mixing coefficients based on those probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X_train = np.random.randn(500, 2)   # placeholder data; substitute your own features
gmm_em = GaussianMixture(n_components=3, covariance_type='full')
gmm_em.fit(X_train)
```

#### Variational Inference Approach

Variational inference takes a different path: it approximates complex posterior distributions through optimization rather than sampling techniques such as Markov chain Monte Carlo (MCMC). The approximation works by constructing a simpler family of densities, often called the "variational distribution", and finding the member of that family closest to the true posterior in Kullback-Leibler divergence[^1].

When applied to GMMs, instead of directly computing exact posteriors, which may be computationally prohibitive for high-dimensional or large datasets, one defines a parametric form q(z|x), where z represents the hidden states and x the observed features, and then optimizes its parameters so that KL[q||p] becomes as small as possible under the chosen constraints. (A runnable scikit-learn sketch appears after the related questions below.)

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

num_clusters, dim, alpha = 3, 2, 1.0   # hypothetical model sizes and Dirichlet concentration

model = tfd.JointDistributionSequential([
    # Prior over the mixing weights: p(pi)
    tfd.Dirichlet(concentration=[alpha] * num_clusters),
    # Prior over the cluster means, one standard normal per cluster
    lambda pi: tfd.Sample(
        tfd.Normal(loc=tf.zeros([dim]), scale=tf.ones([dim])),
        sample_shape=num_clusters,
        name="means",
    ),
])
```

#### Key Differences & Applications

While both approaches aim to infer unknown quantities from noisy observations, they exhibit distinct characteristics that make them suitable for different scenarios:

- **Computational efficiency:** EM generally converges faster but can get stuck in local optima more easily, whereas VI's broader search over a family of distributions can sometimes lead to better solutions at the cost of slower computation.
- **Flexibility:** Because traditional EM implementations rely on specific assumptions about the underlying structure, they are less flexible with respect to changes in the model specification, whereas Bayesian nonparametrics paired with VI offer greater adaptability without sacrificing much performance.
- **Uncertainty quantification:** A significant advantage of VI is that it provides full densities over the learned parameters, enabling richer interpretations than the point estimates typically produced by the maximum-likelihood estimators used inside standard EM procedures.

Related questions:

1. How does the choice between EM and VI impact real-world applications involving massive datasets?
2. Can you provide examples illustrating situations favoring either technique over the other?
3. What modifications could enhance classical EM's robustness against poor initialization issues commonly encountered?
4. Are there hybrid strategies combining the strengths of both methodologies worth exploring further?
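For completeness, here is a runnable sketch (not from the original text) that fits the same kind of mixture two ways with scikit-learn: `GaussianMixture` uses EM, while `BayesianGaussianMixture` uses variational inference with a Dirichlet prior over the mixing weights. The synthetic data is a placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

# Placeholder data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, size=(200, 2)),
               rng.normal(+3, 1, size=(200, 2))])

# EM: point estimates of the mixture parameters.
em = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)

# Variational inference: extra components can be pruned via the weight prior.
vi = BayesianGaussianMixture(n_components=3, covariance_type='full',
                             weight_concentration_prior=0.01, random_state=0).fit(X)

print("EM weights:", np.round(em.weights_, 3))
print("VI weights:", np.round(vi.weights_, 3))   # unused components shrink toward zero
```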