Marginal Likelihood in Bayesian Inference: Variational Inference (Derivation)

Latest recommended article published on 2025-12-02 19:20:40


License: CC 4.0 BY-SA


Background on the marginal likelihood

In the Bayesian framework, Bayes' rule gives:

$p(\theta|\mathcal{D})=\frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}\qquad (1)$

where the denominator

$p(\mathcal{D})=\int p(\mathcal{D}|\theta)p(\theta)d\theta$

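is the marginal likelihood (also called the evidence), which reflects how well the model explains the data. This integral is usually high-dimensional and has no closed form, so in practice it has to be approximated, by either: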

Sampling-based approximation (e.g., MCMC)
Optimization-based approximation (e.g., variational inference)
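
To make the integral for $p(\mathcal{D})$ concrete, here is a minimal sketch on a model where it does have a closed form. The Beta-Bernoulli model, the data counts `k` and `n`, the prior parameters `a` and `b`, and the function names are all illustrative assumptions and do not come from the article. The sketch compares the closed-form evidence with a brute-force grid integration, which is only feasible because $\theta$ is one-dimensional here.

```python
import math
import numpy as np

def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_likelihood_exact(k, n, a, b):
    """Closed-form p(D) for k successes in a fixed sequence of n Bernoulli
    trials under a Beta(a, b) prior: B(a + k, b + n - k) / B(a, b)."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def marginal_likelihood_grid(k, n, a, b, num_points=20001):
    """Approximate p(D) = integral of p(D | theta) p(theta) d(theta)
    with the trapezoidal rule on a uniform grid over [0, 1]."""
    theta = np.linspace(0.0, 1.0, num_points)
    prior = theta ** (a - 1) * (1 - theta) ** (b - 1) / math.exp(log_beta(a, b))
    likelihood = theta ** k * (1 - theta) ** (n - k)
    integrand = likelihood * prior
    return float(np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(theta)))

# Hypothetical data: 7 successes in 10 trials, Beta(2, 2) prior.
exact = marginal_likelihood_exact(k=7, n=10, a=2.0, b=2.0)
approx = marginal_likelihood_grid(k=7, n=10, a=2.0, b=2.0)
print(f"exact p(D)  = {exact:.10e}")
print(f"grid  p(D)  = {approx:.10e}")
```

A grid with this resolution in each of $d$ dimensions would need about $20001^d$ evaluations, which is why sampling or optimization is used once $\theta$ is high-dimensional.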

The basic idea of variational inference

We use a tractable distribution $q(\theta)$ to approximate the posterior $p(\theta|\mathcal{D})$.

We want $q(\theta)$ to be as close as possible to the true posterior, that is, to minimize the KL divergence:

$\mathrm{KL}(q(\theta)\|p(\theta|\mathcal{D}))=\int q(\theta)\log\frac{q(\theta)}{p(\theta|\mathcal{D})}d\theta$
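
As a small illustration of this definition, the sketch below evaluates the KL integral numerically for two univariate Gaussians and compares it with the known closed form. The second Gaussian is only a stand-in for a posterior; the means, standard deviations, grid limits, and function names are assumptions made for this example.

```python
import numpy as np

def gaussian_logpdf(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)

def kl_gaussians_exact(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return float(np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5)

def kl_grid(m1, s1, m2, s2, lo=-30.0, hi=30.0, num_points=200001):
    """Riemann-sum approximation of the integral of q * log(q / p)."""
    x = np.linspace(lo, hi, num_points)
    log_q = gaussian_logpdf(x, m1, s1)
    log_p = gaussian_logpdf(x, m2, s2)
    integrand = np.exp(log_q) * (log_q - log_p)
    return float(np.sum(integrand) * (x[1] - x[0]))

# q = N(0, 1) and p = N(1, 2^2) are arbitrary illustrative choices.
kl_qp = kl_grid(0.0, 1.0, 1.0, 2.0)
kl_pq = kl_grid(1.0, 2.0, 0.0, 1.0)
print("KL(q || p): grid =", kl_qp, " exact =", kl_gaussians_exact(0.0, 1.0, 1.0, 2.0))
print("KL(p || q): grid =", kl_pq, " exact =", kl_gaussians_exact(1.0, 2.0, 0.0, 1.0))
```

The two directions give different values, a reminder that the KL divergence is not symmetric. Variational inference as described here uses the direction $\mathrm{KL}(q\|p)$, with $q$ as the first argument.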

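The posterior $p(\theta|\mathcal{D})$ inside the KL divergence requires the intractable $p(\mathcal{D})$, so we cannot minimize the KL divergence directly. However, $p(\mathcal{D})$ is a constant that depends on neither $\theta$ nor $q$, which lets us restate the problem through the following identity: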

$\log p(\mathcal{D})=\mathcal{L}(q)+\mathrm{KL}(q(\theta)\|p(\theta|\mathcal{D}))\qquad (2)$

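where

$\mathcal{L}(q)=\int q(\theta)\log\frac{p(\mathcal{D},\theta)}{q(\theta)}d\theta$

is the evidence lower bound (ELBO). Because $\log p(\mathcal{D})$ does not depend on $q$ and the KL divergence is non-negative, maximizing the ELBO $\Leftrightarrow$ minimizing the KL divergence.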
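
The identity (2) can be checked numerically. The sketch below assumes a toy setting in which $\theta$ takes only five discrete values, so every integral becomes a finite sum; the prior, the likelihood values, and the function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: theta takes one of 5 discrete values.
prior = np.full(5, 0.2)                                   # p(theta)
likelihood = np.array([0.02, 0.10, 0.30, 0.15, 0.05])     # p(D | theta)

joint = likelihood * prior               # p(D, theta)
evidence = joint.sum()                   # p(D): sum over theta replaces the integral
posterior = joint / evidence             # p(theta | D)
log_evidence = float(np.log(evidence))

def elbo(q):
    """L(q) = sum over theta of q(theta) * log( p(D, theta) / q(theta) )."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def kl(q, p):
    """KL(q || p) = sum over theta of q(theta) * log( q(theta) / p(theta) )."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

# An arbitrary strictly positive q, drawn at random for the check.
q = rng.dirichlet(np.ones(5))

print("log p(D)           =", log_evidence)
print("ELBO(q) + KL(q||p) =", elbo(q) + kl(q, posterior))
print("ELBO(q)            =", elbo(q))
print("ELBO(posterior)    =", elbo(posterior))
```

For any valid `q` the sum of the two terms reproduces $\log p(\mathcal{D})$, and the ELBO reaches $\log p(\mathcal{D})$ exactly when `q` equals the posterior, where the KL term vanishes.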

Derivation

Take the logarithm of both sides of Bayes' rule (1), then take the expectation with respect to $q(\theta)$:

$E_q[\log p(\theta|\mathcal{D})]=E_q[\log p(\mathcal{D},\theta)]-\log p(\mathcal{D})$

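To see that $\mathcal{L}(q)$ is a lower bound, write the log marginal likelihood as an integral of the joint distribution:

$\log p(\mathcal{D})=\log\int p(\mathcal{D},\theta)d\theta$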

Introduce $q(\theta)$ by multiplying and dividing inside the integral:

$\log p(\mathcal{D})=\log\int q(\theta)\frac{p(\mathcal{D},\theta)}{q(\theta)}d\theta$

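Applying Jensen's inequality (the logarithm is concave, so the log of an expectation is at least the expectation of the log):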

$\log p(\mathcal{D})\geq\int q(\theta)\log\frac{p(\mathcal{D},\theta)}{q(\theta)}d\theta=\mathcal{L}(q)$
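
The bound can also be observed on a continuous model in which both sides are available in closed form. The following is a sketch under stated assumptions: a Gaussian likelihood with known standard deviation `sigma`, a zero-mean Gaussian prior with standard deviation `tau`, and a Gaussian variational family $q(\theta)=\mathcal{N}(m, s^2)$. The data values and all names are illustrative and not taken from the article.

```python
import numpy as np

sigma, tau = 1.0, 2.0                      # assumed known noise and prior scales
x = np.array([0.8, 1.3, 0.4, 1.9, 1.1])    # hypothetical observations
n = x.size

def log_evidence(x, sigma, tau):
    """Exact log p(D): marginally, x ~ N(0, sigma^2 I + tau^2 * ones)."""
    n = x.size
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return float(-0.5 * (n * np.log(2 * np.pi) + logdet + quad))

def elbo(m, s, x, sigma, tau):
    """Closed-form ELBO for q = N(m, s^2):
    E_q[log p(D | theta)] + E_q[log p(theta)] + entropy of q."""
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - ((x - m) ** 2 + s**2) / (2 * sigma**2))
    exp_logprior = -0.5 * np.log(2 * np.pi * tau**2) - (m**2 + s**2) / (2 * tau**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return float(exp_loglik + exp_logprior + entropy)

log_z = log_evidence(x, sigma, tau)

# Exact Gaussian posterior of this conjugate model.
post_precision = 1.0 / tau**2 + n / sigma**2
post_mean = (x.sum() / sigma**2) / post_precision
post_sd = float(np.sqrt(1.0 / post_precision))

print("log p(D)          =", log_z)
for m, s in [(0.0, 1.0), (1.0, 0.5), (2.0, 0.1)]:
    print(f"ELBO(m={m}, s={s}) =", elbo(m, s, x, sigma, tau))
print("ELBO at posterior =", elbo(post_mean, post_sd, x, sigma, tau))
```

Every choice of `(m, s)` gives a value below `log_z`, and the gap closes only when `q` coincides with the exact posterior.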

The gap in this inequality is exactly the KL divergence. Recall its definition:

$\mathrm{KL}(q(\theta)\|p(\theta|\mathcal{D}))=\int q(\theta)\log\frac{q(\theta)}{p(\theta|\mathcal{D})}d\theta\qquad (3)$

$\mathrm{KL}(q(\theta)\|p(\theta|\mathcal{D}))=\int q(\theta)\left(\log q(\theta)-\log p(\theta|\mathcal{D})\right)d\theta=\underbrace{\int q(\theta)\log q(\theta)d\theta}_{E_q[\log q(\theta)]}-\underbrace{\int q(\theta)\log p(\theta|\mathcal{D})d\theta}_{E_q[\log p(\theta|\mathcal{D})]}\qquad (4)$

From equation (1), $\log p(\theta|\mathcal{D})=\log p(\mathcal{D},\theta)-\log p(\mathcal{D})$. Substituting this into the second term of equation (4):

$E_q[\log p(\theta|\mathcal{D})]=\int q(\theta)\left(\log p(\mathcal{D},\theta)-\log p(\mathcal{D})\right)d\theta.$

Because $\log p(\mathcal{D})$ does not depend on $\theta$ and $\int q(\theta)d\theta=1$, we obtain:

$E_q[\log p(\theta|\mathcal{D})]=E_q[\log p(\mathcal{D},\theta)]-\log p(\mathcal{D})\qquad (5)$

Substituting equation (5) into the expansion of the KL divergence in equation (4) and rearranging gives:

$\mathrm{KL}(q\|p)=\log p(\mathcal{D})-\left(E_q[\log p(\mathcal{D},\theta)]-E_q[\log q(\theta)]\right)\qquad (6)$