Before reading this article, it is recommended to first read the following two articles:
Training Loss
Training Objective
First, let us review the problem: in the reverse denoising process we cannot obtain $q(\mathbf{x}_{t-1} \vert \mathbf{x}_{t})$, so we defined a model to be learned, $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$, to approximate it; during training, the posterior $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$ can be used to optimize $p_\theta$.
The question now is: how do we optimize $p_\theta$ to obtain the ideal $\boldsymbol{\mu}_\theta$ and $\boldsymbol{\Sigma}_\theta$? As with a VAE, we can minimize the negative log-likelihood of the model under the real data distribution, i.e., minimize the cross-entropy between $p_{\mathrm{data}} = q(\mathbf{x}_0)$ and $p_\theta(\mathbf{x}_0)$:
$$
\mathcal{L}=\mathbb{E}_{\mathbf{x}_{0} \sim q\left(\mathbf{x}_{0}\right)}\left[-\log p_{\theta}\left(\mathbf{x}_{0}\right)\right] \tag{1}
$$
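As a side illustration (this toy code is not from the original article), Eq. (1) is just the usual maximum-likelihood objective: when $p_\theta$ is a tractable density, for example a 1-D Gaussian with learnable mean and scale, the cross-entropy can be estimated by Monte Carlo over samples from $q(\mathbf{x}_0)$ and minimized directly by gradient descent.

```python
# Toy sketch: minimize E_{x0 ~ q(x0)}[-log p_theta(x0)] when p_theta is a
# tractable 1-D Gaussian. For a diffusion model p_theta(x0) has no closed
# form, which is exactly the problem the rest of this section works around.
import torch

torch.manual_seed(0)
x0 = torch.randn(10_000) * 2.0 + 3.0          # samples from q(x0): N(3, 2^2)

mu = torch.zeros(1, requires_grad=True)        # model parameters theta
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    loss = -dist.log_prob(x0).mean()           # Monte Carlo estimate of Eq. (1)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())       # should approach (3.0, 2.0)
```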
However, we have no closed-form expression for $p_\theta(\mathbf{x}_0)$, so the cross-entropy in Eq. (1) cannot be computed directly. Using Eqs. 2-6 of Diffusion Model(2):前向扩散过程和逆向降噪过程, we can carry out some algebra to rewrite the $p_\theta(\mathbf{x}_0)$ in Eq. (1) in terms of known quantities:
$$
\begin{aligned}
\mathcal{L} &=-\mathbb{E}_{q\left(\mathbf{x}_{0}\right)} \log p_{\theta}\left(\mathbf{x}_{0}\right) \\
&=-\mathbb{E}_{q\left(\mathbf{x}_{0}\right)} \log \left(\int p_{\theta}\left(\mathbf{x}_{0: T}\right) d \mathbf{x}_{1: T}\right) \\
&=-\mathbb{E}_{q\left(\mathbf{x}_{0}\right)} \log \left(\int q\left(\mathbf{x}_{1: T} \vert \mathbf{x}_{0}\right) \frac{p_{\theta}\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \vert \mathbf{x}_{0}\right)} d \mathbf{x}_{1: T}\right) \\
&=-\mathbb{E}_{q\left(\mathbf{x}_{0}\right)} \log \left(\mathbb{E}_{q\left(\mathbf{x}_{1: T} \vert \mathbf{x}_{0}\right)} \frac{p_{\theta}\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \vert \mathbf{x}_{0}\right)}\right) \\
& \leq-\mathbb{E}_{q\left(\mathbf{x}_{0: T}\right)} \log \frac{p_{\theta}\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \vert \mathbf{x}_{0}\right)} \\
&=\mathbb{E}_{q\left(\mathbf{x}_{0: T}\right)}\left[\log \frac{q\left(\mathbf{x}_{1: T} \vert \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{0: T}\right)}\right]=\mathcal{L}_{\mathrm{VLB}}
\end{aligned} \tag{2}
$$
In the expression above, $q(\mathbf{x}_0)$ is the real data distribution and $p_\theta(\mathbf{x}_0)$ is the model. From the fourth line to the fifth line we applied Jensen's inequality $\log \mathbb{E}[f(x)] \geq \mathbb{E}[\log f(x)]$ (equivalently, $-\log \mathbb{E}[f(x)] \leq -\mathbb{E}[\log f(x)]$ after negation) and merged the expectation over $q(\mathbf{x}_0)$ with the expectation over $q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)$ into a single expectation over $q(\mathbf{x}_{0:T})$.
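The direction of this inequality is easy to verify numerically. The short check below (illustrative only, not from the original article) draws samples of a positive random variable and compares $\log\mathbb{E}[X]$ against $\mathbb{E}[\log X]$.

```python
# Numerical sanity check of the Jensen step: for the concave log function,
# log E[X] >= E[log X], hence -log E[X] <= -E[log X] as used above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=1_000_000)   # any positive random variable

lhs = np.log(x.mean())        # log E[X]
rhs = np.log(x).mean()        # E[log X]
print(lhs, rhs, lhs >= rhs)   # True: log E[X] >= E[log X]
print(-lhs <= -rhs)           # True: -log E[X] <= -E[log X]
```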
To minimize this loss, Eq. (2) tells us we can instead minimize its upper bound $\mathcal{L}_{\mathrm{VLB}}$:
$$
\begin{aligned}
\mathcal{L}_{\mathrm{VLB}} &= \mathbb{E}_{q\left(\mathbf{x}_{0: T}\right)}\left[\log \frac{q\left(\mathbf{x}_{1: T} \vert \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{0: T}\right)}\right] \\
&= \mathbb{E}_{q}\left[\log \frac{\prod_{t=1}^{T} q\left(\mathbf{x}_{t} \vert \mathbf{x}_{t-1}\right)}{p_{\theta}\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} p_{\theta}\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}\right)}\right] \\
&= \mathbb{E}_{q}\left[-\log p_{\theta}\left(\mathbf{x}_{T}\right)+\sum_{t=1}^{T} \log \frac{q\left(\mathbf{x}_{t} \vert \mathbf{x}_{t-1}\right)}{p_{\theta}\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}\right)}\right] \\
&= \mathbb{E}_{q}\left[-\log p_{\theta}\left(\mathbf{x}_{T}\right)+\sum_{t=2}^{T} \log \frac{q\left(\mathbf{x}_{t} \vert \mathbf{x}_{t-1}\right)}{p_{\theta}\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}\right)}+\log \frac{q\left(\mathbf{x}_{1} \vert \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{0} \vert \mathbf{x}_{1}\right)}\right] \\
&= \mathbb{E}_{q}\left[-\log p_{\theta}\left(\mathbf{x}_{T}\right)+\sum_{t=2}^{T} \log \left(\frac{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}, \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}\right)} \cdot \frac{q\left(\mathbf{x}_{t} \vert \mathbf{x}_{0}\right)}{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{0}\right)}\right)+\log \frac{q\left(\mathbf{x}_{1} \vert \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{0} \vert \mathbf{x}_{1}\right)}\right] \\
&= \mathbb{E}_{q}\left[-\log p_{\theta}\left(\mathbf{x}_{T}\right)+\sum_{t=2}^{T} \log \frac{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}, \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}\right)}+\sum_{t=2}^{T} \log \frac{q\left(\mathbf{x}_{t} \vert \mathbf{x}_{0}\right)}{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{0}\right)}+\log \frac{q\left(\mathbf{x}_{1} \vert \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{0} \vert \mathbf{x}_{1}\right)}\right] \\
&= \mathbb{E}_{q}\left[-\log p_{\theta}\left(\mathbf{x}_{T}\right)+\sum_{t=2}^{T} \log \frac{q\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}, \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{t-1} \vert \mathbf{x}_{t}\right)}+\log \frac{q\left(\mathbf{x}_{T} \vert \mathbf{x}_{0}\right)}{q\left(\mathbf{x}_{1} \vert \mathbf{x}_{0}\right)}+\log \frac{q\left(\mathbf{x}_{1} \vert \mathbf{x}_{0}\right)}{p_{\theta}\left(\mathbf{x}_{0} \vert \mathbf{x}_{1}\right)}\right]
\end{aligned} \tag{3}
$$
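Once the expectation over $\mathbf{x}_{t-1}$ is taken, each summand $\log\frac{q(\mathbf{x}_{t-1}\vert\mathbf{x}_t,\mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1}\vert\mathbf{x}_t)}$ becomes a KL divergence between two Gaussians, which has a closed form. The sketch below is an illustrative assumption of this note, not the article's code; the shapes and names are hypothetical, and it only shows how such a diagonal-Gaussian KL term can be evaluated element-wise.

```python
# Closed-form KL between two diagonal Gaussians, as appears in each
# per-timestep term of L_VLB (posterior q vs. learned reverse model p_theta).
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, per element."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q - logvar_p).exp()
        + (mu_q - mu_p).pow(2) / logvar_p.exp()
        - 1.0
    )

# Hypothetical shapes: a batch of 4 examples, each flattened to 3072 dimensions.
mu_q, logvar_q = torch.randn(4, 3072), torch.zeros(4, 3072)   # posterior q(x_{t-1} | x_t, x_0)
mu_p, logvar_p = torch.randn(4, 3072), torch.zeros(4, 3072)   # model p_theta(x_{t-1} | x_t)

kl_per_example = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).sum(dim=1)
print(kl_per_example.shape)   # torch.Size([4]): one per-timestep KL term per example
```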