Mathematical Derivation of Diffusion Models (with Detailed Annotations)

1. Basic Concepts

A diffusion model is a generative model built on a Markov chain, consisting of two core processes:

1.1 Diffusion Process (Forward Process)

Gaussian noise is added to the data step by step, gradually transforming the data distribution into a standard normal distribution:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

1.2 Reverse Diffusion Process (Reverse Process)

Learn to reconstruct the original data from noise:

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

2. Detailed Derivation of the Forward Process

2.1 Single-Step Diffusion

Each diffusion step is defined as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \boldsymbol{I}\right)$$

Using the reparameterization trick, this can be written as:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1}, \quad \alpha_t = 1-\beta_t$$

2.2 Closed Form at an Arbitrary Timestep

Expanding the recursion yields:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ and $\epsilon \sim \mathcal{N}(0, \boldsymbol{I})$.

Recursive derivation:
$$\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2}\right) + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t\left(1-\alpha_{t-1}\right)}\,\epsilon_{t-2} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon}_{t-2} \\
&\;\;\vdots \\
&= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\end{aligned}$$
Note: $\epsilon_{t-1}$ and $\epsilon_{t-2}$ are independent Gaussian variables, so the two noise terms can be merged into a single Gaussian $\bar{\epsilon}_{t-2}$ with standard deviation $\sqrt{1-\alpha_t \alpha_{t-1}}$ (the variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$). This uses the fact that for two independent normal variables $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ and real numbers $a$, $b$:
$$aX_1 + bX_2 \sim N\left(a\mu_1 + b\mu_2,\; a^2\sigma_1^2 + b^2\sigma_2^2\right)$$
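This merging step is easy to check numerically. Below is a minimal NumPy sketch (an illustration added here, not code from the original post; it assumes a linear $\beta$ schedule with typical DDPM values) that iterates the single-step update and confirms the marginal of $x_t$ matches the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (typical DDPM values)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x0 = 1.5          # a fixed scalar "data point"
n = 200_000       # number of Monte Carlo chains
t = 400           # timestep to check

# Iterate x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps for t steps
x = np.full(n, x0)
for s in range(t):
    x = np.sqrt(alphas[s]) * x + np.sqrt(1.0 - alphas[s]) * rng.standard_normal(n)

print("empirical mean:", x.mean(), " closed form:", np.sqrt(alpha_bars[t - 1]) * x0)
print("empirical var: ", x.var(),  " closed form:", 1.0 - alpha_bars[t - 1])
```

Both pairs agree up to Monte Carlo error, which is exactly what the merged-Gaussian argument predicts.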

2.3 Limiting Behavior

As $T \to \infty$: $\bar{\alpha}_t \to 0 \;\Rightarrow\; x_T \to \mathcal{N}(0, \boldsymbol{I})$, i.e., the forward process turns any data distribution into (approximately) pure Gaussian noise.
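As a quick sanity check (again a sketch, under the same assumed linear schedule as above), $\bar{\alpha}_T$ is already negligible at $T = 1000$:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar_T = np.prod(1.0 - betas)
print(alpha_bar_T)  # ~4e-05: x_T retains essentially no signal from x_0
```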

3. Derivation of the Reverse Process

If we could reverse the process above and sample from $q(x_{t-1} \mid x_t)$, we could start from Gaussian noise $x_T \sim \mathcal{N}(0, \boldsymbol{I})$ and recover the original data distribution $x_0 \sim q(x)$. Since $q(x_{t-1} \mid x_t)$ cannot be derived directly, we train a neural network to predict/fit this reverse distribution, $p_\theta(x_{t-1} \mid x_t)$.

3.1 The True Reverse Distribution

When $x_0$ is known, the reverse distribution can be solved for analytically:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\; \tilde{\mu}_t,\; \tilde{\beta}_t \boldsymbol{I}\right)$$

where $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t\right)$ and $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \cdot \beta_t$.

3.2 Derivation

$$\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &= \frac{q(x_t, x_0, x_{t-1})}{q(x_t, x_0)} = \frac{q(x_0)\,q(x_{t-1} \mid x_0)\,q(x_t \mid x_{t-1}, x_0)}{q(x_0)\,q(x_t \mid x_0)} \\
&= q(x_t \mid x_{t-1}, x_0)\,\frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} = q(x_t \mid x_{t-1})\,\frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} \\
&\propto \exp\left(-\frac{1}{2}\left(\frac{\left(x_t - \sqrt{\alpha_t}\,x_{t-1}\right)^2}{\beta_t} + \frac{\left(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0\right)^2}{1-\bar{\alpha}_{t-1}} - \frac{\left(x_t - \sqrt{\bar{\alpha}_t}\,x_0\right)^2}{1-\bar{\alpha}_t}\right)\right) \\
&= \exp\Bigg(-\frac{1}{2}\Bigg(\underbrace{\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right)x_{t-1}^2}_{\text{variance term in } x_{t-1}} - \underbrace{\left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right)x_{t-1}}_{\text{mean term in } x_{t-1}} + \underbrace{C(x_t, x_0)}_{\text{independent of } x_{t-1}}\Bigg)\Bigg)
\end{aligned}$$
Note: the derivation uses Bayes' rule, the Markov property, and the Gaussian density function.
Bayes' rule: $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$;
because the diffusion process is Markovian, $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$;
and each conditional probability is rewritten through its probability density function, to which it is proportional. The density of a Gaussian $x \sim N(\mu, \sigma^2)$ is:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Matching the quadratic form above against a Gaussian in $x_{t-1}$ yields the variance and mean:
Variance (observe that it is a constant, independent of $x_t$ and $x_0$):
$$\tilde{\beta}_t = 1\Big/\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right) = 1\Big/\left(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t\left(1-\bar{\alpha}_{t-1}\right)}\right) = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \cdot \beta_t$$
Mean:
$$\begin{aligned}
\tilde{\mu}_t(x_t, x_0) &= \left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right)\Big/\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right) \\
&= \left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right)\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \cdot \beta_t \\
&= \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x_0
\end{aligned}$$

x 0 x_0 x0表示为: x 0 = 1 α ˉ t ( x t − 1 − α ˉ t ϵ t ) x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\epsilon_t) x0=αˉt 1(xt1αˉt ϵt),代入上式得:
$$\begin{aligned}
\tilde{\mu}_t(x_t, x_0) &= \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}x_0 \\
&= \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t\right) \\
&= \frac{\alpha_t - \bar{\alpha}_t}{\sqrt{\alpha_t}\left(1-\bar{\alpha}_t\right)}x_t + \frac{\beta_t}{\sqrt{\alpha_t}\left(1-\bar{\alpha}_t\right)}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t\right) \\
&= \frac{1-\bar{\alpha}_t}{\sqrt{\alpha_t}\left(1-\bar{\alpha}_t\right)}x_t - \frac{\beta_t}{\sqrt{\alpha_t}\left(1-\bar{\alpha}_t\right)}\sqrt{1-\bar{\alpha}_t}\,\epsilon_t \\
&= \frac{1}{\sqrt{\alpha_t}}x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t \\
&= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right)
\end{aligned}$$
Since $\beta_t = 1-\alpha_t$, the final result is:
$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right)$$
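The algebra above can be verified numerically. The sketch below (illustrative NumPy code reusing the assumed linear schedule, not from the original post) samples $x_t$ from the forward closed form and checks that the $(x_t, x_0)$ form and the $(x_t, \epsilon_t)$ form of $\tilde{\mu}_t$ coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 400  # a 1-indexed timestep with 2 <= t <= T
a_t, ab_t, ab_prev, beta_t = alphas[t-1], alpha_bars[t-1], alpha_bars[t-2], betas[t-1]

x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
xt = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps   # forward closed form

# Posterior mean written in terms of (x_t, x_0)
mu_from_x0 = (np.sqrt(a_t) * (1 - ab_prev) / (1 - ab_t)) * xt \
           + (np.sqrt(ab_prev) * beta_t / (1 - ab_t)) * x0

# Posterior mean written in terms of (x_t, eps) after substituting x_0
mu_from_eps = (xt - (1 - a_t) / np.sqrt(1 - ab_t) * eps) / np.sqrt(a_t)

print(np.max(np.abs(mu_from_x0 - mu_from_eps)))     # ~1e-16: the two forms agree
```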

4. Training Objective

Why predict the noise instead of directly predicting the image?[1]

  1. It simplifies the modeling problem
    All timesteps share the same noise-prediction target. Since the network only has to predict the noise at one particular timestep $t$ at a time, its objective is clear and simple. Directly predicting the pixel values of the image $x_0$ would require the model to recover the entire image structure from noise, a much harder problem in high-dimensional space.

Going further, predicting the noise is essentially a denoising task, which is comparatively easier to fit and faster to converge.

  2. Stability and convergence
    In a diffusion model, the noise is a random perturbation added to every pixel. By learning to recover the noise component from a noisy image, the network is essentially learning the image's details rather than its entire structure. Reducing the noise-prediction error therefore makes training more stable.

[1] 第 4 期:DDPM中的损失函数——为什么只预测噪声? Source: https://blog.youkuaiyun.com/m0_45101613/article/details/147340147

4.1 Variational Lower Bound

The likelihood $p_\theta(x_0)$ is the probability of observing the data $x_0$ under model parameters $\theta$. Maximizing the likelihood means:

$$\max_\theta \log p_\theta(x_0)$$

which is equivalent to making the model's generated distribution $p_\theta$ as close as possible to the true data distribution $q(x_0)$.

Because the likelihood involves an integral that generally cannot be computed directly, we approximate it via the variational lower bound (ELBO):
$$\begin{aligned}
\log p(x_0) &= \log \int p(x_{0:T})\,dx_{1:T} \\
&= \log \int \frac{p(x_{0:T})\,q(x_{1:T} \mid x_0)}{q(x_{1:T} \mid x_0)}\,dx_{1:T} \\
&= \log \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \\
&\geq \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]
\end{aligned}$$
Note: the inequality follows from Jensen's inequality. For a concave function $\phi$ and a random variable $X$, $\phi(\mathbb{E}[X]) \geq \mathbb{E}[\phi(X)]$; the logarithm $\log(\cdot)$ is concave (its second derivative is negative).
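A two-line numeric illustration of Jensen's inequality for the logarithm (a sketch, not part of the original derivation):

```python
import numpy as np

x = np.random.default_rng(0).lognormal(size=100_000)  # any positive random variable
print(np.log(x.mean()), np.log(x).mean())             # log E[X] >= E[log X]
```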

Substituting the factorizations of $p(x_{0:T})$ and $q(x_{1:T} \mid x_0)$ and splitting the logarithm, the bound decomposes as:
$$\begin{aligned}
\mathbb{E}_{q(x_{1:T} \mid x_0)}&\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)}{\prod_{t=1}^{T} q(x_t \mid x_{t-1})}\right] \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_T)\,p_\theta(x_0 \mid x_1)\prod_{t=2}^{T} p_\theta(x_{t-1} \mid x_t)}{q(x_T \mid x_{T-1})\prod_{t=1}^{T-1} q(x_t \mid x_{t-1})}\right] \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_T)\,p_\theta(x_0 \mid x_1)\prod_{t=1}^{T-1} p_\theta(x_t \mid x_{t+1})}{q(x_T \mid x_{T-1})\prod_{t=1}^{T-1} q(x_t \mid x_{t-1})}\right] \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log p_\theta(x_0 \mid x_1)\right] + \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p(x_T)}{q(x_T \mid x_{T-1})}\right] + \sum_{t=1}^{T-1}\mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log \frac{p_\theta(x_t \mid x_{t+1})}{q(x_t \mid x_{t-1})}\right] \\
&= \mathbb{E}_{q(x_1 \mid x_0)}\left[\log p_\theta(x_0 \mid x_1)\right] + \mathbb{E}_{q(x_{T-1}, x_T \mid x_0)}\left[\log \frac{p(x_T)}{q(x_T \mid x_{T-1})}\right] + \sum_{t=1}^{T-1}\mathbb{E}_{q(x_{t-1}, x_t, x_{t+1} \mid x_0)}\left[\log \frac{p_\theta(x_t \mid x_{t+1})}{q(x_t \mid x_{t-1})}\right] \\
&= \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}\left[\log p_\theta(x_0 \mid x_1)\right]}_{L_0\text{: reconstruction term}} - \underbrace{\mathbb{E}_{q(x_{T-1} \mid x_0)}\left[D_{\mathrm{KL}}\left(q(x_T \mid x_{T-1}) \,\|\, p(x_T)\right)\right]}_{L_T\text{: prior matching term}} \\
&\quad - \sum_{t=1}^{T-1}\underbrace{\mathbb{E}_{q(x_{t-1}, x_{t+1} \mid x_0)}\left[D_{\mathrm{KL}}\left(q(x_t \mid x_{t-1}) \,\|\, p_\theta(x_t \mid x_{t+1})\right)\right]}_{L_t\text{: consistency term}}
\end{aligned}$$
The expansion contains three kinds of terms:

  1. $L_T$: prior matching term. It constrains the final noised distribution to be close to the standard normal. When $T$ is large enough, $q(x_T \mid x_0)$ is already standard normal, so this term can be ignored.
  2. $L_t$: denoising matching term (the core term): the KL divergence between the true distribution and the model distribution.
  3. $L_0$: reconstruction term, usually handled with a discretized Gaussian or a fixed variance.

The optimization objective is:

$$\mathcal{L}_{\text{VLB}} = \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]$$
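The denoising term reduces to a squared difference of means because both distributions inside the KL are Gaussians. For Gaussians sharing a fixed covariance $\sigma^2\boldsymbol{I}$, a standard identity (stated here as a bridging step; the original post does not spell it out) gives:
$$D_{\mathrm{KL}}\left(\mathcal{N}(\mu_1, \sigma^2\boldsymbol{I}) \,\|\, \mathcal{N}(\mu_2, \sigma^2\boldsymbol{I})\right) = \frac{1}{2\sigma^2}\left\|\mu_1 - \mu_2\right\|^2$$
Applying it with $\mu_1 = \tilde{\mu}_t$ and $\mu_2 = \mu_\theta$ yields: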
$$\begin{aligned}
L_t &= \mathbb{E}_{x_0, \epsilon}\left[\frac{1}{2\left\|\Sigma_\theta(x_t, t)\right\|_2^2}\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] \\
&= \mathbb{E}_{x_0, \epsilon}\left[\frac{1}{2\left\|\Sigma_\theta\right\|_2^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right) - \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)\right\|^2\right] \\
&= \mathbb{E}_{x_0, \epsilon}\left[\frac{\left(1-\alpha_t\right)^2}{2\alpha_t\left(1-\bar{\alpha}_t\right)\left\|\Sigma_\theta\right\|_2^2}\left\|\epsilon_t - \epsilon_\theta(x_t, t)\right\|^2\right] \\
&= \mathbb{E}_{x_0, \epsilon}\left[\frac{\left(1-\alpha_t\right)^2}{2\alpha_t\left(1-\bar{\alpha}_t\right)\left\|\Sigma_\theta\right\|_2^2}\left\|\epsilon_t - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t,\; t\right)\right\|^2\right]
\end{aligned}$$
where the model parameterizes the mean as
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$
so that
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\; \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right),\; \Sigma_\theta(x_t, t)\right)$$

In practice, the simplified form (dropping the weighting coefficient) is used:

$$\begin{aligned}
\mathcal{L}_{\text{simple}} &= \mathbb{E}_{t, x_0, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right] \\
&= \mathbb{E}_{t \sim [1, T],\, x_0,\, \epsilon_t}\left[\left\|\epsilon_t - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t,\; t\right)\right\|^2\right]
\end{aligned}$$
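The simplified loss maps directly onto a training step. Below is a minimal PyTorch sketch (an illustration, not the original post's code; `model` is assumed to be any network that takes `(x_t, t)` and returns a tensor shaped like `x_t`):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def loss_simple(model, x0):
    """L_simple: mean squared error between the true and predicted noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)        # t ~ Uniform[1, T] (0-indexed here)
    eps = torch.randn_like(x0)                             # eps ~ N(0, I)
    ab = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps         # forward closed form q(x_t | x_0)
    return ((eps - model(x_t, t)) ** 2).mean()
```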

5. Sampling Algorithm


  1. x T ∼ N ( 0 , I ) x_T \sim \mathcal{N}(0, \boldsymbol{I}) xTN(0,I)开始
  2. 对于 t = T , . . . , 1 t = T,...,1 t=T,...,1:
    • 预测噪声 ϵ θ ( x t , t ) \epsilon_\theta(x_t,t) ϵθ(xt,t)
    • 计算均值 μ θ ( x t , t ) \mu_\theta(x_t,t) μθ(xt,t)
    • 采样 x t − 1 ∼ N ( μ θ , σ t 2 I ) x_{t-1} \sim \mathcal{N}(\mu_\theta,\sigma_t^2 \boldsymbol{I}) xt1N(μθ,σt2I)
  3. 输出 x 0 x_0 x0
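Written out in code, the loop looks like the following sketch (illustrative PyTorch, consistent with the training sketch above; it assumes the common choice $\sigma_t^2 = \beta_t$, and that no noise is added at the final step):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    x = torch.randn(shape)                                 # x_T ~ N(0, I)
    for t in reversed(range(T)):                           # t = T-1, ..., 0 (0-indexed)
        eps_hat = model(x, torch.full((shape[0],), t))     # predict the noise
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                 # sample x_{t-1}
    return x                                               # approximate sample of x_0
```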

6. Implementation Details

| Hyperparameter | Typical value | Notes |
| --- | --- | --- |
| $T$ | 1000 | total number of steps |
| $\beta_{\text{min}}$ | 1e-4 | initial noise level |
| $\beta_{\text{max}}$ | 0.02 | final noise level |
| Network architecture | U-Net | with time embedding |

7. Summary of Core Formulas

  1. Forward process:
    $$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

  2. Reverse mean:
    $$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\right)$$

  3. Training objective:
    $$\min_\theta \left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2$$

References

  1. 扩散模型原理+DDPM案例代码解析
  2. DDPM扩散模型公式推理——扩散和逆扩散过程
  3. What are Diffusion Models?
  4. Original paper: Denoising Diffusion Probabilistic Models

Corrections are welcome if you spot any mistakes.

Questions and discussion are welcome in the comments.
