Denoising Diffusion Probabilistic Models -- Mathematical Derivation of Diffusion Models (2)

Previous post: Denoising Diffusion Probabilistic Models -- Mathematical Derivation of Diffusion Models (1)

III. The Learning Target

Let us first review the formulas for the diffusion (forward) process and the generative (reverse) process derived earlier:

  • Diffusion process: $q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)I\right)$
  • Generative process: $q(X_{t-1} \mid X_t, X_0) = \mathcal{N}\left(X_{t-1};\ \frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t\right),\ \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}(1-\alpha_t)I\right)$
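
To make the closed-form forward process concrete, here is a minimal sketch (the linear $\beta$ schedule and tensor shapes are illustrative assumptions, not part of the derivation):

```python
import torch

# Closed-form forward process q(x_t | x_0):
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
# The linear beta schedule below is an assumption (DDPM uses 1e-4 to 0.02).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_1 ... beta_T
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in a single step (t is 0-indexed here)."""
    eps = torch.randn_like(x0)               # eps ~ N(0, I)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)               # a dummy batch of "images"
x500 = q_sample(x0, t=499)                   # a heavily noised sample
```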

For the diffusion process, all parameters are known in advance, so no neural network needs to be trained.

For the generative process, the variance $\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}(1-\alpha_t)$ is a known parameter, but the $\epsilon_t$ in the mean is exactly the Gaussian noise added during the forward process. In the generative process we therefore use a neural network to predict this random Gaussian noise and then remove $\epsilon_t$ from $X_t$. Thus $\epsilon_t$ is the only quantity to be learned, and the network parameters are denoted $\theta$.

Therefore:

$$\begin{aligned} p_{\theta}(X_{t-1} \mid X_t) &= q(X_{t-1} \mid X_t) \\ &= \mathcal{N}\left(X_{t-1};\ \frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t\right),\ \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}(1-\alpha_t)I\right) \end{aligned}$$
The generative process can then be written as:

$$p_{\theta}(X_{0:T}) = p(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1} \mid X_t)$$

IV. Loss Function

From the derivation above, the generative process actually amounts to finding a $\theta$ that maximizes $p_{\theta}(X_0)$. We can estimate $\theta$ with the maximum likelihood estimation commonly used for probabilistic models:

$$\operatorname*{arg\,max}_{\theta}\ \log p_{\theta}(X_0)$$

Since $\int q(X_{1:T} \mid X_0)\,dX_{1:T} = 1$, we can write:

$$\begin{aligned} \log p_{\theta}(X_0) &= \int q(X_{1:T} \mid X_0)\,\log p_{\theta}(X_0)\,dX_{1:T} \\ &= \int q(X_{1:T} \mid X_0)\,\log\frac{p_{\theta}(X_{0:T})}{p_{\theta}(X_{1:T} \mid X_0)}\,dX_{1:T} \\ &= \int q(X_{1:T} \mid X_0)\,\log\frac{p_{\theta}(X_{0:T})\,q(X_{1:T} \mid X_0)}{p_{\theta}(X_{1:T} \mid X_0)\,q(X_{1:T} \mid X_0)}\,dX_{1:T} \\ &= \int q(X_{1:T} \mid X_0)\,\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T} \mid X_0)}\,dX_{1:T} + \int q(X_{1:T} \mid X_0)\,\log\frac{q(X_{1:T} \mid X_0)}{p_{\theta}(X_{1:T} \mid X_0)}\,dX_{1:T} \\ &= \int q(X_{1:T} \mid X_0)\,\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T} \mid X_0)}\,dX_{1:T} + D_{KL}\left(q(X_{1:T} \mid X_0)\,\|\,p_{\theta}(X_{1:T} \mid X_0)\right) \\ &\geqslant \int q(X_{1:T} \mid X_0)\,\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T} \mid X_0)}\,dX_{1:T} \qquad \text{(since KL divergence is non-negative)} \\ &= E_{X_{1:T}\sim q(X_{1:T} \mid X_0)}\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T} \mid X_0)}\right] \end{aligned}$$
Negating both sides gives:

$$-\log p_{\theta}(X_0) \leqslant -E_{X_{1:T}\sim q(X_{1:T} \mid X_0)}\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T} \mid X_0)}\right] = E_{X_{1:T}\sim q(X_{1:T} \mid X_0)}\left[\log\frac{q(X_{1:T} \mid X_0)}{p_{\theta}(X_{0:T})}\right]$$

That is, we obtain:

$$L_{VLB} = E_{q(X_{0:T})}\left[\log\frac{q(X_{1:T} \mid X_0)}{p_{\theta}(X_{0:T})}\right] \geq -\log p_{\theta}(X_0)$$
Minimizing $-\log p_{\theta}(X_0)$ thus turns into minimizing its upper bound $L_{VLB}$. Let us now analyze $L_{VLB}$ further:
$$\begin{aligned} L_{VLB} &= E_{q(X_{0:T})}\left[\log\frac{q(X_{1:T} \mid X_0)}{p_{\theta}(X_{0:T})}\right] \\ &= E_{q(X_{0:T})}\left[\log\frac{\prod_{t=1}^T q(X_t \mid X_{t-1})}{p_{\theta}(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1} \mid X_t)}\right] \\ &= E_{q(X_{0:T})}\left[-\log p_{\theta}(X_T) + \sum_{t=1}^T \log\frac{q(X_t \mid X_{t-1})}{p_{\theta}(X_{t-1} \mid X_t)}\right] \\ &= E_{q(X_{0:T})}\left[-\log p_{\theta}(X_T) + \sum_{t=2}^T \log\frac{q(X_t \mid X_{t-1})}{p_{\theta}(X_{t-1} \mid X_t)} + \log\frac{q(X_1 \mid X_0)}{p_{\theta}(X_0 \mid X_1)}\right] \end{aligned}$$

By the Markov property, we have:

$$\begin{aligned} q(X_t \mid X_{t-1}) &= q(X_t \mid X_{t-1}, X_0) \\ &= \frac{q(X_t, X_{t-1}, X_0)}{q(X_{t-1}, X_0)} \\ &= \frac{q(X_{t-1} \mid X_t, X_0)\,q(X_t \mid X_0)}{q(X_{t-1} \mid X_0)} \end{aligned}$$
$$\begin{aligned} L_{VLB} &= E_q\left[-\log p_{\theta}(X_T) + \sum_{t=2}^T \log\left(\frac{q(X_{t-1} \mid X_t, X_0)}{p_{\theta}(X_{t-1} \mid X_t)} \cdot \frac{q(X_t \mid X_0)}{q(X_{t-1} \mid X_0)}\right) + \log\frac{q(X_1 \mid X_0)}{p_{\theta}(X_0 \mid X_1)}\right] \\ &= E_q\left[-\log p_{\theta}(X_T) + \sum_{t=2}^T \log\frac{q(X_{t-1} \mid X_t, X_0)}{p_{\theta}(X_{t-1} \mid X_t)} + \sum_{t=2}^T \log\frac{q(X_t \mid X_0)}{q(X_{t-1} \mid X_0)} + \log\frac{q(X_1 \mid X_0)}{p_{\theta}(X_0 \mid X_1)}\right] \\ &= E_q\left[-\log p_{\theta}(X_T) + \sum_{t=2}^T \log\frac{q(X_{t-1} \mid X_t, X_0)}{p_{\theta}(X_{t-1} \mid X_t)} + \log\frac{q(X_T \mid X_0)}{q(X_1 \mid X_0)} + \log\frac{q(X_1 \mid X_0)}{p_{\theta}(X_0 \mid X_1)}\right] \\ &= E_q\left[\log\frac{1}{p_{\theta}(X_T)} + \sum_{t=2}^T \log\frac{q(X_{t-1} \mid X_t, X_0)}{p_{\theta}(X_{t-1} \mid X_t)} + \log\left(\frac{q(X_T \mid X_0)}{q(X_1 \mid X_0)} \cdot \frac{q(X_1 \mid X_0)}{p_{\theta}(X_0 \mid X_1)}\right)\right] \\ &= E_q\left[\log\frac{1}{p_{\theta}(X_T)} + \sum_{t=2}^T \log\frac{q(X_{t-1} \mid X_t, X_0)}{p_{\theta}(X_{t-1} \mid X_t)} + \log q(X_T \mid X_0) + \log\frac{1}{p_{\theta}(X_0 \mid X_1)}\right] \\ &= E_q\left[\log\frac{q(X_T \mid X_0)}{p_{\theta}(X_T)} + \sum_{t=2}^T \log\frac{q(X_{t-1} \mid X_t, X_0)}{p_{\theta}(X_{t-1} \mid X_t)} - \log p_{\theta}(X_0 \mid X_1)\right] \\ &= E_q\left[D_{KL}\left(q(X_T \mid X_0)\,\|\,p_{\theta}(X_T)\right) + \sum_{t=2}^T D_{KL}\left(q(X_{t-1} \mid X_t, X_0)\,\|\,p_{\theta}(X_{t-1} \mid X_t)\right) - \log p_{\theta}(X_0 \mid X_1)\right] \end{aligned}$$
So $L_{VLB} = L_T + L_{T-1} + \ldots + L_0$, where

$$\begin{aligned} L_T &= D_{KL}\left(q(X_T \mid X_0)\,\|\,p_{\theta}(X_T)\right) \\ L_{t-1} &= D_{KL}\left(q(X_{t-1} \mid X_t, X_0)\,\|\,p_{\theta}(X_{t-1} \mid X_t)\right), \quad 2 \leq t \leq T \\ L_0 &= -\log p_{\theta}(X_0 \mid X_1) \end{aligned}$$
Next we examine $L_T$, $L_{t-1}$, and $L_0$ separately:

  • $L_T$ requires no optimization: $q(X_T \mid X_0)$ is the known forward process, and $p_{\theta}(X_T)$ is the known distribution of pure Gaussian noise. $L_T$ is therefore fixed and can be treated as a constant.

  • $L_0$ requires no optimization either. DDPM fixes $p_{\theta}(X_0 \mid X_1)$ to a predefined procedure: an independent, discrete decoder derived from the Gaussian distribution.

As for $L_{t-1}$:

  • $q(X_{t-1} \mid X_t, X_0) = \mathcal{N}\left(X_{t-1};\ \tilde{\mu}_t(X_t, X_0),\ \tilde{\beta}_t I\right)$ can be computed in closed form
  • $p_{\theta}(X_{t-1} \mid X_t) = \mathcal{N}\left(X_{t-1};\ \mu_{\theta}(X_t, t),\ \Sigma_{\theta}(X_t, t)\right)$ is the distribution the network is trained to fit

The KL divergence between two univariate Gaussians $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$ is:

$$D_{KL}(P\,\|\,Q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
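
As a quick numerical sanity check of this formula, one can compare it against `torch.distributions` (the specific means and standard deviations below are arbitrary):

```python
import torch

# Verify the closed-form Gaussian KL divergence numerically.
mu1, sigma1 = 0.3, 0.8
mu2, sigma2 = -0.1, 1.2

p = torch.distributions.Normal(mu1, sigma1)
q = torch.distributions.Normal(mu2, sigma2)

closed_form = (torch.tensor(sigma2 / sigma1).log()
               + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)
library = torch.distributions.kl_divergence(p, q)

print(closed_form.item(), library.item())  # the two values should match
```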
Both $q(X_{t-1} \mid X_t, X_0)$ and $p_{\theta}(X_{t-1} \mid X_t)$ have constant variances, so what must be optimized is the squared distance between the two Gaussian means, $(\mu_1 - \mu_2)^2$, i.e. we optimize:
$$\begin{aligned} L_{t-1} &= E_q\left[\left\|\tilde{\mu}_t(X_t, X_0) - \mu_{\theta}(X_t, t)\right\|^2\right] \\ &= E_{X_0,\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t\right) - \mu_{\theta}(X_t, t)\right\|^2\right] \end{aligned}$$
We can see that the optimization target for $\mu_{\theta}(X_t, t)$ is to get as close as possible to $\frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t\right)$. Since $X_t$ is the input to $\mu_{\theta}$ and is known at time $t$, the only unknown is $\epsilon_t$. We can therefore parameterize $\mu_{\theta}(X_t, t)$ as:
$$\mu_{\theta}(X_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_{\theta}(X_t, t)\right)$$
Hence:
$$\begin{aligned} L_{t-1} &= E_{X_0,\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t\right) - \frac{1}{\sqrt{\alpha_t}}\left(X_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_{\theta}(X_t, t)\right)\right\|^2\right] \\ &= E_{X_0,\epsilon}\left[\frac{\beta_t^2}{\alpha_t(1-\bar{\alpha}_t)}\left\|\epsilon_t - \epsilon_{\theta}(X_t, t)\right\|^2\right] \\ &\propto E_{X_0,\epsilon}\left[\left\|\epsilon_t - \epsilon_{\theta}(X_t, t)\right\|^2\right] \end{aligned}$$
Substituting $X_t = \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t$ into the expression above gives:

$$L_{t-1} = E_{X_0,\epsilon}\left[\left\|\epsilon_t - \epsilon_{\theta}\left(\sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t,\ t\right)\right\|^2\right]$$

Here $\sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t$ is simply the input data with Gaussian noise added, and $\epsilon_{\theta}\left(\sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t,\ t\right)$ denotes a noise-prediction network that takes $\sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t$ and $t$ as input and outputs the predicted noise. So what the DDPM network actually does is estimate the noise added during the diffusion process.

In summary, only the $L_{t-1}$ terms need to be optimized. After all this derivation, the DDPM loss function is simply the $L_{t-1}$ above, i.e. an L2 loss.
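
A minimal sketch of this simplified objective, assuming `model(x_t, t)` is some noise-prediction network (a U-Net in the DDPM paper) and that timesteps are sampled uniformly:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor, alpha_bars: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM loss: MSE between the true and predicted noise.

    `model(x_t, t)` is assumed to be any noise-prediction network;
    `alpha_bars` holds the precomputed cumulative products of alpha_t.
    """
    batch = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (batch,), device=x0.device)  # random timesteps
    eps = torch.randn_like(x0)                                         # eps_t ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))            # broadcast to x0's shape
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps                 # closed-form forward sample
    return F.mse_loss(model(x_t, t), eps)                              # ||eps - eps_theta(x_t, t)||^2
```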

V. Training and Inference

1. Training Procedure

  1. Input data: sample a clean data sample $x_0$ from the dataset.
  2. Forward diffusion process:
    • Gradually add Gaussian noise to $x_0$, producing a sequence of noisy samples $x_1, x_2, \dots, x_T$.
    • Each diffusion step follows:

      $$x_t = \sqrt{1-\beta_t}\cdot x_{t-1} + \sqrt{\beta_t}\cdot\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)$$

      where $\beta_t$ is the noise-schedule parameter.
  3. Noise prediction:
    • For each timestep $t$, the model $N_\theta(x_t, t)$ predicts the noise $\epsilon_t$ that was added to $x_t$.
  4. Loss computation:
    • The loss is the mean squared error (MSE) between the predicted noise $N_\theta(x_t, t)$ and the actual noise $\epsilon_t$:

      $$\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon_t}\left[\left\|\epsilon_t - N_\theta(x_t, t)\right\|^2\right]$$
  5. Backpropagation:
    • Update the model parameters $\theta$ by gradient descent to minimize the loss; a compact version of this loop is sketched below.
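
A hypothetical end-to-end training loop tying these five steps together (the `model`, `dataloader`, and linear $\beta$ schedule are assumptions; `ddpm_loss` is the sketch from the previous section):

```python
import torch

# Precompute the (assumed) linear noise schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def train(model, dataloader, epochs: int = 10, lr: float = 2e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x0 in dataloader:                        # step 1: clean samples
            loss = ddpm_loss(model, x0, alpha_bars)  # steps 2-4: noise, predict, MSE
            optimizer.zero_grad()
            loss.backward()                          # step 5: backpropagation
            optimizer.step()
```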

2. Inference (Generation) Procedure

  1. Initialization:
    • Sample a random noise $x_T \sim \mathcal{N}(0, I)$ from the standard normal distribution.
  2. Reverse denoising process:
    • Starting from $t = T$, denoise step by step to generate $x_{T-1}, x_{T-2}, \dots, x_0$.
    • Each denoising step follows:

      $$x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} N_\theta(x_t, t)\right) + \sqrt{\beta_t}\cdot z, \quad z \sim \mathcal{N}(0, I)$$

      where:
      • $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$ is the cumulative noise-schedule parameter;
      • $N_\theta(x_t, t)$ is the noise predicted by the model;
      • $z$ is extra noise injected to keep the process stochastic (conventionally, no noise is added at the final step).
  3. Output:
    • At $t = 0$ we obtain the generated sample $x_0$; a sampling sketch follows this list.
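
A hypothetical sampling loop implementing these steps, reusing the `betas`/`alpha_bars` schedule defined above (using $\sigma_t^2 = \beta_t$ as the sampling variance; the posterior variance from Section III is an equally valid choice):

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 32, 32)):
    T = len(betas)
    x = torch.randn(shape)                           # x_T ~ N(0, I)
    for t in reversed(range(T)):                     # t = T-1, ..., 0 (0-indexed)
        z = torch.randn_like(x) if t > 0 else 0.0    # no extra noise at the last step
        eps = model(x, torch.full((shape[0],), t))   # predicted noise N_theta(x_t, t)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
        x = mean + betas[t].sqrt() * z
    return x                                         # the generated x_0
```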

3. Training vs. Inference

| Step | Training | Inference |
| --- | --- | --- |
| Input | clean data $x_0$ | random noise $x_T \sim \mathcal{N}(0, I)$ |
| Process | forward diffusion (add noise) + noise prediction + loss computation | reverse denoising (step-by-step generation) |
| Goal | minimize the difference between predicted and actual noise | generate high-quality data from noise |
| Timesteps | from $t = 1$ to $t = T$ | from $t = T$ to $t = 0$ |
| Model's role | predict the noise $\epsilon_t$ added at each step | predict the noise $\epsilon_t$ at each step, used for denoising |
| Output | updated model parameters $\theta$ | generated data $x_0$ |
