Denoising Diffusion Probabilistic Models – Mathematical Derivation (1)
3. Learning Objective
Let us first recall the formulas for the diffusion (forward) process and the generative (reverse) process derived earlier:

- Diffusion process: $q(X_t|X_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,X_0,\ (1-\bar{\alpha}_t)I\right)$
- Generative process: $q\left(X_{t-1}|X_t,X_0\right) = \mathcal{N}\left(X_{t-1};\ \frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_t\right),\ \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)I\right)$
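As a quick aside, the forward formula can be implemented directly, since $X_t$ can be sampled from $X_0$ in a single step. A minimal sketch (the linear $\beta$ schedule, tensor shapes, and schedule values here are illustrative assumptions, not from the derivation):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I) in one shot."""
    if eps is None:
        eps = torch.randn_like(x0)
    abar = alpha_bars[t]
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)                 # a batch of "clean" data
xt = q_sample(x0, t=500)                       # a heavily noised version of x0
```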
For the diffusion process, every quantity is a known parameter, so no neural network needs to be trained.

For the generative process, the variance $\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)$ is likewise known, but the $\epsilon_t$ in the mean is the Gaussian noise that was added during the forward process. In the generative direction we must use a neural network to predict this random noise and then remove $\epsilon_t$ from $X_t$. Predicting $\epsilon_t$ is therefore the only thing that has to be learned; we denote the parameters of the prediction network by $\theta$.
Therefore:
$$
\begin{aligned}
p_{\theta}(X_{t-1}|X_t) &= q(X_{t-1}|X_t) \\
&= \mathcal{N}\left(X_{t-1};\ \frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_t\right),\ \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\left(1-\alpha_t\right)I\right)
\end{aligned}
$$
The generative process can then be written as:
$$p_{\theta}(X_{0:T}) = p(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1}|X_t)$$
4. Loss Function
The derivation above shows that in the generative process we are really looking for a $\theta$ that maximizes $p_{\theta}(X_0)$. As is standard for probabilistic models, we can estimate $\theta$ by maximum likelihood:
$$\argmax_{\theta}\ \log p_{\theta}(X_0)$$
Expanding $\log p_{\theta}(X_0)$ (the first step simply multiplies by $\int q(X_{1:T}|X_0)\,dX_{1:T}=1$):

$$
\begin{aligned}
\log p_{\theta}(X_0) &= \int_{X_{1:T}} q(X_{1:T}|X_0)\,\log p_{\theta}(X_0)\,dX_{1:T} \\
&= \int_{X_{1:T}} q(X_{1:T}|X_0)\,\log\frac{p_{\theta}(X_{0:T})}{p_{\theta}(X_{1:T}|X_0)}\,dX_{1:T} \\
&= \int_{X_{1:T}} q(X_{1:T}|X_0)\,\log\frac{p_{\theta}(X_{0:T})\,q(X_{1:T}|X_0)}{p_{\theta}(X_{1:T}|X_0)\,q(X_{1:T}|X_0)}\,dX_{1:T} \\
&= \int_{X_{1:T}} q(X_{1:T}|X_0)\,\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}\,dX_{1:T} + \int_{X_{1:T}} q(X_{1:T}|X_0)\,\log\frac{q(X_{1:T}|X_0)}{p_{\theta}(X_{1:T}|X_0)}\,dX_{1:T} \\
&= \int_{X_{1:T}} q(X_{1:T}|X_0)\,\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}\,dX_{1:T} + D_{KL}\!\left(q(X_{1:T}|X_0)\,\|\,p_{\theta}(X_{1:T}|X_0)\right) \\
&\geqslant \int_{X_{1:T}} q(X_{1:T}|X_0)\,\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}\,dX_{1:T} \qquad \text{(since the KL divergence is non-negative)} \\
&= E_{X_{1:T}\sim q(X_{1:T}|X_0)}\!\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}\right]
\end{aligned}
$$
Negating both sides:
$$
\begin{aligned}
-\log p_{\theta}(X_0) &\leqslant -E_{X_{1:T}\sim q(X_{1:T}|X_0)}\!\left[\log\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}\right] \\
&= E_{X_{1:T}\sim q(X_{1:T}|X_0)}\!\left[\log\frac{q(X_{1:T}|X_0)}{p_{\theta}(X_{0:T})}\right]
\end{aligned}
$$
This gives us:
$$L_{VLB} = E_{q(X_{0:T})}\!\left[\log\frac{q(X_{1:T}|X_0)}{p_{\theta}(X_{0:T})}\right] \geq -\log p_{\theta}(X_0)$$
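The bound is easy to verify numerically. Here is a small sanity check on a toy latent-variable model (the model and both choices of $q$ are illustrative assumptions, unrelated to DDPM): take $p(z)=\mathcal{N}(0,1)$ and $p(x|z)=\mathcal{N}(z,1)$, so that $p(x)=\mathcal{N}(0,2)$ and $-\log p(x)$ is known in closed form; the Monte Carlo estimate of $E_q[\log q/p]$ should match it when $q$ is the true posterior and exceed it otherwise:

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
x = torch.tensor(1.3)                      # an observed data point

def neg_elbo(q, n=200_000):
    """Monte Carlo estimate of E_q[log q(z|x) / p(x, z)], the upper bound on -log p(x)."""
    z = q.sample((n,))                     # z ~ q(z|x)
    log_q = q.log_prob(z)
    log_joint = Normal(0., 1.).log_prob(z) + Normal(z, 1.).log_prob(x)
    return (log_q - log_joint).mean()

nll = -Normal(0., torch.tensor(2.).sqrt()).log_prob(x)    # exact -log p(x)

exact_posterior = Normal(x / 2, torch.tensor(0.5).sqrt()) # true p(z|x) for this model
sloppy_q        = Normal(0.0, 1.0)                        # a deliberately wrong q

print(float(nll))                          # ~1.688
print(float(neg_elbo(exact_posterior)))    # ~1.688: the bound is tight here
print(float(neg_elbo(sloppy_q)))           # ~2.26: strictly larger, bound holds
```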
Minimizing $-\log p_{\theta}(X_0)$ therefore reduces to minimizing its upper bound $L_{VLB}$. We now break $L_{VLB}$ down further:
$$
\begin{aligned}
L_{VLB} &= E_{q(X_{0:T})}\!\left[\log\frac{q(X_{1:T}|X_0)}{p_{\theta}(X_{0:T})}\right] \\
&= E_{q(X_{0:T})}\!\left[\log \frac{\prod_{t=1}^T q(X_t \mid X_{t-1})}{p_{\theta}(X_T)\prod_{t=1}^T p_{\theta}(X_{t-1}\mid X_t)} \right] \\
&= E_{q(X_{0:T})}\!\left[-\log p_{\theta}(X_T) + \sum_{t=1}^T\log \frac{q(X_t\mid X_{t-1})}{p_{\theta}(X_{t-1}\mid X_t)} \right] \\
&= E_{q(X_{0:T})}\!\left[-\log p_{\theta}(X_T) + \sum_{t=2}^T\log \frac{q(X_t\mid X_{t-1})}{p_{\theta}(X_{t-1}\mid X_t)} + \log \frac{q(X_1\mid X_0)}{p_{\theta}(X_0\mid X_1)} \right]
\end{aligned}
$$
By the Markov property of the forward chain, conditioning additionally on $X_0$ changes nothing, and Bayes' rule then gives:

$$
\begin{aligned}
q\left(X_t \mid X_{t-1}\right) &= q\left(X_t \mid X_{t-1}, X_0\right) \\
&= \frac{q\left(X_t, X_{t-1}, X_0\right)}{q\left(X_{t-1}, X_0\right)} \\
&= \frac{q\left(X_{t-1} \mid X_t, X_0\right)\, q\left(X_t \mid X_0\right)}{q\left(X_{t-1} \mid X_0\right)}
\end{aligned}
$$
Substituting this identity into the terms with $t \geq 2$ (note that the middle sum telescopes):

$$
\begin{aligned}
L_{VLB} &= E_q\!\left[-\log p_\theta(X_T)+\sum_{t=2}^T \log \left(\frac{q(X_{t-1} \mid X_t, X_0)}{p_\theta(X_{t-1} \mid X_t)} \cdot \frac{q(X_t \mid X_0)}{q(X_{t-1} \mid X_0)}\right)+\log \frac{q(X_1 \mid X_0)}{p_\theta(X_0 \mid X_1)}\right] \\
&= E_q\!\left[-\log p_\theta(X_T)+\sum_{t=2}^T \log \frac{q(X_{t-1} \mid X_t, X_0)}{p_\theta(X_{t-1} \mid X_t)}+\sum_{t=2}^T \log \frac{q(X_t \mid X_0)}{q(X_{t-1} \mid X_0)}+\log \frac{q(X_1 \mid X_0)}{p_\theta(X_0 \mid X_1)}\right] \\
&= E_q\!\left[-\log p_\theta(X_T)+\sum_{t=2}^T \log \frac{q(X_{t-1} \mid X_t, X_0)}{p_\theta(X_{t-1} \mid X_t)}+\log \frac{q(X_T \mid X_0)}{q(X_1 \mid X_0)}+\log \frac{q(X_1 \mid X_0)}{p_\theta(X_0 \mid X_1)}\right] \\
&= E_q\!\left[\log \frac{1}{p_\theta(X_T)}+\sum_{t=2}^T \log \frac{q(X_{t-1} \mid X_t, X_0)}{p_\theta(X_{t-1} \mid X_t)}+\log \frac{q(X_T \mid X_0)}{q(X_1 \mid X_0)} \frac{q(X_1 \mid X_0)}{p_\theta(X_0 \mid X_1)}\right] \\
&= E_q\!\left[\log \frac{1}{p_\theta(X_T)}+\sum_{t=2}^T \log \frac{q(X_{t-1} \mid X_t, X_0)}{p_\theta(X_{t-1} \mid X_t)}+\log q(X_T \mid X_0)+\log\frac{1}{p_\theta(X_0 \mid X_1)}\right] \\
&= E_q\!\left[\log \frac{q(X_T \mid X_0)}{p_\theta(X_T)}+\sum_{t=2}^T \log \frac{q(X_{t-1} \mid X_t, X_0)}{p_\theta(X_{t-1} \mid X_t)}-\log p_\theta(X_0 \mid X_1)\right] \\
&= E_q\!\left[D_{KL}\left(q(X_T \mid X_0) \,\|\, p_\theta(X_T)\right)+\sum_{t=2}^T D_{KL}\left(q(X_{t-1} \mid X_t, X_0) \,\|\, p_\theta(X_{t-1} \mid X_t)\right)-\log p_\theta(X_0 \mid X_1)\right]
\end{aligned}
$$
Write $L_{VLB} = L_T + L_{T-1} + \ldots + L_0$, where
$$
\begin{aligned}
L_T &= D_{KL}\left(q\left(X_T|X_0\right) \,\|\, p_\theta\left(X_T\right)\right) \\
L_{t-1} &= D_{KL}\left(q\left(X_{t-1}|X_t, X_0\right) \,\|\, p_\theta\left(X_{t-1}|X_t\right)\right), \quad 2 \leq t \leq T \\
L_0 &= -\log p_\theta\left(X_0|X_1\right)
\end{aligned}
$$
We now examine $L_T$, $L_{t-1}$, and $L_0$ in turn:
- $L_T$ needs no optimization: $q\left(X_T|X_0\right)$ is the known forward process, and $p_\theta\left(X_T\right)$ is the known distribution of pure Gaussian noise, so $L_T$ can be treated as a constant.
- $L_0$ needs no optimization either: DDPM fixes $p_\theta\left(X_0|X_1\right)$ as a predefined step, an independent discrete decoder derived from a Gaussian distribution.
For the middle terms $L_{t-1}$:

- $q\left(X_{t-1}|X_t,X_0\right)=\mathcal{N}\left(X_{t-1};\ \tilde{\mu}_t\left(X_t,X_0\right),\ \tilde{\beta}_tI\right)$ can be computed in closed form (its explicit parameters are given right after this list);
- $p_{\theta}\left(X_{t-1}|X_t\right)=\mathcal{N}\left(X_{t-1};\ \mu_{\theta}\left(X_t,t\right),\ \Sigma_{\theta}\left(X_t,t\right)\right)$ is the distribution the network is trained to fit.
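For reference, applying Bayes' rule to the forward-process Gaussians yields the standard closed forms. They are equivalent to the mean quoted in Section 3 once $X_0$ is rewritten in terms of $\epsilon_t$, and the variance matches as well, since $\beta_t = 1-\alpha_t$:

$$\tilde{\mu}_t\left(X_t,X_0\right)=\frac{\sqrt{\alpha_t}\,(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}X_t+\frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t}X_0, \qquad \tilde{\beta}_t=\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\,\beta_t$$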
The KL divergence between two (one-dimensional) Gaussians $P=\mathcal{N}(\mu_1,\sigma_1^2)$ and $Q=\mathcal{N}(\mu_2,\sigma_2^2)$ is:

$$D_{KL}(P\,||\,Q)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+\left(\mu_1-\mu_2\right)^2}{2\sigma_2^2}-\frac{1}{2}$$
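This closed form is easy to check against a library implementation; a quick sketch (the numbers are arbitrary):

```python
import math

import torch
from torch.distributions import Normal, kl_divergence

mu1, s1 = 0.4, 0.9
mu2, s2 = -0.2, 1.5

# The closed form from the text: log(s2/s1) + (s1^2 + (mu1-mu2)^2) / (2 s2^2) - 1/2
analytic = math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# PyTorch's registered KL for a pair of Normal distributions
library = kl_divergence(Normal(mu1, s1), Normal(mu2, s2))
print(analytic, float(library))   # the two values agree
```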
Moreover, the variances of $q\left(X_{t-1}|X_t,X_0\right)$ and $p_{\theta}\left(X_{t-1}|X_t\right)$ are both constants (DDPM fixes $\Sigma_{\theta}\left(X_t,t\right)=\sigma_t^2 I$ rather than learning it), so all that remains to optimize is the squared difference of the two means, $(\mu_1-\mu_2)^2$. Up to constant factors, we therefore minimize:
$$
\begin{aligned}
L_{t-1} &= E_{q}\left[\left\|\tilde{\mu}_t\left(X_t,X_{0}\right)-\mu_{\theta}\left(X_{t},t\right)\right\|^{2}\right] \\
&= E_{X_{0},\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t}\right)-\mu_{\theta}\left(X_{t},t\right)\right\|^{2}\right]
\end{aligned}
$$
Evidently the training target for $\mu_{\theta}\left(X_{t},t\right)$ is to approximate $\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t}\right)$ as closely as possible. Since $X_t$ is the input to $\mu_\theta$ and is known at time $t$, the only unknown is $\epsilon_t$. We can therefore parameterize $\mu_\theta(X_t,t)$ as:
$$\mu_\theta\left(X_t,t\right)=\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_\theta\left(X_t,t\right)\right)$$
Substituting this back in gives:
$$
\begin{aligned}
L_{t-1} &= E_{X_{0},\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{t}\right)-\frac{1}{\sqrt{\alpha_{t}}}\left(X_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\epsilon_{\theta}\left(X_{t},t\right)\right)\right\|^{2}\right] \\
&= E_{X_0,\epsilon}\left[\frac{\beta_t^2}{\alpha_t\left(1-\overline{\alpha}_t\right)}\left\|\epsilon_t-\epsilon_\theta\left(X_t,t\right)\right\|^2\right] \\
&\propto E_{X_{0},\epsilon}\left[\left\|\epsilon_{t}-\epsilon_{\theta}\left(X_{t},t\right)\right\|^{2}\right]
\end{aligned}
$$
Finally, substituting $X_t=\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\,\epsilon_t$ into the expression above:
$$L_{t-1}=E_{X_0,\epsilon}\left[\left\|\epsilon_t-\epsilon_\theta\left(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\,\epsilon_t,\ t\right)\right\|^2\right]$$
Here $\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\,\epsilon_t$ is simply the input data with Gaussian noise added, and $\epsilon_\theta\left(\sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}\,\epsilon_t,\ t\right)$ is a noise-prediction network that takes this noised input together with $t$ and outputs an estimate of the noise. So what the DDPM network actually does is estimate the noise that was added during the diffusion process.
In summary, only the $L_{t-1}$ terms need to be optimized. After all this derivation, the DDPM loss function is simply the $L_{t-1}$ above, i.e. a plain L2 loss.
5. Training and Inference
1. Training Procedure
- Input data: sample a clean data point $x_0$ from the dataset.
- Forward diffusion:
  - Progressively add Gaussian noise to $x_0$, producing a sequence of noisy samples $x_1, x_2, \dots, x_T$.
  - Each step follows
    $$x_t = \sqrt{1 - \beta_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)$$
    where $\beta_t$ is the noise-schedule parameter.
- Noise prediction:
  - For each timestep $t$, the model $N_\theta(x_t, t)$ predicts the noise $\epsilon_t$ that was added to $x_t$.
- Loss computation:
  - The loss is the mean squared error (MSE) between the predicted noise $N_\theta(x_t, t)$ and the actual noise $\epsilon_t$:
    $$\mathcal{L}(\theta) = \mathbb{E}_{t, x_0, \epsilon_t} \left[ \left\| \epsilon_t - N_\theta(x_t, t) \right\|^2 \right]$$
- Parameter update:
  - Update the model parameters $\theta$ by backpropagation and gradient descent to minimize the loss. An end-to-end sketch of this training loop follows the list.
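To tie the pieces together, here is a minimal, self-contained training sketch in PyTorch. The two-layer MLP, the random stand-in "dataset", and the linear $\beta$ schedule are all illustrative assumptions (DDPM itself trains a time-conditioned U-Net on images); only the loss line is the objective derived above:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t

class EpsModel(nn.Module):
    """Hypothetical toy noise predictor; real DDPM uses a U-Net."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(),
                                 nn.Linear(128, dim))

    def forward(self, x, t):
        # Crude time conditioning: append the normalized timestep as a feature.
        return self.net(torch.cat([x, t.float().unsqueeze(-1) / T], dim=-1))

model = EpsModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x0 = torch.randn(64, 2)                          # stand-in for a real data batch
    t = torch.randint(0, T, (64,))                   # random timestep per sample
    eps = torch.randn_like(x0)                       # the noise the model must recover
    abar = alpha_bars[t].unsqueeze(-1)
    xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps  # closed-form forward sample
    loss = ((eps - model(xt, t)) ** 2).mean()        # the simplified L2 objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```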
2. Inference (Generation) Procedure
- Initialization:
  - Sample random noise from the standard normal distribution: $x_T \sim \mathcal{N}(0, I)$.
- Reverse denoising:
  - Starting from $t = T$, denoise step by step to produce $x_{T-1}, x_{T-2}, \dots, x_0$.
  - Each step follows
    $$x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} N_\theta(x_t, t) \right) + \sqrt{\beta_t} \cdot z, \quad z \sim \mathcal{N}(0, I)$$
    where:
    - $\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)$ is the cumulative noise-schedule product;
    - $N_\theta(x_t, t)$ is the noise predicted by the model;
    - $z$ is fresh noise that keeps sampling stochastic.
- Output:
  - At $t = 0$, the generated sample $x_0$ is obtained. A minimal sampler implementing this loop is sketched below.
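A matching sampler, reusing `model`, `betas`, `alpha_bars`, and `T` from the training sketch above. Following the DDPM paper's sampling algorithm, no noise is added at the final step; everything else mirrors the update rule in the list:

```python
@torch.no_grad()
def sample(model, n=16, dim=2):
    """Ancestral sampling: start from pure noise and apply the reverse update T times."""
    x = torch.randn(n, dim)                          # x_T ~ N(0, I)
    for ti in reversed(range(T)):
        beta, abar = betas[ti], alpha_bars[ti]
        eps_hat = model(x, torch.full((n,), ti))     # predicted noise at step ti
        mean = (x - beta / (1 - abar).sqrt() * eps_hat) / (1 - beta).sqrt()
        z = torch.randn_like(x) if ti > 0 else torch.zeros_like(x)
        x = mean + beta.sqrt() * z                   # sigma_t^2 = beta_t variance choice
    return x                                         # approximate draws of x_0

samples = sample(model)                              # 16 two-dimensional samples
```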
3. Training vs. Inference
| Step | Training | Inference |
|---|---|---|
| Input | Clean data $x_0$ | Random noise $x_T \sim \mathcal{N}(0, I)$ |
| Process | Forward diffusion (add noise) + model predicts noise + compute loss | Reverse denoising (step-by-step generation) |
| Objective | Minimize the gap between predicted and actual noise | Generate high-quality data from noise |
| Timesteps | From $t = 1$ to $t = T$ | From $t = T$ to $t = 0$ |
| Model's role | Predict the noise $\epsilon_t$ added at each step | Predict the noise $\epsilon_t$ at each step, used for denoising |
| Output | Updated model parameters $\theta$ | Generated data $x_0$ |