Summary
Image-to-Image Translation with Conditional Adversarial Networks proposes a GAN-based method for generating images from paired training data.
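Content
The network structure is as follows:
G is trained on paired images {edge, photo}; the goal is a G that can generate a photo from an edge map.
D must try to tell fake images (that is, G(x)) from real images: for a real image, D(real img) should be close to 1, and for a fake image, D(fake img) should be close to 0. Unlike an ordinary GAN, however, this D takes two inputs: the edge image together with either the fake image or the real image (see the figure above).
The objective defined in the paper has two parts: a GAN loss and an L1 loss.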
$$
\mathcal{L}_{GAN}(G, D)=\mathbb{E}_{y}[\log D(y)]+\mathbb{E}_{x, z}[\log (1-D(G(x, z)))]
$$

$$
\mathcal{L}_{L1}(G)=\mathbb{E}_{x, y, z}\left[\|y-G(x, z)\|_{1}\right]
$$
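Here x is the input edge map and z is random noise. Without z, G can still learn a mapping from x to y, but it will only produce deterministic outputs. In the end, the random noise is introduced through dropout layers in the model G. For the alternative of generating random numbers directly as noise, see this code snippet.
As the two losses above show, training D only involves the GAN loss, whereas training G has to take both the GAN loss and the L1 loss into account.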
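As a rough illustration of which terms each network sees, the sketch below evaluates the two losses on made-up numbers. The discriminator outputs, the pixel values and the weight `lam` are all invented for the example, a single sample stands in for each expectation, and the L1 term is averaged over pixels rather than summed. The generator objective keeps only the terms that depend on G.

```python
import math

def gan_loss(d_real, d_fake):
    """log D(y) + log(1 - D(G(x, z))) for one sample (D tries to maximize this)."""
    return math.log(d_real) + math.log(1.0 - d_fake)

def l1_loss(y, g_out):
    """Mean absolute difference between target pixels and generated pixels."""
    return sum(abs(a - b) for a, b in zip(y, g_out)) / len(y)

# Made-up values: D is fairly sure about both the real and the fake image.
d_real, d_fake = 0.9, 0.2
y = [0.0, 0.5, 1.0]       # target photo pixels (hypothetical)
g_out = [0.1, 0.4, 0.7]   # generator output pixels (hypothetical)
lam = 100.0               # assumed weight on the L1 term

# D only looks at the GAN term.
d_objective = gan_loss(d_real, d_fake)
# G looks at its part of the GAN term plus the weighted L1 term.
g_objective = math.log(1.0 - d_fake) + lam * l1_loss(y, g_out)
print(d_objective, g_objective)
```

With a large `lam`, the L1 term dominates the generator objective in this toy setting, which is one way to see why the output is pulled toward the paired target.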
Combined, where $\mathcal{L}_{cGAN}$ is the conditional form of the GAN loss in which D also receives x:

$$
G^{*}=\arg \min _{G} \max _{D} \mathcal{L}_{cGAN}(G, D)+\lambda \mathcal{L}_{L1}(G)
$$
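This formula can look complicated at first glance; reading it alongside the code may make it easier to follow.
In other words: when training G, G should make the expression above as small as possible.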
For G, only the terms that depend on G remain:

$$
\mathcal{L}_{G}(G, D)=\mathbb{E}_{x, z}[\log (1-D(G(x, z)))]+\lambda\, \mathbb{E}_{x, y, z}\left[\|y-G(x, z)\|_{1}\right]
$$
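Code:
# Update G network: maximize log(D(x, G(x, z))) - lambda1 * L1(y, G(x, z))
# Excerpt from the training loop; netG, netD, real_in, real_out, ctx and lambda1 are defined elsewhere.
from mxnet import autograd, nd
with autograd.record():  # backward() needs the forward pass to be recorded
    fake_out = netG(real_in)
    fake_concat = nd.concat(real_in, fake_out, dim=1)  # D sees the (edge, fake photo) pair
    output = netD(fake_concat)
    real_label = nd.ones(output.shape, ctx=ctx)  # G wants D to answer "real"
    errG = GAN_loss(output, real_label) + L1_loss(real_out, fake_out) * lambda1
    errG.backward()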
Note that the original paper says:
As suggested in the original GAN paper, rather than training G to minimize log(1 − D(x, G(x, z)), we instead train to maximize log D(x, G(x, z))
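A small numerical sketch of why that switch helps, assuming D ends in a sigmoid so that D = sigmoid(a) for a raw score a (the function names and the sample scores are made up for illustration). It compares the gradient that reaches the raw score under each form of the generator loss.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the loss G minimizes in the original form."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the loss G minimizes when it maximizes log D."""
    return -(1.0 - sigmoid(a))

# a is D's pre-sigmoid score for a generated image. A very negative score means
# D confidently calls it fake, which is typical early in training.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(a, round(grad_saturating(a), 4), round(grad_non_saturating(a), 4))
```

When D rejects the fake with high confidence, the first gradient is close to zero while the second stays close to -1, so the generator keeps receiving a useful training signal.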
For D, which tries to make the expression as large as possible:

$$
\mathcal{L}_{D}(G, D)=\mathbb{E}_{y}[\log D(y)]+\mathbb{E}_{x, z}[\log (1-D(G(x, z)))]
$$
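Code:
# Update D network: maximize log(D(x, y)) + log(1 - D(x, G(x, z)))
# Excerpt from the training loop; fake_concat is the (edge, fake photo) pair built from G's output.
from mxnet import autograd, nd
with autograd.record():  # backward() needs the forward pass to be recorded
    # Train with fake image
    output = netD(fake_concat)
    fake_label = nd.zeros(output.shape, ctx=ctx)
    errD_fake = GAN_loss(output, fake_label)
    # Train with real image
    real_concat = nd.concat(real_in, real_out, dim=1)
    output = netD(real_concat)
    real_label = nd.ones(output.shape, ctx=ctx)
    errD_real = GAN_loss(output, real_label)
    errD = (errD_real + errD_fake) * 0.5  # average the real and fake halves
    errD.backward()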
The losses are defined as follows:
from mxnet import gluon
GAN_loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
L1_loss = gluon.loss.L1Loss()
SigmoidBinaryCrossEntropyLoss is defined as:

$$
L=-\sum_{i}\left(\text{label}_{i} \cdot \log \left(\text{pred}_{i}\right)+\left(1-\text{label}_{i}\right) \cdot \log \left(1-\text{pred}_{i}\right)\right)
$$
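The formula can be checked by hand with a few lines of plain Python. This is only a sketch of the equation itself, not of the Gluon implementation: the helper names and scores are invented, and as far as I recall the Gluon loss applies the sigmoid to raw scores itself by default and averages instead of summing, so treat those details as assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def binary_cross_entropy(preds, labels):
    """-sum_i(label_i*log(pred_i) + (1-label_i)*log(1-pred_i)), preds in (0, 1)."""
    return -sum(
        l * math.log(p) + (1.0 - l) * math.log(1.0 - p)
        for p, l in zip(preds, labels)
    )

logits = [2.0, -1.0, 0.0]               # hypothetical raw discriminator scores
preds = [sigmoid(a) for a in logits]    # squash to probabilities first
real_label = [1.0, 1.0, 1.0]            # target used when updating G
fake_label = [0.0, 0.0, 0.0]            # target used for fake images when updating D

print(binary_cross_entropy(preds, real_label))  # reduces to -sum(log(pred_i))
print(binary_cross_entropy(preds, fake_label))  # reduces to -sum(log(1 - pred_i))
```

With an all-ones label the loss reduces to -log D, and with an all-zeros label to -log(1 - D), which is how one loss object serves both the G update and the D update in the snippets above.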
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
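This paper mainly presents a method for training GANs with unpaired data.
The network is somewhat more complex than pix2pix: it has two generators, G and F, and two discriminators, D_X and D_Y.
Its loss function is defined as: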
$$
\begin{aligned} \mathcal{L}\left(G, F, D_{X}, D_{Y}\right) &=\mathcal{L}_{\mathrm{GAN}}\left(G, D_{Y}, X, Y\right) \\ &+\mathcal{L}_{\mathrm{GAN}}\left(F, D_{X}, Y, X\right) \\ &+\lambda \mathcal{L}_{\mathrm{cyc}}(G, F) \end{aligned}
$$
The first two terms are the usual GAN losses:

$$
\begin{aligned} \mathcal{L}_{\mathrm{GAN}}\left(G, D_{Y}, X, Y\right) &=\mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\log D_{Y}(y)\right] \\ &+\mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log \left(1-D_{Y}(G(x))\right)\right] \end{aligned}
$$
The cycle consistency loss is defined as:

$$
\begin{aligned} \mathcal{L}_{\mathrm{cyc}}(G, F) &=\mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\|F(G(x))-x\|_{1}\right] \\ &+\mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\|G(F(y))-y\|_{1}\right] \end{aligned}
$$
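To make the cycle term concrete, here is a toy sketch in which the two generators are simple hand-written functions on 2-pixel "images" rather than networks. All names and values are hypothetical, and the L1 norm is averaged over pixels.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)

def cycle_loss(G, F, xs, ys):
    """Mean ||F(G(x)) - x||_1 over xs plus mean ||G(F(y)) - y||_1 over ys."""
    forward = sum(l1(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward

def G(x):
    """Toy mapping X -> Y."""
    return [2.0 * v for v in x]

def F_exact(y):
    """Toy mapping Y -> X that inverts G exactly."""
    return [0.5 * v for v in y]

def F_off(y):
    """Toy mapping Y -> X that is slightly wrong."""
    return [0.5 * v + 0.1 for v in y]

xs = [[1.0, 2.0], [0.0, 4.0]]   # unpaired samples from domain X
ys = [[2.0, 6.0]]               # unpaired samples from domain Y
print(cycle_loss(G, F_exact, xs, ys))  # 0.0, both cycles return to the start
print(cycle_loss(G, F_off, xs, ys))    # positive, the cycles drift
```

Note that xs and ys never need to be matched up: each sample is only compared with its own reconstruction, which is what lets the method work on unpaired data.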
The final objective to optimize:

$$
G^{*}, F^{*}=\arg \min _{G, F} \max _{D_{X}, D_{Y}} \mathcal{L}\left(G, F, D_{X}, D_{Y}\right)
$$
In the actual code, however, the authors made the following change to improve training stability:
In particular, for a GAN loss $\mathcal{L}_{\mathrm{GAN}}(G, D, X, Y)$, we train the G to minimize $\mathbb{E}_{x \sim p_{\text{data}}(x)}\left[(D(G(x))-1)^{2}\right]$ and train the D to minimize $\mathbb{E}_{y \sim p_{\text{data}}(y)}\left[(D(y)-1)^{2}\right]+\mathbb{E}_{x \sim p_{\text{data}}(x)}\left[D(G(x))^{2}\right]$
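The two least-squares expressions in the quote can be written out directly. The sketch below uses invented discriminator outputs and replaces each expectation with a batch mean; the function names are not taken from the authors' code.

```python
def lsgan_g_loss(d_fake):
    """Generator loss: mean of (D(G(x)) - 1)^2 over a batch of D outputs."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)

def lsgan_d_loss(d_real, d_fake):
    """Discriminator loss: mean (D(y) - 1)^2 plus mean D(G(x))^2."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term

d_real = [0.9, 0.8]   # hypothetical D outputs on real images
d_fake = [0.1, 0.3]   # hypothetical D outputs on generated images
print(lsgan_g_loss(d_fake), lsgan_d_loss(d_real, d_fake))
```

Both losses are plain squared distances to the targets 1 and 0, so they stay finite even when D is very confident, unlike the log terms in the original objective.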
For style-transfer tasks, the authors use the FID score as the evaluation metric; for segmentation tasks, they use metrics such as pixel accuracy and IoU.
Online Multi-Granularity Distillation for GAN Compression
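This paper mainly describes how to compress GANs using knowledge distillation.
The authors use two teacher networks (one deeper, one wider) to train the student network at the same time. The images generated by the two teachers are judged by D, but the student never interacts with D directly; its training is driven solely by the two teachers.
The overall network structure is shown below:
- Distillation loss for the student network: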
$$
\begin{aligned} \mathcal{L}_{KD}\left(p_{t}, p_{s}\right)=& \lambda_{SSIM} \mathcal{L}_{SSIM}+\lambda_{\text{feature}} \mathcal{L}_{\text{feature}} \\ &+\lambda_{\text{style}} \mathcal{L}_{\text{style}}+\lambda_{TV} \mathcal{L}_{TV} \end{aligned}
$$
where
- $\mathcal{L}_{SSIM}$ describes how similar two images are. For two images $p_t, p_s$:
$$
\mathcal{L}_{SSIM}\left(p_{t}, p_{s}\right)=\frac{\left(2 \mu_{t} \mu_{s}+C_{1}\right)\left(2 \sigma_{ts}+C_{2}\right)}{\left(\mu_{t}^{2}+\mu_{s}^{2}+C_{1}\right)\left(\sigma_{t}^{2}+\sigma_{s}^{2}+C_{2}\right)}
$$
where $\mu_{s}, \mu_{t}$ are mean values for luminance estimation, $\sigma_{s}^{2}, \sigma_{t}^{2}$ are standard deviations for contrast, $\sigma_{ts}$ is covariance for the structural similarity estimation. $C_{1}, C_{2}$ are constants to avoid zero denominator
- Feature loss; the paper uses VGG to extract the features:

$$
\mathcal{L}_{\text{feature}}\left(p_{t}, p_{s}\right)=\frac{1}{C_{j} H_{j} W_{j}}\left\|\phi_{j}\left(p_{t}\right)-\phi_{j}\left(p_{s}\right)\right\|_{1}
$$
where $\phi_{j}(x)$ is the activation of the $j$-th layer of $\phi$ for the input $x$. $C_{j} \times H_{j} \times W_{j}$ is the dimensions of $\phi_{j}(x)$.
- Style loss:

$$
\mathcal{L}_{\text{style}}\left(p_{t}, p_{s}\right)=\left\|G_{j}^{\phi}\left(p_{t}\right)-G_{j}^{\phi}\left(p_{s}\right)\right\|_{1}
$$
where $G_{j}^{\phi}(x)$ is the Gram matrix of the $j$-th layer activation in the VGG network.
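The three terms defined so far can be sketched in plain Python on tiny inputs. This is an illustration under several assumptions, not the paper's implementation: SSIM is computed from global image statistics instead of a sliding window and uses the standard denominator $\mu_t^2+\mu_s^2+C_1$, the constants `c1` and `c2` are common defaults for pixel values in [0, 1], the feature maps are hand-written lists rather than VGG activations, and the Gram matrix is left unnormalized.

```python
def mean(v):
    return sum(v) / len(v)

def ssim(p_t, p_s, c1=1e-4, c2=9e-4):
    """SSIM from global statistics of two flattened images (no sliding window)."""
    mu_t, mu_s = mean(p_t), mean(p_s)
    var_t = mean([(a - mu_t) ** 2 for a in p_t])
    var_s = mean([(b - mu_s) ** 2 for b in p_s])
    cov = mean([(a - mu_t) * (b - mu_s) for a, b in zip(p_t, p_s)])
    num = (2 * mu_t * mu_s + c1) * (2 * cov + c2)
    den = (mu_t ** 2 + mu_s ** 2 + c1) * (var_t + var_s + c2)
    return num / den

def feature_loss(f_t, f_s):
    """L1 distance between feature maps given as [C][H*W], divided by C*H*W."""
    n = sum(len(ch) for ch in f_t)
    return sum(abs(a - b) for ct, cs in zip(f_t, f_s) for a, b in zip(ct, cs)) / n

def gram(f):
    """Gram matrix: inner products between every pair of flattened channels."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in f] for ci in f]

def style_loss(f_t, f_s):
    """L1 distance between the Gram matrices of teacher and student features."""
    g_t, g_s = gram(f_t), gram(f_s)
    return sum(abs(a - b) for rt, rs in zip(g_t, g_s) for a, b in zip(rt, rs))

# Hypothetical feature maps: 2 channels, 2 spatial positions each.
teacher = [[1.0, 2.0], [0.0, 1.0]]
student = [[1.0, 1.0], [0.0, 1.0]]
print(ssim([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))   # identical images give about 1
print(feature_loss(teacher, student), style_loss(teacher, student))
```

The feature loss compares activations position by position, while the style loss compares channel correlations only, so it ignores where in the image a pattern appears.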
- Teacher network loss function
- Overall loss definition:
$$
G_{T}^{*}=\arg \min _{G_{T}} \max _{D} \mathcal{L}_{GAN}\left(G_{T}, D\right)+\mathcal{L}_{Recon}\left(G_{T}\right)
$$
where the usual GAN loss is:

$$
\begin{aligned} \mathcal{L}_{GAN}\left(G_{T}, D\right)=& \mathbb{E}_{x, y}[\log D(x, y)] \\ &+\mathbb{E}_{x}\left[\log \left(1-D\left(x, G_{T}(x)\right)\right)\right] \end{aligned}
$$

- Reconstruction (recon) loss:

$$
\mathcal{L}_{Recon}\left(G_{T}\right)=\mathbb{E}_{x, y}\left[\left\|y-G_{T}(x)\right\|_{1}\right]
$$
The loss for the whole training process is defined as:

$$
\mathcal{L}\left(G_{T}^{W}, G_{T}^{D}, G_{S}\right)=\lambda_{CD} \mathcal{L}_{CD}\left(G_{T}^{W}, G_{S}\right)+\mathcal{L}_{KD_{multi}}\left(p_{t}^{w}, p_{t}^{d}, p_{s}\right)
$$
where $\mathcal{L}_{KD}$ is defined as:

$$
\begin{aligned} \mathcal{L}_{KD}\left(p_{t}, p_{s}\right)=& \lambda_{SSIM} \mathcal{L}_{SSIM}+\lambda_{\text{feature}} \mathcal{L}_{\text{feature}} \\ &+\lambda_{\text{style}} \mathcal{L}_{\text{style}}+\lambda_{TV} \mathcal{L}_{TV} \end{aligned}
$$
- Multiple Teachers Distillation loss:

$$
\mathcal{L}\left(G_{T}^{W}, G_{T}^{D}, G_{S}\right)=\lambda_{CD} \mathcal{L}_{CD}\left(G_{T}^{W}, G_{S}\right)+\mathcal{L}_{KD_{multi}}\left(p_{t}^{w}, p_{t}^{d}, p_{s}\right)
$$
where

$$
\mathcal{L}_{CD}\left(G_{T}^{W}, G_{S}\right)=\frac{1}{n} \sum_{i=1}^{n}\left(\frac{\sum_{j=1}^{c}\left(w_{t_{w}}^{ij}-w_{s}^{ij}\right)^{2}}{c}\right)
$$

$$
\mathcal{L}_{KD_{multi}}\left(p_{t}^{w}, p_{t}^{d}, p_{s}\right)=\mathcal{L}_{KD}\left(p_{t}^{w}, p_{s}\right)+\mathcal{L}_{KD}\left(p_{t}^{d}, p_{s}\right)
$$
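Read as code, the two definitions above combine as follows. This is a minimal sketch with invented numbers: each $w^{ij}$ is treated as an already computed weight for channel j of feature map i, `toy_kd` is a hypothetical placeholder for $\mathcal{L}_{KD}$ (a mean absolute difference, not the real four-term loss), and `lam_cd` is an assumed value.

```python
def channel_distill_loss(w_teacher, w_student):
    """Mean over n feature maps of the mean squared gap across c channel weights."""
    n = len(w_teacher)
    total = 0.0
    for wt, ws in zip(w_teacher, w_student):
        c = len(wt)
        total += sum((a - b) ** 2 for a, b in zip(wt, ws)) / c
    return total / n

def kd_multi(kd_loss, p_t_wide, p_t_deep, p_s):
    """Sum of the per-teacher distillation losses against the same student output."""
    return kd_loss(p_t_wide, p_s) + kd_loss(p_t_deep, p_s)

def total_loss(w_teacher, w_student, kd_loss, p_t_wide, p_t_deep, p_s, lam_cd=1.0):
    """lam_cd * L_CD(wide teacher, student) + L_KD_multi(both teachers, student)."""
    cd = channel_distill_loss(w_teacher, w_student)
    return lam_cd * cd + kd_multi(kd_loss, p_t_wide, p_t_deep, p_s)

def toy_kd(p_t, p_s):
    """Placeholder for L_KD: mean absolute difference between two outputs."""
    return sum(abs(a - b) for a, b in zip(p_t, p_s)) / len(p_t)

w_t = [[1.0, 0.0], [0.5, 0.5]]   # wide teacher: 2 feature maps, 2 channel weights each
w_s = [[0.0, 0.0], [0.5, 0.0]]   # student weights at the matching layers
p_w, p_d, p_s = [1.0, 1.0], [0.0, 2.0], [1.0, 2.0]   # toy outputs of both teachers and the student
print(channel_distill_loss(w_t, w_s), total_loss(w_t, w_s, toy_kd, p_w, p_d, p_s))
```

Only the wider teacher enters the channel term, while both teachers enter the output-level term, which mirrors the argument lists in the equations.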
- Evaluation metric: FID