$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log(D(x))\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log(1 - D(G(z)))\right] \tag{1}$$
where:
$$\max_D V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log(D(x))\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log(1 - D(G(z)))\right] \tag{2}$$
which is equivalent to
$$\min_D V(D,G) = -\int_x P_{data}(x)\log(D(x))\,dx - \int_z P_z(z)\log(1 - D(G(z)))\,dz$$
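As a concrete illustration (not part of the original derivation), the two expectations in (1) and (2) are estimated from minibatches in practice. The minimal NumPy sketch below uses a hypothetical toy discriminator `D` and toy generator `G` (both assumptions, stand-ins for real networks) to form a Monte Carlo estimate of $V(D,G)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy models (placeholders for real networks):
# D maps a sample to a probability in (0, 1); G maps a latent z to a fake sample.
def D(x):
    return 1.0 / (1.0 + np.exp(-x))   # toy discriminator: sigmoid of the input

def G(z):
    return 2.0 * z + 1.0              # toy generator: a fixed affine map

# Minibatches: x ~ p_data (a Gaussian placeholder here), z ~ p_z (standard normal)
x_real = rng.normal(loc=1.0, scale=0.5, size=128)
z = rng.normal(size=128)

# Monte Carlo estimate of V(D, G) = E_{p_data}[log D(x)] + E_{p_z}[log(1 - D(G(z)))]
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
print("estimated V(D, G):", V)
```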
In practice, equation (1) may not provide a sufficient gradient for G to learn well. Early in training, when G is still poor, D can reject the generated samples with high confidence because they are clearly different from the real training data. In this case, $\log(1 - D(G(z)))$ saturates. Rather than training G to minimize $\log(1 - D(G(z)))$, we can instead train G to maximize $\log D(G(z))$. We can see this from the figure below:
Clearly, the derivative of $\log(1 - D(G(x)))$ goes from small to large in magnitude (as $D(G(x))$ increases), which means the updates are slow at the start of training (training with mini-batch SGD). In contrast, the derivative of $\log D(G(z))$ goes from large to small, which matches what we actually want during training: large gradients at the start, so updates are fast, and smaller and smaller update steps as we approach the optimum.
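The same comparison can be checked numerically. The short sketch below (an illustrative addition, not from the original post) prints the derivative of each generator objective with respect to $D(G(z))$: for the saturating loss $\log(1 - D(G(z)))$ the magnitude $1/(1-D)$ is small when $D(G(z)) \approx 0$ (early training), while for $\log D(G(z))$ the magnitude $1/D$ is large there:

```python
# d/dD [log(1 - D)] = -1 / (1 - D): small magnitude when D is near 0 (saturates early on)
# d/dD [log D]      =  1 / D      : large magnitude when D is near 0 (strong early signal)
for d in [0.01, 0.1, 0.5, 0.9, 0.99]:
    grad_saturating = -1.0 / (1.0 - d)
    grad_non_saturating = 1.0 / d
    print(f"D(G(z)) = {d:<5}  d/dD log(1-D) = {grad_saturating:8.2f}   d/dD log D = {grad_non_saturating:8.2f}")
```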
2. Derivation of the Global Optimum
Proposition 1
Fixing the generator G, we consider the optimal discriminator $D^*$:
$$
\begin{aligned}
V(G,D) &= \int_x P_{data}(x)\log(D(x))\,dx + \int_z P_z(z)\log(1 - D(G(z)))\,dz \\
&= \int_x \Big[ P_{data}(x)\log(D(x)) + P_g(x)\log(1 - D(x)) \Big]\,dx
\end{aligned}
$$
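The second equality uses a change of variables: sampling $z \sim p_z(z)$ and mapping it through $G$ is the same as sampling $x \sim p_g(x)$, the distribution induced by the generator. Restating that step:

$$\int_z P_z(z)\log(1 - D(G(z)))\,dz = \mathbb{E}_{z \sim p_z}\left[\log(1 - D(G(z)))\right] = \mathbb{E}_{x \sim p_g}\left[\log(1 - D(x))\right] = \int_x P_g(x)\log(1 - D(x))\,dx$$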
To maximize this integral, we want the integrand to be as large as possible for each given $x$; that is, we want to find the optimal $D^*$ that maximizes the following expression:
$$f(D(x)) = P_{data}(x)\log(D(x)) + P_g(x)\log(1 - D(x))$$
Taking the derivative with respect to $D(x)$:
$$f'(D(x)) = \frac{P_{data}(x)}{D(x)} - \frac{P_g(x)}{1 - D(x)}$$
Setting this derivative to zero and rearranging gives:
$$D^*(x) = \frac{P_{data}(x)}{P_{data}(x) + P_g(x)}$$
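As a quick sanity check (an illustrative sketch, not part of the original derivation), fix the pointwise densities $a = P_{data}(x)$ and $b = P_g(x)$ at some $x$; the maximizer of $f(D) = a\log D + b\log(1-D)$ found by a grid search should agree with the closed form $a/(a+b)$:

```python
import numpy as np

# Fix pointwise densities a = P_data(x), b = P_g(x) at some x (illustrative values)
a, b = 0.7, 0.3

# Evaluate f(D) = a*log(D) + b*log(1 - D) on a fine grid over (0, 1)
D = np.linspace(1e-6, 1 - 1e-6, 100_000)
f = a * np.log(D) + b * np.log(1.0 - D)

print("grid argmax :", D[np.argmax(f)])
print("a / (a + b) :", a / (a + b))   # closed-form optimum D*(x)
```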
If and only if $p_g = p_{data}$, the virtual training criterion $C(G) = \max_D V(G,D)$ attains the value $-\log 4$.

Proof: Substituting

$$D^*(x) = \frac{P_{data}(x)}{P_{data}(x) + P_g(x)}$$

into equation (2), we have:
$$
\begin{aligned}
C(G) &= V(G, D^*) \\
&= \mathbb{E}_{x \sim p_{data}(x)}\left[\log(D^*(x))\right] + \mathbb{E}_{x \sim p_g}\left[\log(1 - D^*(x))\right] \\
&= \int_x P_{data}(x)\log\!\left(\frac{P_{data}(x)}{P_{data}(x) + P_g(x)}\right) dx + \int_x P_g(x)\log\!\left(\frac{P_g(x)}{P_{data}(x) + P_g(x)}\right) dx \\
&= \left\{ \int_x P_{data}(x)\log\!\left(\frac{P_{data}(x)}{P_{data}(x) + P_g(x)}\right) dx + \log 2 \right\} + \left\{ \int_x P_g(x)\log\!\left(\frac{P_g(x)}{P_{data}(x) + P_g(x)}\right) dx + \log 2 \right\} - 2\log 2 \\
&= \int_x P_{data}(x)\log\!\left(\frac{2P_{data}(x)}{P_{data}(x) + P_g(x)}\right) dx + \int_x P_g(x)\log\!\left(\frac{2P_g(x)}{P_{data}(x) + P_g(x)}\right) dx - \log 4 \\
&= KL\!\left(P_{data}(x) \,\Big\|\, \frac{P_{data}(x) + P_g(x)}{2}\right) + KL\!\left(P_g(x) \,\Big\|\, \frac{P_{data}(x) + P_g(x)}{2}\right) - \log 4 \\
&= -\log 4 + 2\,JSD\!\left(P_{data}(x) \,\|\, P_g(x)\right)
\end{aligned}
$$
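To make the identity concrete, the sketch below (illustrative values, two discrete distributions) checks numerically that $KL(P\,\|\,M) + KL(Q\,\|\,M) - \log 4$ with $M = (P+Q)/2$ equals $-\log 4 + 2\,JSD(P\,\|\,Q)$, and that it collapses to $-\log 4$ when the two distributions coincide:

```python
import numpy as np

def kl(p, q):
    # KL divergence for discrete distributions (natural log)
    return np.sum(p * np.log(p / q))

def c_of_g(p_data, p_g):
    # C(G) = KL(p_data || m) + KL(p_g || m) - log 4, with m = (p_data + p_g) / 2
    m = 0.5 * (p_data + p_g)
    return kl(p_data, m) + kl(p_g, m) - np.log(4)

p_data = np.array([0.1, 0.4, 0.5])
p_g    = np.array([0.3, 0.3, 0.4])

m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print("C(G)                 :", c_of_g(p_data, p_g))
print("-log 4 + 2*JSD       :", -np.log(4) + 2 * jsd)
print("C(G) when p_g=p_data :", c_of_g(p_data, p_data))   # should be -log 4 ≈ -1.386
```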
Since the JS divergence is non-negative and bounded (it is $0$ when the two distributions $P_1$ and $P_2$ are identical, and reaches its maximum, $\log 2$, when they do not overlap at all), $C(G)$ attains its minimum value $-\log 4$ when $p_g = p_{data}$.
We can also understand it this way: this max value is just the JS divergence plus a constant ($-\log 4$), so it measures how different $P_{data}(x)$ and $P_g(x)$ are. Therefore, at this point, we only need to take: