Ch03 Improving the way neural networks learn
Online book: http://neuralnetworksanddeeplearning.com/chap3.html
Introducing the cross-entropy cost function
- Verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.
Proof: Since $\sigma(z) = \frac{1}{1+e^{-z}}$,
$$\sigma'(z) = -\frac{(1+e^{-z})'}{(1+e^{-z})^2} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}.$$
For the second factor,
$$\frac{e^{-z}}{1+e^{-z}} = \frac{1+e^{-z}-1}{1+e^{-z}} = 1 - \frac{1}{1+e^{-z}} = 1 - \sigma(z),$$
and the first factor is $\sigma(z) = \frac{1}{1+e^{-z}}$ itself,
so $\sigma'(z) = \sigma(z)(1-\sigma(z))$.
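A quick numerical spot-check of this identity (not from the book, just an illustrative NumPy sketch): compare $\sigma(z)(1-\sigma(z))$ against a central finite-difference estimate of the derivative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # The identity proved above: sigma'(z) = sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1.0 - sigmoid(z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
h = 1e-6
# Central finite-difference approximation of the derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
print(np.max(np.abs(numeric - sigmoid_prime(z))))  # ~1e-11: the two agree
```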
- One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the $y$s and the $a$s. It's easy to get confused about whether the right form is $-[y \ln a + (1-y)\ln(1-a)]$ or $-[a \ln y + (1-a)\ln(1-y)]$. What happens to the second of these expressions when $y = 0$ or $1$? Does this problem afflict the first expression? Why or why not?
Proof: With the first form,
$$C = -\frac{1}{n}\sum_x \left[ y \ln a + (1-y)\ln(1-a) \right], \quad (57)$$
the gradients are
$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x x_j(\sigma(z) - y), \quad (61)$$
$$\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x (\sigma(z) - y). \quad (62)$$
Both the cost and these gradients remain well defined whether $y = 1$ or $y = 0$, since $a = \sigma(z) \in (0,1)$ keeps $\ln a$ and $\ln(1-a)$ finite.
If instead we took
$$C = -\frac{1}{n}\sum_x \left[ a \ln y + (1-a)\ln(1-y) \right],$$
then for $y = 1$ the term $a \ln y = 0$ but $(1-a)\ln(1-y)$ is undefined, and for $y = 0$ the term $(1-a)\ln(1-y) = 0$ but $a \ln y$ is undefined.
The gradients fail in the same way:
$$\frac{\partial C}{\partial w_j} = -\frac{1}{n}\sum_x x_j \sigma'(z) \ln\frac{y}{1-y} = -\frac{1}{n}\sum_x x_j \sigma(z)(1-\sigma(z)) \ln\frac{y}{1-y},$$
and $\ln\frac{y}{1-y}$ cannot be evaluated for either $y = 0$ or $y = 1$; similarly,
$$\frac{\partial C}{\partial b} = -\frac{1}{n}\sum_x \sigma(z)(1-\sigma(z)) \ln\frac{y}{1-y}$$
cannot be evaluated. So the second expression cannot replace the first.
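The breakdown is easy to see numerically. A minimal sketch (my own illustration, with hypothetical helper names `cross_entropy` and `swapped_form`) showing that the correct form stays finite for $y \in \{0,1\}$ while the swapped form does not:

```python
import numpy as np

def cross_entropy(a, y):
    # Correct form: -[y ln a + (1-y) ln(1-a)], finite for a in (0,1) and y in {0,1}
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def swapped_form(a, y):
    # Swapped form: -[a ln y + (1-a) ln(1-y)], breaks when y is exactly 0 or 1
    return -(a * np.log(y) + (1.0 - a) * np.log(1.0 - y))

a, y = 0.8, 1.0
print(cross_entropy(a, y))  # ~0.223, a sensible finite cost
print(swapped_form(a, y))   # inf (with a divide-by-zero warning): ln(1 - y) = ln 0
```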
- In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if $\sigma(z) \approx y$ for all training inputs. The argument relied on $y$ being equal to either $0$ or $1$. This is usually true in classification problems, but for other problems (e.g., regression problems) $y$ can sometimes take values intermediate between $0$ and $1$. Show that the cross-entropy is still minimized when $\sigma(z) = y$ for all training inputs. When this is the case the cross-entropy has the value:
$$C = -\frac{1}{n}\sum_x \left[ y \ln y + (1-y)\ln(1-y) \right]. \quad (64)$$
The quantity $-[y \ln y + (1-y)\ln(1-y)]$ is sometimes known as the binary entropy.
Proof: In the regression setting the (cross-entropy) cost is still
$$C = -\frac{1}{n}\sum_x \left[ y \ln a + (1-y)\ln(1-a) \right]. \quad (57)$$
Since $a \in (0,1)$, we have $\ln a < 0$ and $\ln(1-a) < 0$, and with $y \in (0,1)$ it follows that $C > 0$; moreover $C \to \infty$ as $a \to 0$ (or $a \to 1$), so $C$ has no maximum. Differentiating with respect to $a$,
$$\frac{\partial C}{\partial a} = -\frac{1}{n}\sum_x \left[ \frac{y}{a} - \frac{1-y}{1-a} \right] = \frac{1}{n}\sum_x \frac{a - y}{a(1-a)}.$$
Since $a \in (0,1)$, we have $a(1-a) > 0$. For $a > y$, $\frac{\partial C}{\partial a} > 0$ and $C$ is increasing in $a$; for $a < y$, $\frac{\partial C}{\partial a} < 0$ and $C$ is decreasing in $a$. Hence $C$ is minimized at $a = y$, i.e. at $\sigma(z) = y$ for every training input, and substituting $a = y$ into (57) gives (64). So in the regression setting, too, training remains a minimization problem.
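This can also be checked by brute force: scan $C(a)$ over a grid of $a$ values for a fixed intermediate target $y$ and confirm the minimum sits at $a = y$ with value equal to the binary entropy (a minimal sketch, assuming a single neuron and a single training example):

```python
import numpy as np

def cost(a, y):
    # Per-example cross-entropy with a target y that may lie strictly between 0 and 1
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

y = 0.3                                   # an illustrative intermediate target
a = np.linspace(0.001, 0.999, 999)        # grid of candidate activations
costs = cost(a, y)

binary_entropy = -(y * np.log(y) + (1.0 - y) * np.log(1.0 - y))
print(a[np.argmin(costs)])                # ~0.3: the minimizer is a = y
print(costs.min(), binary_entropy)        # both ~0.611: the minimum equals the binary entropy
```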
- Many-layer multi-neuron networks. In the notation introduced in the last chapter, show that for the quadratic cost the partial derivative with respect to weights in the output layer is
$$\frac{\partial C}{\partial w_{jk}^L} = \frac{1}{n}\sum_x a_k^{L-1}(a_j^L - y_j)\,\sigma'(z_j^L). \quad (65)$$
The term $\sigma'(z_j^L)$ causes a learning slowdown whenever an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error $\delta^L$ for a single training example $x$ is given by
$$\delta^L = a^L - y. \quad (66)$$
Use this expression to show that the partial derivative with respect to the weights in the output layer is given by
$$\frac{\partial C}{\partial w_{jk}^L} = \frac{1}{n}\sum_x a_k^{L-1}(a_j^L - y_j). \quad (67)$$
The $\sigma'(z_j^L)$ term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you should work through that analysis as well.
Proof: Recall the backpropagation equations from the last chapter: (BP1) $\delta_j^L = \frac{\partial C}{\partial a_j^L}\,\sigma'(z_j^L)$, (BP2) $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$, (BP3) $\frac{\partial C}{\partial b_j^l} = \delta_j^l$, and (BP4) $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$.
For the quadratic cost $C = \frac{1}{2n}\sum_x\sum_j (a_j^L - y_j)^2$, (BP1) applied to a single training example gives
$$\delta_j^L = \frac{\partial C}{\partial a_j^L}\,\sigma'(z_j^L) = (a_j^L - y_j)\,\sigma'(z_j^L),$$
and combining with (BP4) and averaging over the $n$ training examples yields
$$\frac{\partial C}{\partial w_{jk}^L} = \frac{1}{n}\sum_x a_k^{L-1}(a_j^L - y_j)\,\sigma'(z_j^L). \quad (65)$$
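To see the learning slowdown concretely, here is a minimal numerical sketch (my own illustration, assuming a single sigmoid output neuron that has saturated on the wrong value): the quadratic-cost error carries the $\sigma'(z)$ factor from (65), while the cross-entropy error (66) does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# A badly saturated output neuron: z is large and positive so a ~ 1,
# while the desired output is y = 0.
z, y = 5.0, 0.0
a = sigmoid(z)

delta_quadratic = (a - y) * sigmoid_prime(z)  # quadratic cost: sigma'(z) factor present
delta_cross_entropy = a - y                   # cross-entropy cost, eq. (66): no sigma'(z) factor

print(delta_quadratic)      # ~0.0066: tiny error signal, learning stalls
print(delta_cross_entropy)  # ~0.993: large error signal, the neuron corrects itself quickly
```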