Neural Networks and Deep Learning 3

This post looks at how to improve the way neural networks learn, focusing on the cross-entropy cost function and how it addresses the learning-slowdown problem. It covers where the cross-entropy comes from, what it does, and how it interacts with gradient descent, and explains why weight initialization matters. It also analyzes the role of weight decay (regularization) in preventing overfitting, and shows how these ideas apply to the handwritten-digit recognition task.


Ch03 Improving the way neural networks learn

Online book: http://neuralnetworksanddeeplearning.com/chap3.html

Contents

Introducing the cross-entropy cost function
  • Verify that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.

    Proof: Since $\sigma(z) = \frac{1}{1+e^{-z}}$,

    $$\sigma'(z) = -\frac{(1+e^{-z})'}{(1+e^{-z})^2} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}$$

    where $\frac{e^{-z}}{1+e^{-z}} = \frac{1+e^{-z}-1}{1+e^{-z}} = 1-\frac{1}{1+e^{-z}}$.

    Since $\sigma(z) = \frac{1}{1+e^{-z}}$, it follows that $\sigma'(z) = \sigma(z)(1-\sigma(z))$.
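
    As a sanity check, the identity can also be verified numerically. This is a minimal sketch of my own (not code from the book), comparing the analytic derivative against a finite-difference estimate:

    ```python
    import numpy as np

    def sigma(z):
        """Logistic sigmoid: sigma(z) = 1 / (1 + e^{-z})."""
        return 1.0 / (1.0 + np.exp(-z))

    # Compare the analytic derivative sigma(z)*(1 - sigma(z)) with a
    # central finite-difference approximation at a few sample points.
    z = np.linspace(-5.0, 5.0, 11)
    h = 1e-6
    numeric = (sigma(z + h) - sigma(z - h)) / (2 * h)
    analytic = sigma(z) * (1 - sigma(z))

    # The two agree to roughly 1e-10, which is just finite-difference noise.
    print(np.max(np.abs(numeric - analytic)))
    ```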

  • One gotcha with the cross-entropy is that it can be difficult at first to remember the respective roles of the $y$s and the $a$s. It's easy to get confused about whether the right form is $-[y \ln a + (1-y)\ln(1-a)]$ or $-[a \ln y + (1-a)\ln(1-y)]$. What happens to the second of these expressions when $y = 0$ or $1$? Does this problem afflict the first expression? Why or why not?

    Proof: With the cross-entropy cost

    $$C = -\frac{1}{n}\sum_x\left[y\ln a + (1 - y) \ln(1 - a) \right] \qquad (57)$$

    the gradients are

    $$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x x_j(\sigma(z)-y) \qquad (61)$$

    $$\frac{\partial C}{\partial b} = \frac{1}{n}\sum_x (\sigma(z)-y) \qquad (62)$$

    which are straightforward to evaluate whether $y = 1$ or $y = 0$.

    With the swapped form

    $$C = -\frac{1}{n}\sum_x\left[a\ln y + (1-a)\ln(1-y)\right]$$

    when $y = 1$ the term $a\ln y = 0$ but $(1-a)\ln(1-y)$ is undefined, and when $y = 0$ the term $(1-a)\ln(1-y) = 0$ but $a\ln y$ is undefined.

    Moreover, for this form

    $$\frac{\partial C}{\partial w_j} = -\frac{1}{n}\sum_x x_j\,\sigma'(z)\ln\frac{y}{1-y} = -\frac{1}{n}\sum_x x_j\,\sigma(z)(1-\sigma(z))\ln\frac{y}{1-y}$$

    where $\ln\frac{y}{1-y}$ cannot be evaluated for either $y = 0$ or $y = 1$; likewise $\frac{\partial C}{\partial b} = -\frac{1}{n}\sum_x \sigma(z)(1-\sigma(z))\ln\frac{y}{1-y}$ cannot be evaluated. So the second expression cannot replace the first.
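
    The difference can also be seen numerically. A minimal sketch of my own (the function names here are illustrative, not from the book's code), evaluating both candidate expressions at y = 0 and y = 1:

    ```python
    import numpy as np

    def correct_form(y, a):
        """Per-example cross-entropy -[y*ln(a) + (1-y)*ln(1-a)]."""
        return -(y * np.log(a) + (1 - y) * np.log(1 - a))

    def swapped_form(y, a):
        """The swapped expression -[a*ln(y) + (1-a)*ln(1-y)]."""
        return -(a * np.log(y) + (1 - a) * np.log(1 - y))

    a = 0.7  # an output activation strictly between 0 and 1
    for y in (0.0, 1.0):
        print(y, correct_form(y, a), swapped_form(y, a))

    # The correct form returns a finite cost for y = 0 and y = 1, while the
    # swapped form evaluates ln(0) and diverges (NumPy warns and returns inf).
    ```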

  • In the single-neuron discussion at the start of this section, I argued that the cross-entropy is small if $\sigma(z)\approx y$ for all training inputs. The argument relied on $y$ being equal to either $0$ or $1$. This is usually true in classification problems, but for other problems (e.g., regression problems) $y$ can sometimes take values intermediate between $0$ and $1$. Show that the cross-entropy is still minimized when $\sigma(z)=y$ for all training inputs. When this is the case the cross-entropy has the value:

    $$C=-\frac{1}{n}\sum_x\left[y\ln y+(1-y)\ln(1-y)\right] \qquad (64)$$

    The quantity $-\left[y\ln y+(1-y)\ln(1-y)\right]$ is sometimes known as the binary entropy.

    Proof: In a regression problem the (cross-entropy) cost is still

    $$C = -\frac{1}{n}\sum_x\left[y\ln a + (1 - y) \ln(1 - a) \right] \qquad (57)$$

    Since $a\in(0,1)$, we have $\ln a < 0$ and $\ln(1-a) < 0$, and with $y\in(0,1)$ this gives $C \geq 0$; moreover $C\rightarrow \infty$ as $a\rightarrow 0$, so $C$ has no maximum.

    $$\frac{\partial C}{\partial a} = -\frac{1}{n}\sum_x\left[\frac{y}{a}-\frac{1-y}{1-a}\right]=\frac{1}{n}\sum_x\frac{a-y}{a(1-a)}$$

    Since $a\in(0,1)$, we have $a(1-a)>0$. When $a>y$, $\frac{\partial C}{\partial a} > 0$ and $C$ is increasing in $a$; when $a<y$, $\frac{\partial C}{\partial a} < 0$ and $C$ is decreasing in $a$. Hence the minimum is attained at $a=y$, i.e. at $\sigma(z)=y$, and substituting $a=y$ into (57) gives (64). So in the regression setting, too, this remains a minimization problem solved by $\sigma(z)=y$.
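
    A quick numerical illustration (a minimal sketch of my own, not from the book) of the minimum sitting at $a = y$ for an intermediate target:

    ```python
    import numpy as np

    def cross_entropy(y, a):
        """Single-example cross-entropy -[y*ln(a) + (1-y)*ln(1-a)]."""
        return -(y * np.log(a) + (1 - y) * np.log(1 - a))

    y = 0.3                              # an intermediate target value
    a = np.linspace(0.001, 0.999, 999)   # candidate activations in (0, 1)
    costs = cross_entropy(y, a)

    print(a[np.argmin(costs)])   # ~0.3: the cost is minimized at a = y
    print(cross_entropy(y, y))   # the binary entropy -[y ln y + (1-y) ln(1-y)]
    ```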

  • Many-layer multi-neuron networks

    In the notation introduced in the last chapter, show that for the quadratic cost the partial derivative with respect to weights in the output layer is

    $$\frac{\partial C}{\partial w_{jk}^L}= \frac{1}{n}\sum_x a_k^{L-1}(a_j^L-y_j)\sigma'(z_j^L) \qquad (65)$$

    The term $\sigma'(z_j^L)$ causes a learning slowdown whenever an output neuron saturates on the wrong value. Show that for the cross-entropy cost the output error $\delta^L$ for a single training example $x$ is given by

    $$\delta^L = a^L-y \qquad (66)$$

    Use this expression to show that the partial derivative with respect to the weights in the output layer is given by

    $$\frac{\partial C}{\partial w_{jk}^L}= \frac{1}{n}\sum_x a_k^{L-1}(a_j^L-y_j) \qquad (67)$$

    The $\sigma'(z_j^L)$ term has vanished, and so the cross-entropy avoids the problem of learning slowdown, not just when used with a single neuron, as we saw earlier, but also in many-layer multi-neuron networks. A simple variation on this analysis holds also for the biases. If this is not obvious to you, then you should work through that analysis as well.

    Proof: Recall the backpropagation equations from the previous chapter:
    (BP1) $\delta_j^L = \frac{\partial C}{\partial a_j^L}\sigma'(z_j^L)$,
    (BP2) $\delta^l = ((w^{l+1})^T\delta^{l+1}) \odot \sigma'(z^l)$,
    (BP3) $\frac{\partial C}{\partial b_j^l} = \delta_j^l$,
    (BP4) $\frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$.

    For the quadratic cost $C = \frac{1}{2n}\sum_x\sum_j(a_j^L-y_j)^2$, (BP1) gives

    $$\delta_j^L = \frac{\partial C}{\partial a_j^L}\sigma'(z_j^L)=\frac{1}{n}\sum_x (a_j^L-y_j)\sigma'(z_j^L)$$

    and then (BP4) gives

    $$\frac{\partial C}{\partial w_{jk}^L}=\left[\frac{1}{n}\sum_x (a_j^L-y_j)\sigma'(z_j^L)\right]a_k^{L-1} = \frac{1}{n}\sum_x a_k^{L-1}(a_j^L-y_j)\sigma'(z_j^L) \qquad (65)$$

    For the cross-entropy cost, for a single training example $\frac{\partial C}{\partial a_j^L} = \frac{a_j^L-y_j}{a_j^L(1-a_j^L)}$, so by (BP1) and $\sigma'(z_j^L)=a_j^L(1-a_j^L)$,

    $$\delta_j^L = a_j^L-y_j, \quad\text{i.e.}\quad \delta^L = a^L-y \qquad (66)$$

    and averaging (BP4) over the training set gives

    $$\frac{\partial C}{\partial w_{jk}^L}= \frac{1}{n}\sum_x a_k^{L-1}(a_j^L-y_j) \qquad (67)$$
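
    The practical difference between (65) and (67) can be illustrated with a small numerical sketch (my own illustration with made-up values, not code from the book): for an output neuron saturated on the wrong value, the quadratic-cost gradient is crushed by the $\sigma'(z_j^L)$ factor, while the cross-entropy gradient stays large.

    ```python
    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    # One output neuron, badly saturated on the wrong value:
    # target y = 0, but z is large, so a = sigma(z) is close to 1.
    y, z = 0.0, 4.0
    a = sigma(z)
    a_prev = 1.0  # activation a_k^{L-1} feeding this weight (illustrative value)

    # Per-example weight gradients following (65) and (67):
    grad_quadratic = a_prev * (a - y) * sigma(z) * (1 - sigma(z))  # keeps sigma'(z)
    grad_cross_entropy = a_prev * (a - y)                          # sigma'(z) gone

    print(grad_quadratic)      # ~0.017: tiny gradient, learning is slow
    print(grad_cross_entropy)  # ~0.98: large gradient, learning stays fast
    ```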
