Andrew Ng - Deep Learning - Course 1 - Week 3 (Activation Functions and Backpropagation)

Numbering such as 1.2.2 means: the first digit is the course, the second the week, and the third the video within that week. The numbering follows the video order; sections whose videos contain little material are omitted. These notes are a brief summary rather than a full record of every detail in the videos; see [1] for more.

1. Neural Networks and Deep Learning

1.3 Shallow Neural Networks

1.3.1 Neural Network Overview

Last week we covered logistic regression (LR), which can be viewed as a simple single-layer NN. Now suppose we have a two-layer NN.

Forward propagation:

  • Layer 1:
    $$\left. \begin{array}{r} x\\ W^{[1]}\\ b^{[1]} \end{array} \right\} \implies z^{[1]} = W^{[1]}x + b^{[1]} \implies a^{[1]} = \sigma(z^{[1]})$$
  • Layer 2:
    $$\left. \begin{array}{r} a^{[1]} = \sigma(z^{[1]})\\ W^{[2]}\\ b^{[2]} \end{array} \right\} \implies z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} \implies a^{[2]} = \sigma(z^{[2]}) \implies L(a^{[2]}, y)$$

Backward propagation:
$$\left. \begin{array}{r} da^{[1]} = d\sigma(z^{[1]})\\ dW^{[2]}\\ db^{[2]} \end{array} \right\} \impliedby dz^{[2]} = d(W^{[2]}a^{[1]} + b^{[2]}) \impliedby da^{[2]} = d\sigma(z^{[2]}) \impliedby dL(a^{[2]}, y)$$
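The forward chain above can be sketched in a few lines of NumPy for a single example; the sizes (3 inputs, 4 hidden units, 1 output) and variable names are illustrative, not taken from the course code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 3-4-1 network and a single training example (x, y).
np.random.seed(0)
x = np.random.randn(3, 1)
y = 1.0
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

z1 = W1 @ x + b1          # z[1] = W[1] x + b[1]
a1 = sigmoid(z1)          # a[1] = sigma(z[1])
z2 = W2 @ a1 + b2         # z[2] = W[2] a[1] + b[2]
a2 = sigmoid(z2)          # a[2] = sigma(z[2])

# Cross-entropy loss L(a[2], y)
loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))
print(loss.item())
```

The backward pass traverses the same graph in reverse, applying the chain rule at each node; the vectorized formulas are given in section 1.3.10.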

1.3.2 Neural Network Representation

We use a bracketed superscript to indicate the layer of the network:

  • Input layer: $[x_1; x_2; x_3]$, which can also be written $a^{[0]}$:
    $$a^{[0]} = \left[ \begin{array}{c} a^{[0]}_{1}\\ a^{[0]}_{2}\\ a^{[0]}_{3} \end{array} \right]$$

  • Hidden layer (layer 1): four hidden units
    $$a^{[1]} = \left[ \begin{array}{c} a^{[1]}_{1}\\ a^{[1]}_{2}\\ a^{[1]}_{3}\\ a^{[1]}_{4} \end{array} \right]$$

  • Output layer: $\hat{y} = a^{[2]}$

  • Layer-1 parameters: $(W^{[1]}, b^{[1]})$, with $W^{[1]} \in R^{4 \times 3}$ and $b^{[1]} \in R^{4 \times 1}$

  • Layer-2 parameters: $(W^{[2]}, b^{[2]})$, with $W^{[2]} \in R^{1 \times 4}$ and $b^{[2]} \in R^{1 \times 1}$
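As a quick sanity check on these shapes, here is a minimal NumPy sketch (layer sizes taken from the example above; the values themselves are irrelevant here):

```python
import numpy as np

# Shape check for the 3-4-1 example network; zeros are used only because the
# values do not matter when checking dimensions.
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))   # layer-1 parameters
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))   # layer-2 parameters
a0 = np.zeros((3, 1))                         # a[0] = x, the input column vector

a1 = W1 @ a0 + b1                             # shape (4, 1): one value per hidden unit
a2 = W2 @ a1 + b2                             # shape (1, 1): the scalar output y-hat
print(a1.shape, a2.shape)
```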

1.3.3 Computing a Neural Network's Output

Each neuron can be thought of as a small logistic-regression unit. The computation for each hidden-layer neuron is:
$$a^{[1]}_1 = \sigma(z^{[1]}_1), \quad z^{[1]}_1 = W^{[1]T}_1 x + b^{[1]}_1$$

$$a^{[1]}_2 = \sigma(z^{[1]}_2), \quad z^{[1]}_2 = W^{[1]T}_2 x + b^{[1]}_2$$

$$a^{[1]}_3 = \sigma(z^{[1]}_3), \quad z^{[1]}_3 = W^{[1]T}_3 x + b^{[1]}_3$$

$$a^{[1]}_4 = \sigma(z^{[1]}_4), \quad z^{[1]}_4 = W^{[1]T}_4 x + b^{[1]}_4$$

Vectorized over all the units of layer $n$ (with $a^{[0]} = x$):
$$z^{[n]} = W^{[n]} a^{[n-1]} + b^{[n]}, \quad a^{[n]} = \sigma(z^{[n]})$$

In detail, for layer 1:
$$a^{[1]} = \left[ \begin{array}{c} a^{[1]}_{1}\\ a^{[1]}_{2}\\ a^{[1]}_{3}\\ a^{[1]}_{4} \end{array} \right] = \sigma(z^{[1]})$$

$$\left[ \begin{array}{c} z^{[1]}_{1}\\ z^{[1]}_{2}\\ z^{[1]}_{3}\\ z^{[1]}_{4} \end{array} \right] = \overbrace{\left[ \begin{array}{c} \cdots W^{[1]T}_{1} \cdots\\ \cdots W^{[1]T}_{2} \cdots\\ \cdots W^{[1]T}_{3} \cdots\\ \cdots W^{[1]T}_{4} \cdots \end{array} \right]}^{W^{[1]}} \overbrace{\left[ \begin{array}{c} x_1\\ x_2\\ x_3 \end{array} \right]}^{input} + \overbrace{\left[ \begin{array}{c} b^{[1]}_1\\ b^{[1]}_2\\ b^{[1]}_3\\ b^{[1]}_4 \end{array} \right]}^{b^{[1]}}$$
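The vectorization step can be checked numerically: stacking the row vectors $W^{[1]T}_i$ into the matrix $W^{[1]}$ gives the same $z^{[1]}$ as computing each unit separately. A minimal sketch with illustrative names and sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sketch: 3 inputs, 4 hidden units, as in the example network above.
np.random.seed(0)
W1 = np.random.randn(4, 3) * 0.01
b1 = np.zeros((4, 1))
x = np.random.randn(3, 1)

# Unit-by-unit computation: z_i = W1_i^T x + b1_i for each hidden unit i
z_loop = np.array([[W1[i, :] @ x[:, 0] + b1[i, 0]] for i in range(4)])

# Vectorized computation: one matrix-vector product covers the whole layer
z_vec = W1 @ x + b1
a1 = sigmoid(z_vec)

print(np.allclose(z_loop, z_vec))  # True: both give the same z[1]
```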

1.3.4 Vectorizing Across Multiple Examples

To handle m training examples, we simply repeat the single-example computation above. This could be done with an explicit loop, but vectorization is the better approach.

  • Let the input be $X$, containing $m$ examples; the superscript $(i)$ denotes the $i$-th example:
    $$X = \left[ \begin{array}{cccc} \vdots & \vdots & & \vdots\\ x^{(1)} & x^{(2)} & \cdots & x^{(m)}\\ \vdots & \vdots & & \vdots \end{array} \right]$$
  • Intermediate variable $Z^{[1]}$:
    $$Z^{[1]} = \left[ \begin{array}{cccc} \vdots & \vdots & & \vdots\\ z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)}\\ \vdots & \vdots & & \vdots \end{array} \right]$$
  • Hidden-layer activations $A^{[1]}$:
    $$A^{[1]} = \left[ \begin{array}{cccc} \vdots & \vdots & & \vdots\\ a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)}\\ \vdots & \vdots & & \vdots \end{array} \right]$$
  • Forward-propagation formulas (left: per example; right: vectorized over all $m$ examples):
    $$\left. \begin{array}{r} z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}\\ a^{[1](i)} = \sigma(z^{[1](i)})\\ z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}\\ a^{[2](i)} = \sigma(z^{[2](i)}) \end{array} \right\} \implies \begin{cases} Z^{[1]} = W^{[1]}X + b^{[1]}\\ A^{[1]} = \sigma(Z^{[1]})\\ Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}\\ A^{[2]} = \sigma(Z^{[2]}) \end{cases}$$

In these matrices, each column corresponds to a different training example and each row to a different node of the network; a minimal vectorized sketch is given below.
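Here is a minimal NumPy sketch of the vectorized forward pass over all $m$ examples at once; the layer sizes and function names are illustrative, not the course's reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Vectorized forward pass for a 2-layer NN; X holds one example per column."""
    Z1 = W1 @ X + b1        # (n_h, m); broadcasting adds b1 to every column
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2       # (n_y, m)
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

# Illustrative sizes: n_x = 3, n_h = 4, n_y = 1, m = 5 examples.
np.random.seed(0)
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))
X = np.random.randn(3, 5)

Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)
print(A2.shape)  # (1, 5): one prediction per column, i.e. per example
```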

1.3.6 Activation Functions

1. Sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Disadvantages:

  • Vanishing gradients (the function saturates);
  • Output is not zero-centered;
  • The expression involves an exponential, which is relatively costly to compute

Advantages:

  • The output range is [0, 1]

Derivative:
$$\frac{d}{dz}g(z) = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right) = g(z)\,(1 - g(z))$$

  • For $z = 10$ or $z = -10$: $\frac{d}{dz}g(z) \approx 0$
  • For $z = 0$: $\frac{d}{dz}g(z) = g(z)(1 - g(z)) = 1/4$
  • In a neural network, writing $a = g(z)$: $g'(z) = \frac{d}{dz}g(z) = a(1 - a)$
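A minimal sketch of sigmoid and its derivative, illustrating the values above (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1.0 - a)          # g'(z) = a(1 - a)

print(sigmoid_grad(0.0))          # 0.25, the maximum of the derivative
print(sigmoid_grad(10.0))         # ~4.5e-05: saturation, gradient nearly vanishes
```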

2. Tanh:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

Disadvantages:

  • Vanishing gradients;
  • Requires exponentials

Advantages:

  • The output range is [-1, 1], so the output is zero-centered

Derivative:
$$\frac{d}{dz}g(z) = 1 - (\tanh(z))^{2}$$

  • For $z = 10$ or $z = -10$: $\frac{d}{dz}g(z) \approx 0$
  • For $z = 0$: $\frac{d}{dz}g(z) = 1 - 0 = 1$
  • In a neural network, writing $a = g(z)$: $g'(z) = \frac{d}{dz}g(z) = 1 - a^{2}$
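A minimal sketch of the tanh derivative, again illustrating the saturation behaviour (names are illustrative):

```python
import numpy as np

def tanh_grad(z):
    a = np.tanh(z)
    return 1.0 - a ** 2           # g'(z) = 1 - a^2

print(tanh_grad(0.0))             # 1.0, the maximum of the derivative
print(tanh_grad(10.0))            # ~8.2e-09: saturation, gradient nearly vanishes
```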

3. ReLU: $f(z) = \max(0, z)$

Advantages:

  • Only a simple threshold is needed, so it is computationally cheap;
  • One-sided suppression gives the network a sparse representation, which helps control overfitting;
  • It does not saturate for positive inputs, which effectively mitigates the vanishing-gradient problem

Disadvantages:

  • Output is not zero-centered;
  • "Dying ReLU": units stuck in the negative region get zero gradient, so their parameters stop updating

Derivative:
$$g'(z) = \begin{cases} 0 & \text{if } z < 0\\ 1 & \text{if } z > 0\\ \text{undefined} & \text{if } z = 0 \end{cases}$$
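A minimal sketch of ReLU and its derivative; since the derivative is undefined at $z = 0$, implementations typically just assign 0 (or 1) there:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for z > 0, 0 for z <= 0 (convention at z = 0)

print(relu(np.array([-2.0, 0.0, 3.0])))       # [0. 0. 3.]
print(relu_grad(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
```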

4. Leaky ReLU: $f(z) = \max(\alpha z, z)$, where $\alpha$ is a small positive slope

Advantages:

  • Fixes the dying-neuron problem;
  • Low computational cost, one-sided suppression, and no saturation for positive inputs

Derivative:
$$g'(z) = \begin{cases} \alpha & \text{if } z < 0\\ 1 & \text{if } z > 0\\ \text{undefined} & \text{if } z = 0 \end{cases}$$
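A minimal sketch of Leaky ReLU, assuming the common slope $\alpha = 0.01$ (the exact value is a hyperparameter, not fixed by the course):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)  # the gradient never becomes exactly zero

print(leaky_relu(np.array([-2.0, 3.0])))       # [-0.02  3.  ]
print(leaky_relu_grad(np.array([-2.0, 3.0])))  # [0.01 1.  ]
```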

5. Softmax:
$$\sigma(z_{j}) = \frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}}$$
Similar in spirit to sigmoid, but used at the output layer for multi-class classification; sigmoid is used to output a probability for binary classification.
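A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / np.sum(e)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities over 3 classes; they sum to 1 (up to floating point)
```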

1.3.10 Backpropagation Intuition

Recall the logistic-regression computation graph, with the gradient of each node written below it:
$$\underbrace{\left. \begin{array}{l} x\\ w\\ b \end{array} \right\}}_{dw = dz \cdot x,\;\; db = dz} \impliedby \underbrace{z = w^{T}x + b}_{dz = da \cdot g'(z),\;\; g(z) = \sigma(z),\;\; \frac{dL}{dz} = \frac{dL}{da}\cdot\frac{da}{dz}} \impliedby \underbrace{a = \sigma(z) \impliedby L(a,y)}_{da = \frac{d}{da}L(a,y) = \left(-y\log a - (1-y)\log(1-a)\right)' = -\frac{y}{a} + \frac{1-y}{1-a}}$$
A neural network can be viewed as several layers of logistic regression stacked together. The vectorized gradients (formula 3.44 in the notes):
$$L = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}), \qquad dZ^{[2]} = A^{[2]} - Y$$

$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}, \qquad db^{[2]} = \frac{1}{m}\, \text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True})$$

$$\underbrace{dZ^{[1]}}_{(n^{[1]},\, m)} = \underbrace{W^{[2]T} dZ^{[2]}}_{(n^{[1]},\, m)} * \underbrace{g^{[1]\prime}(Z^{[1]})}_{(n^{[1]},\, m)}$$
where $*$ denotes the element-wise product.

$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}$$

$$db^{[1]} = \frac{1}{m}\, \text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$$
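These formulas translate almost line by line into NumPy. The sketch below assumes sigmoid activations in both layers (matching the forward pass of section 1.3.4); A1 and A2 are the cached forward-pass activations, and the names are illustrative:

```python
import numpy as np

def backward(X, Y, W2, A1, A2):
    """Backward pass for a 2-layer NN with sigmoid activations in both layers."""
    m = X.shape[1]
    dZ2 = A2 - Y                                          # (n_y, m)
    dW2 = (1.0 / m) * dZ2 @ A1.T                          # (n_y, n_h)
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)  # (n_y, 1)
    dZ1 = (W2.T @ dZ2) * (A1 * (1.0 - A1))                # element-wise * with g[1]'(Z1)
    dW1 = (1.0 / m) * dZ1 @ X.T                           # (n_h, n_x)
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)  # (n_h, 1)
    return dW1, db1, dW2, db2
```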

1.3.11 Random Initialization

If you initialize all the weights to 0, gradient descent will not work: the hidden units in a layer compute identical functions and receive identical updates, so their weights stay equal forever and nothing breaks the symmetry.

  • Randomly initializing $W$ avoids this problem; $b$ does not need random initialization;
  • $W$ is usually initialized to small random numbers, while $b$ can simply be zeros, e.g.:
    $$W^{[1]} = np.random.randn(2,2) * 0.01, \qquad b^{[1]} = np.zeros((2,1))$$
    $$W^{[2]} = np.random.randn(2,2) * 0.01, \qquad b^{[2]} = 0$$
  • The random values are kept small because, with sigmoid/tanh activations, large pre-activations fall in the saturated region where the gradient is tiny or vanishes. A minimal sketch follows below.
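A minimal initialization sketch following the recipe above, using the 3-4-1 example network from section 1.3.2 (layer sizes and the 0.01 scale are as stated; the function name is illustrative):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    W1 = np.random.randn(n_h, n_x) * 0.01  # small random values break symmetry
    b1 = np.zeros((n_h, 1))                # biases can safely start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2

W1, b1, W2, b2 = initialize_parameters(3, 4, 1)
print(W1.shape, b1.shape, W2.shape, b2.shape)  # (4, 3) (4, 1) (1, 4) (1, 1)
```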