Numbering convention (e.g. 1.2.2): the first digit is the course, the second the week, and the third the video within that week. The numbers follow the video order; sections with little content are omitted. These notes are a brief summary rather than a full record of every detail in the videos; see [1] for the complete version.
1. Neural Networks and Deep Learning
1.3 Shallow Neural Networks
1.3.1 Neural Network Overview
Last week we covered logistic regression (LR), which can be viewed as a simple single-layer NN. Suppose we have a two-layer NN.
Forward propagation:
- First layer:
$$\left. \begin{array}{r} x\\ W^{[1]}\\ b^{[1]} \end{array} \right\} \implies z^{[1]}=W^{[1]}x+b^{[1]} \implies a^{[1]} = \sigma(z^{[1]})$$
- Second layer:
$$\left. \begin{array}{r} a^{[1]} = \sigma(z^{[1]})\\ W^{[2]}\\ b^{[2]} \end{array} \right\} \implies z^{[2]}=W^{[2]}a^{[1]}+b^{[2]} \implies a^{[2]} = \sigma(z^{[2]}) \implies L(a^{[2]},y)$$
Backward propagation:
$$\left. \begin{array}{r} da^{[1]} = d\sigma(z^{[1]})\\ dW^{[2]}\\ db^{[2]} \end{array} \right\} \impliedby dz^{[2]}=d(W^{[2]}a^{[1]}+b^{[2]}) \impliedby da^{[2]} = d\sigma(z^{[2]}) \impliedby dL(a^{[2]},y)$$
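To make the two chains concrete, here is a minimal NumPy sketch (shapes and random values are assumed for illustration; this is not the course's official code) of one forward pass followed by the backward pass for a single example, using sigmoid activations and the logistic loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
# Assumed sizes: 3 inputs, 4 hidden units, 1 output (as in 1.3.2 below).
x = np.random.randn(3, 1)                 # one input example (column vector)
y = np.array([[1.0]])                     # its label
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

# Forward propagation (left-to-right chain above)
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# Backward propagation (right-to-left chain above)
dz2 = a2 - y                   # dL/dz2 for a sigmoid output with logistic loss
dW2 = dz2 @ a1.T
db2 = dz2
da1 = W2.T @ dz2
dz1 = da1 * a1 * (1 - a1)      # sigma'(z1) = a1 * (1 - a1)
dW1 = dz1 @ x.T
db1 = dz1
```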
1.3.2 Neural Network Representation
A superscript in square brackets denotes the layer index:
- Input layer: $[x_1; x_2; x_3]$, which can also be written as $a^{[0]}$:
$$a^{[0]} = \left[ \begin{array}{c} a^{[0]}_{1}\\ a^{[0]}_{2}\\ a^{[0]}_{3} \end{array} \right]$$
- Hidden layer (layer 1): four hidden units
$$a^{[1]} = \left[ \begin{array}{c} a^{[1]}_{1}\\ a^{[1]}_{2}\\ a^{[1]}_{3}\\ a^{[1]}_{4} \end{array} \right]$$
- Output layer: $\hat{y} = a^{[2]}$
- Layer-1 parameters: $(W^{[1]}, b^{[1]})$, with $W^{[1]} \in \mathbb{R}^{4 \times 3}$ and $b^{[1]} \in \mathbb{R}^{4 \times 1}$
- Layer-2 parameters: $(W^{[2]}, b^{[2]})$, with $W^{[2]} \in \mathbb{R}^{1 \times 4}$ and $b^{[2]} \in \mathbb{R}^{1 \times 1}$
1.3.3 Computing a Neural Network's Output
Each neuron can be viewed as a logistic regression unit.
Computation of the hidden-layer neurons:
$$a^{[1]}_1 = \sigma(z^{[1]}_1),\quad z^{[1]}_1 = W^{[1]T}_1 x + b^{[1]}_1$$
$$a^{[1]}_2 = \sigma(z^{[1]}_2),\quad z^{[1]}_2 = W^{[1]T}_2 x + b^{[1]}_2$$
$$a^{[1]}_3 = \sigma(z^{[1]}_3),\quad z^{[1]}_3 = W^{[1]T}_3 x + b^{[1]}_3$$
$$a^{[1]}_4 = \sigma(z^{[1]}_4),\quad z^{[1]}_4 = W^{[1]T}_4 x + b^{[1]}_4$$
Vectorized over the units of a layer $n$:
$$a^{[n]}=\sigma(z^{[n]}),\quad z^{[n]} = W^{[n]}a^{[n-1]} + b^{[n]},\quad a^{[0]} = x$$
In detail for the first layer:
$$a^{[1]} = \left[ \begin{array}{c} a^{[1]}_{1}\\ a^{[1]}_{2}\\ a^{[1]}_{3}\\ a^{[1]}_{4} \end{array} \right] = \sigma(z^{[1]})$$
$$\left[ \begin{array}{c} z^{[1]}_{1}\\ z^{[1]}_{2}\\ z^{[1]}_{3}\\ z^{[1]}_{4} \end{array} \right] = \overbrace{ \left[ \begin{array}{c} \cdots W^{[1]T}_{1} \cdots\\ \cdots W^{[1]T}_{2} \cdots\\ \cdots W^{[1]T}_{3} \cdots\\ \cdots W^{[1]T}_{4} \cdots \end{array} \right] }^{W^{[1]}} \overbrace{ \left[ \begin{array}{c} x_1\\ x_2\\ x_3 \end{array} \right] }^{\text{input}} + \overbrace{ \left[ \begin{array}{c} b^{[1]}_1\\ b^{[1]}_2\\ b^{[1]}_3\\ b^{[1]}_4 \end{array} \right] }^{b^{[1]}}$$
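As a quick sanity check of this vectorization (my own sketch, with assumed random values), the snippet below computes the four hidden-unit pre-activations one by one with dot products and then as a single matrix product, and confirms the two agree:

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(3, 1)            # 3 input features
W1 = np.random.randn(4, 3)           # row i holds W^{[1]T}_i
b1 = np.random.randn(4, 1)

# Unit by unit: z_i = W^{[1]T}_i x + b_i
z_loop = np.array([[W1[i] @ x[:, 0] + b1[i, 0]] for i in range(4)])

# Vectorized: one matrix product over the stacked rows
z_vec = W1 @ x + b1

print(np.allclose(z_loop, z_vec))    # True
```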
1.3.4 Vectorizing Across Multiple Examples
Repeat the single-example computation for each of the m examples. This could be done with a for loop, but vectorizing over all examples is the better approach.
- Let the input be $X$ with m examples; the superscript $(i)$ denotes the $i$-th example:
$$X = \left[ \begin{array}{cccc} \vdots & \vdots & \vdots & \vdots\\ x^{(1)} & x^{(2)} & \cdots & x^{(m)}\\ \vdots & \vdots & \vdots & \vdots \end{array} \right]$$
- Intermediate variable $Z^{[1]}$:
$$Z^{[1]} = \left[ \begin{array}{cccc} \vdots & \vdots & \vdots & \vdots\\ z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)}\\ \vdots & \vdots & \vdots & \vdots \end{array} \right]$$
- Hidden-layer activations $A^{[1]}$:
$$A^{[1]} = \left[ \begin{array}{cccc} \vdots & \vdots & \vdots & \vdots\\ a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)}\\ \vdots & \vdots & \vdots & \vdots \end{array} \right]$$
- Forward-propagation formulas (a vectorized sketch follows this list):
$$\left. \begin{array}{r} z^{[1](i)} = W^{[1]}x^{(i)} + b^{[1]}\\ a^{[1](i)} = \sigma(z^{[1](i)})\\ z^{[2](i)} = W^{[2]}a^{[1](i)} + b^{[2]}\\ a^{[2](i)} = \sigma(z^{[2](i)}) \end{array} \right\} \implies \begin{cases} Z^{[1]} = W^{[1]}X+b^{[1]}\\ A^{[1]} = \sigma(Z^{[1]})\\ Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}\\ A^{[2]} = \sigma(Z^{[2]}) \end{cases}$$
In these matrices, each column corresponds to a different training example and each row to a different node of the network.
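A minimal sketch of the vectorized forward pass over m examples (sizes and values are assumed for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(2)
m = 5                                       # number of examples
X = np.random.randn(3, m)                   # column i is x^{(i)}
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))

Z1 = W1 @ X + b1     # (4, m); broadcasting adds b1 to every column
A1 = sigmoid(Z1)     # (4, m)
Z2 = W2 @ A1 + b2    # (1, m)
A2 = sigmoid(Z2)     # (1, m); column i is the prediction for example i
```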
1.3.6 Activation Functions
1. Sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Drawbacks:
- vanishing gradients, i.e. saturation;
- not zero-mean: the output is not centered at 0;
- the expression contains an exponential (extra computation).
Advantages:
- the output range is [0, 1].
Derivative:
$$\frac{d}{dz}g(z) = \frac{1}{1 + e^{-z}}\left(1-\frac{1}{1 + e^{-z}}\right)=g(z)(1-g(z))$$
- When $z = 10$ or $z = -10$: $\frac{d}{dz}g(z)\approx 0$
- When $z = 0$: $\frac{d}{dz}g(z) = g(z)(1-g(z)) = 1/4$
- In a neural network, with $a = g(z)$: $g'(z)=\frac{d}{dz}g(z)=a(1-a)$
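A small sketch (my own illustration, not course code) of sigmoid and its derivative written in terms of the activation $a$, checking the values quoted above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1.0 - a)          # g'(z) = a(1 - a)

print(sigmoid_grad(0.0))          # 0.25
print(sigmoid_grad(10.0))         # ~4.5e-05, essentially 0 (saturation)
```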
2. Tanh:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Drawbacks:
- vanishing gradients;
- exponentials in the expression.
Advantages:
- the output range is [-1, 1], which solves the zero-mean problem.
Derivative:
$$\frac{d}{dz}g(z) = 1 - (\tanh(z))^{2}$$
- When $z = 10$ or $z = -10$: $\frac{d}{dz}g(z)\approx 0$
- When $z = 0$: $\frac{d}{dz}g(z) = 1 - 0 = 1$
- In a neural network, with $a = g(z)$: $g'(z)=\frac{d}{dz}g(z)=1-a^{2}$
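Similarly, a sketch using NumPy's built-in tanh, with the derivative expressed through $a$:

```python
import numpy as np

def tanh_grad(z):
    a = np.tanh(z)
    return 1.0 - a ** 2           # g'(z) = 1 - a^2

print(tanh_grad(0.0))             # 1.0
print(tanh_grad(10.0))            # ~8.2e-09, essentially 0 (saturation)
```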
3. ReLU: $f(z) = \max(0,z)$
Advantages:
- only a threshold comparison is needed, so it is cheap to compute;
- one-sided suppression gives sparse activations, which helps control overfitting;
- it does not saturate on the positive side, which effectively mitigates vanishing gradients.
Drawbacks:
- not zero-centered;
- "dead" neurons: a unit whose input stays negative gets zero gradient, so its parameters stop updating.
Derivative:
$$g'(z)= \begin{cases} 0 & \text{if } z < 0\\ 1 & \text{if } z > 0\\ \text{undefined} & \text{if } z = 0 \end{cases}$$
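A sketch of ReLU and the subgradient used in practice; the undefined point $z=0$ is conventionally assigned 0 or 1, and 0 is assumed here:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient: 0 for z < 0, 1 for z > 0; z == 0 assigned 0 by convention here.
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(relu_grad(z))   # [0. 0. 1.]
```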
4. Leaky ReLU: $f(z) = \max(\alpha z, z)$ with a small slope $\alpha$ (e.g. 0.01)
Advantages:
- avoids the dead-neuron problem;
- keeps ReLU's low cost, one-sided suppression, and non-saturating behaviour.
Derivative:
$$g'(z)= \begin{cases} \alpha & \text{if } z < 0\\ 1 & \text{if } z > 0\\ \text{undefined} & \text{if } z = 0 \end{cases}$$
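The corresponding sketch for Leaky ReLU, with $\alpha$ assumed to be 0.01:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # alpha for z < 0, 1 for z > 0; z == 0 assigned alpha here by convention.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, 3.0])
print(leaky_relu(z))        # [-0.02  3.  ]
print(leaky_relu_grad(z))   # [0.01 1.  ]
```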
5. Softmax:
$$\sigma(z_{j})=\frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}}$$
Similar in spirit to sigmoid, but used at the output for multi-class classification, whereas sigmoid outputs a probability for binary classification.
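A sketch of softmax over a vector of K scores; subtracting the maximum before exponentiating is a standard numerical-stability trick (my addition, not from the notes) that does not change the result because softmax is shift-invariant:

```python
import numpy as np

def softmax(z):
    z_shift = z - np.max(z)            # stability: softmax(z) == softmax(z - c)
    e = np.exp(z_shift)
    return e / np.sum(e)

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p)              # [0.09003057 0.24472847 0.66524096]
print(p.sum())        # 1.0
```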
1.3.10 Backpropagation Intuition
Recall the logistic regression computation graph:
$$\underbrace{ \left. \begin{array}{l} x\\ w\\ b \end{array} \right\} }_{dw=dz\cdot x,\; db=dz} \impliedby \underbrace{z=w^{T}x+b}_{dz=da\cdot g'(z),\; g(z)=\sigma(z),\; \frac{dL}{dz}=\frac{dL}{da}\cdot\frac{da}{dz},\; \frac{d}{dz}g(z)=g'(z)} \impliedby \underbrace{a = \sigma(z) \impliedby L(a,y)}_{da=\frac{d}{da}L(a,y)=(-y\log a - (1-y)\log(1-a))'=-\frac{y}{a} + \frac{1-y}{1-a}}$$
A neural network can be viewed as stacked layers of logistic regression.
Formula 3.44, the vectorized gradients (a code sketch follows these equations):
$$L = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)}),\qquad dZ^{[2]}=A^{[2]}-Y$$
$$dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T},\qquad db^{[2]} = \frac{1}{m}\mathrm{np.sum}(dZ^{[2]},\mathrm{axis}=1,\mathrm{keepdims}=\mathrm{True})$$
$$\underbrace{dZ^{[1]}}_{(n^{[1]},\, m)} = \underbrace{W^{[2]T}dZ^{[2]}}_{(n^{[1]},\, m)} * \underbrace{g^{[1]\prime}(Z^{[1]})}_{(n^{[1]},\, m)}$$
$$dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}$$
$$db^{[1]} = \frac{1}{m}\mathrm{np.sum}(dZ^{[1]},\mathrm{axis}=1,\mathrm{keepdims}=\mathrm{True})$$
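A minimal NumPy sketch of these vectorized gradients. A forward pass is recomputed first so the snippet is self-contained; tanh is assumed for the hidden layer, so $g^{[1]\prime}(Z^{[1]}) = 1 - A^{[1]2}$, and the labels Y are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(3)
m = 5
X = np.random.randn(3, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)   # made-up binary labels
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

# Forward pass (tanh hidden layer, sigmoid output)
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Backward pass: formula 3.44
dZ2 = A2 - Y
dW2 = (1 / m) * dZ2 @ A1.T
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)               # g^{[1]}'(Z1) = 1 - A1^2 for tanh
dW1 = (1 / m) * dZ1 @ X.T
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
```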
1.3.11 Random Initialization
If all the weights are initialized to 0, gradient descent will not work: every unit in a layer computes the same function and receives the identical update, so the weights within each layer stay equal and nothing new is learned.
- Randomly initializing $W$ effectively avoids this problem; $b$ does not need random initialization.
- $W$ is usually initialized to small random numbers, while $b$ can simply be set to zero (see the sketch after this list), e.g.
$$W^{[1]} = \mathrm{np.random.randn}(2,2) * 0.01,\quad b^{[1]} = \mathrm{np.zeros}((2,1))$$
$$W^{[2]} = \mathrm{np.random.randn}(2,2) * 0.01,\quad b^{[2]} = 0$$
- Small values are used because with sigmoid/tanh activations, large pre-activations put the units in the saturated region where gradients are tiny or vanish.
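A sketch of this initialization for the 3-input / 4-hidden-unit / 1-output network used throughout this section (layer sizes assumed from 1.3.2; the 0.01 scale follows the notes):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    """Small random weights break symmetry; biases can start at zero."""
    W1 = np.random.randn(n_h, n_x) * scale
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * scale
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

params = initialize_parameters(n_x=3, n_h=4, n_y=1)
print({k: v.shape for k, v in params.items()})
# {'W1': (4, 3), 'b1': (4, 1), 'W2': (1, 4), 'b2': (1, 1)}
```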