Fundamentals of Deep Learning: The Backpropagation Algorithm (1): Getting Started with Backpropagation
Fundamentals of Deep Learning: The Backpropagation Algorithm (2): A Slightly More Complex Backpropagation
Fundamentals of Deep Learning: The Backpropagation Algorithm (3): The Complete Backpropagation Algorithm
Preface
The previous article covered a single fully connected layer without an activation function, which is the simplest case. This article moves on to multiple outputs and to applying a nonlinear activation function. As before, please note: every derivation here is carried out only for the specific parameter settings chosen and is not fully general. The same steps do carry over to the general case, so although what follows is not a rigorous proof of the backpropagation algorithm, it should be a good aid to understanding it.
I. Parameter Setup
As before, the input is a row vector of length 3, $x = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$, and the output is now a row vector of length 2, $\hat{y} = \begin{bmatrix} \hat{y}_1 & \hat{y}_2 \end{bmatrix}$. We denote the weights by $\omega$ and the bias by $b$. Since we are modeling a classification task, we also introduce a nonlinear activation function $g$; to keep the derivatives simple, we choose the sigmoid:

$$g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) = g(x)\,(1 - g(x)) \tag{1}$$
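As a quick sanity check of equation (1), here is a minimal sketch (assuming NumPy; the helper names and the test points are mine, not from the article) that compares the analytic derivative against a central finite difference:

```python
import numpy as np

def g(x):
    # sigmoid, equation (1)
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    # analytic derivative g'(x) = g(x)(1 - g(x))
    s = g(x)
    return s * (1.0 - s)

x = np.linspace(-3.0, 3.0, 7)                      # arbitrary test points
eps = 1e-6
numeric = (g(x + eps) - g(x - eps)) / (2.0 * eps)  # central finite difference
print(np.max(np.abs(numeric - g_prime(x))))        # tiny (~1e-10): the two agree
```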
With these parameters in place, the forward pass can be written as:

$$g(x \omega + b) = \hat{y} \tag{2}$$
Expanding this expression gives:

$$g\left(\begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \omega_{22} \\ \omega_{31} & \omega_{32} \end{bmatrix} + \begin{bmatrix} b_1 & b_2 \end{bmatrix}\right) = \begin{bmatrix} \hat{y}_1 & \hat{y}_2 \end{bmatrix} \tag{3}$$
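To make equation (3) concrete, here is a minimal sketch of the forward pass with made-up values for $x$, $\omega$ and $b$; only the shapes (1×3 input, 3×2 weights, 1×2 bias) are taken from the setup above:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))   # sigmoid from equation (1)

x = np.array([[1.0, 2.0, 3.0]])       # 1x3 input row vector (example values)
w = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])            # 3x2 weight matrix (example values)
b = np.array([[0.0, 0.0]])            # 1x2 bias (example values)

y_hat = g(x @ w + b)                  # equation (3): g(x*w + b) = y_hat
print(y_hat)                          # 1x2 prediction
```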
II. First, Without the Activation Function
We first set the activation function aside and temporarily write the result as $a = \begin{bmatrix} a_1 & a_2 \end{bmatrix}$. This gives:
$$\begin{bmatrix} a_1 & a_2 \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} \begin{bmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \omega_{22} \\ \omega_{31} & \omega_{32} \end{bmatrix} + \begin{bmatrix} b_1 & b_2 \end{bmatrix} \tag{4}$$
Fully expanding equation (4) yields the following two expressions:
$$a_1 = \omega_{11} x_1 + \omega_{21} x_2 + \omega_{31} x_3 + b_1 \tag{5}$$

$$a_2 = \omega_{12} x_1 + \omega_{22} x_2 + \omega_{32} x_3 + b_2 \tag{6}$$
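Equations (5) and (6) are just the two columns of the matrix product in equation (4) written out. A quick numerical check with made-up values confirms that the scalar and matrix forms agree:

```python
import numpy as np

x1, x2, x3 = 1.0, 2.0, 3.0            # example inputs
w = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])            # example weights
b1, b2 = 0.1, -0.2                    # example biases

a1 = w[0, 0] * x1 + w[1, 0] * x2 + w[2, 0] * x3 + b1   # equation (5)
a2 = w[0, 1] * x1 + w[1, 1] * x2 + w[2, 1] * x3 + b2   # equation (6)

a = np.array([x1, x2, x3]) @ w + np.array([b1, b2])    # equation (4)
print(np.allclose(a, [a1, a2]))                        # True
```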
As in the previous article, we can take the partial derivative of these two expressions with respect to each parameter, which gives:
$$\frac{\partial a_1}{\partial \omega_{11}} = x_1, \quad \frac{\partial a_1}{\partial \omega_{21}} = x_2, \quad \frac{\partial a_1}{\partial \omega_{31}} = x_3, \quad \frac{\partial a_1}{\partial b_1} = 1 \tag{7}$$

$$\frac{\partial a_2}{\partial \omega_{12}} = x_1, \quad \frac{\partial a_2}{\partial \omega_{22}} = x_2, \quad \frac{\partial a_2}{\partial \omega_{32}} = x_3, \quad \frac{\partial a_2}{\partial b_2} = 1 \tag{8}$$
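Equations (7) and (8) say that the derivative of a pre-activation with respect to a weight is simply the input that multiplies that weight. The sketch below checks one case, $\partial a_1 / \partial \omega_{21} = x_2$, by nudging $\omega_{21}$ and measuring the change in $a_1$ (all values are arbitrary examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])         # example input
w = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])            # example weights
b = np.array([0.1, -0.2])             # example biases

def a1(w):
    return (x @ w + b)[0]             # first pre-activation, equation (5)

eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[1, 0] += eps                   # perturb omega_21
w_minus[1, 0] -= eps
numeric = (a1(w_plus) - a1(w_minus)) / (2.0 * eps)
print(numeric, x[1])                  # both ~2.0, i.e. da1/dw21 = x2
```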
These are the partial derivatives we need at this stage. Next we bring the activation function back in.
III. Adding the Activation Function
In this stage we apply the nonlinear activation to $a = \begin{bmatrix} a_1 & a_2 \end{bmatrix}$, that is:

$$\hat{y} = g(a) \tag{9}$$
Expanded, this becomes:

$$\begin{bmatrix} \hat{y}_1 & \hat{y}_2 \end{bmatrix} = g\left(\begin{bmatrix} a_1 & a_2 \end{bmatrix}\right) \tag{10}$$
For each element we have:

$$\hat{y}_1 = g(a_1), \quad \hat{y}_2 = g(a_2) \tag{11}$$
So the partial derivative of each $\hat{y}_i$ with respect to $a_i$ is:

$$\frac{\partial \hat{y}_1}{\partial a_1} = g'(a_1), \quad \frac{\partial \hat{y}_2}{\partial a_2} = g'(a_2) \tag{12}$$
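Because $g$ is applied element by element, each $\hat{y}_i$ depends only on its own $a_i$, so equation (12) is just the scalar derivative from equation (1) evaluated at each component. A quick numerical confirmation with made-up pre-activations:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

a = np.array([0.5, -1.2])                          # example pre-activations a1, a2
eps = 1e-6
numeric = (g(a + eps) - g(a - eps)) / (2.0 * eps)  # d y_hat_i / d a_i per component
print(np.allclose(numeric, g_prime(a)))            # True: matches equation (12)
```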
IV. Defining the Cost
As before, we use the sum of squared differences between the outputs and the targets as the final cost:

$$C = \mathrm{cost} = \sum_i (\hat{y}_i - y_i)^2 = (\hat{y}_1 - y_1)^2 + (\hat{y}_2 - y_2)^2 \tag{13}$$
From this, the partial derivatives of $C$ with respect to the two predicted outputs $\hat{y}_1$ and $\hat{y}_2$ are:

$$\frac{\partial C}{\partial \hat{y}_1} = 2(\hat{y}_1 - y_1), \quad \frac{\partial C}{\partial \hat{y}_2} = 2(\hat{y}_2 - y_2) \tag{14}$$
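Equations (13) and (14) translate directly into code. A minimal sketch with made-up predictions and targets (the function names here are mine; the article's full program appears later):

```python
import numpy as np

def cost(y_pred, y):
    return np.sum((y_pred - y) ** 2)        # equation (13)

def cost_gradient(y_pred, y):
    return 2.0 * (y_pred - y)               # equation (14), one entry per output

y_pred = np.array([0.8, 0.3])               # example predictions
y = np.array([1.0, 0.0])                    # example targets
print(cost(y_pred, y))                      # (0.8-1)^2 + (0.3-0)^2 = 0.13
print(cost_gradient(y_pred, y))             # [-0.4  0.6]
```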
V. Putting It All Together
The work so far has produced the partial derivative for each individual step. Applying the chain rule, we obtain the partial derivative of the final cost with respect to each parameter ($\omega$ and $b$):
$$\frac{\partial C}{\partial \omega_{11}} = \frac{\partial a_1}{\partial \omega_{11}} \cdot \frac{\partial \hat{y}_1}{\partial a_1} \cdot \frac{\partial C}{\partial \hat{y}_1} = x_1 \cdot g'(a_1) \cdot 2(\hat{y}_1 - y_1) \tag{15}$$

$$\frac{\partial C}{\partial \omega_{21}} = \frac{\partial a_1}{\partial \omega_{21}} \cdot \frac{\partial \hat{y}_1}{\partial a_1} \cdot \frac{\partial C}{\partial \hat{y}_1} = x_2 \cdot g'(a_1) \cdot 2(\hat{y}_1 - y_1) \tag{16}$$

$$\frac{\partial C}{\partial \omega_{31}} = \frac{\partial a_1}{\partial \omega_{31}} \cdot \frac{\partial \hat{y}_1}{\partial a_1} \cdot \frac{\partial C}{\partial \hat{y}_1} = x_3 \cdot g'(a_1) \cdot 2(\hat{y}_1 - y_1) \tag{17}$$

$$\frac{\partial C}{\partial b_1} = \frac{\partial a_1}{\partial b_1} \cdot \frac{\partial \hat{y}_1}{\partial a_1} \cdot \frac{\partial C}{\partial \hat{y}_1} = g'(a_1) \cdot 2(\hat{y}_1 - y_1) \tag{18}$$

$$\frac{\partial C}{\partial \omega_{12}} = \frac{\partial a_2}{\partial \omega_{12}} \cdot \frac{\partial \hat{y}_2}{\partial a_2} \cdot \frac{\partial C}{\partial \hat{y}_2} = x_1 \cdot g'(a_2) \cdot 2(\hat{y}_2 - y_2) \tag{19}$$

$$\frac{\partial C}{\partial \omega_{22}} = \frac{\partial a_2}{\partial \omega_{22}} \cdot \frac{\partial \hat{y}_2}{\partial a_2} \cdot \frac{\partial C}{\partial \hat{y}_2} = x_2 \cdot g'(a_2) \cdot 2(\hat{y}_2 - y_2) \tag{20}$$

$$\frac{\partial C}{\partial \omega_{32}} = \frac{\partial a_2}{\partial \omega_{32}} \cdot \frac{\partial \hat{y}_2}{\partial a_2} \cdot \frac{\partial C}{\partial \hat{y}_2} = x_3 \cdot g'(a_2) \cdot 2(\hat{y}_2 - y_2) \tag{21}$$

$$\frac{\partial C}{\partial b_2} = \frac{\partial a_2}{\partial b_2} \cdot \frac{\partial \hat{y}_2}{\partial a_2} \cdot \frac{\partial C}{\partial \hat{y}_2} = g'(a_2) \cdot 2(\hat{y}_2 - y_2) \tag{22}$$
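Before reorganizing these formulas, it is worth confirming one of them numerically. The sketch below checks equation (15): it evaluates $\partial C / \partial \omega_{11}$ once analytically, as $x_1 \cdot g'(a_1) \cdot 2(\hat{y}_1 - y_1)$, and once by finite differences through the full forward pass. All values are made-up examples and the helper names are mine:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

x = np.array([1.0, 2.0, 3.0])                  # example input
w = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])                     # example weights
b = np.array([0.1, -0.2])                      # example biases
y = np.array([1.0, 0.0])                       # example target

def C(w):
    y_hat = g(x @ w + b)
    return np.sum((y_hat - y) ** 2)            # cost, equation (13)

# analytic gradient from equation (15)
a = x @ w + b
y_hat = g(a)
analytic = x[0] * g_prime(a[0]) * 2.0 * (y_hat[0] - y[0])

# finite-difference gradient with respect to w11
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0, 0] += eps
w_minus[0, 0] -= eps
numeric = (C(w_plus) - C(w_minus)) / (2.0 * eps)
print(analytic, numeric)                       # the two values agree
```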
As before, these formulas are already enough to compute the gradients for backpropagation, but written this way they are verbose and error-prone. It is therefore worth reorganizing them so that they can be expressed and computed with vectors and matrices.
We arrange the gradients of the individual parameters in order, starting with the first column of $\omega$:

$$\begin{bmatrix} \frac{\partial C}{\partial \omega_{11}} \\ \frac{\partial C}{\partial \omega_{21}} \\ \frac{\partial C}{\partial \omega_{31}} \end{bmatrix} = \begin{bmatrix} x_1 \cdot g'(a_1) \cdot 2(\hat{y}_1 - y_1) \\ x_2 \cdot g'(a_1) \cdot 2(\hat{y}_1 - y_1) \\ x_3 \cdot g'(a_1) \cdot 2(\hat{y}_1 - y_1) \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} g'(a_1) \cdot 2(\hat{y}_1 - y_1) \end{bmatrix} \tag{23}$$
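Equation (23) says the first column of the weight gradient is just the column vector $x^T$ scaled by the single number $g'(a_1) \cdot 2(\hat{y}_1 - y_1)$. A tiny check with a made-up stand-in value for that scalar (the value 0.7 below is purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])                     # example input
scale = 0.7                                       # stands in for g'(a1) * 2 * (y_hat1 - y1)

elementwise = np.array([x[0] * scale,             # dC/dw11, equation (15)
                        x[1] * scale,             # dC/dw21, equation (16)
                        x[2] * scale])            # dC/dw31, equation (17)
vectorized = x.reshape(3, 1) * scale              # x^T times the scalar, equation (23)
print(np.allclose(elementwise, vectorized.ravel()))  # True
```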
Next comes the second column of $\omega$:

$$\begin{bmatrix} \frac{\partial C}{\partial \omega_{12}} \\ \frac{\partial C}{\partial \omega_{22}} \\ \frac{\partial C}{\partial \omega_{32}} \end{bmatrix} = \begin{bmatrix} x_1 \cdot g'(a_2) \cdot 2(\hat{y}_2 - y_2) \\ x_2 \cdot g'(a_2) \cdot 2(\hat{y}_2 - y_2) \\ x_3 \cdot g'(a_2) \cdot 2(\hat{y}_2 - y_2) \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} g'(a_2) \cdot 2(\hat{y}_2 - y_2) \end{bmatrix} \tag{24}$$
Combining the two columns into a single matrix:

$$\begin{aligned} \begin{bmatrix} \frac{\partial C}{\partial \omega_{11}} & \frac{\partial C}{\partial \omega_{12}} \\ \frac{\partial C}{\partial \omega_{21}} & \frac{\partial C}{\partial \omega_{22}} \\ \frac{\partial C}{\partial \omega_{31}} & \frac{\partial C}{\partial \omega_{32}} \end{bmatrix} &= \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \begin{bmatrix} g'(a_1) \cdot 2(\hat{y}_1 - y_1) & g'(a_2) \cdot 2(\hat{y}_2 - y_2) \end{bmatrix} \\ &= x^T \begin{bmatrix} g'(a_1) \cdot 2(\hat{y}_1 - y_1) & g'(a_2) \cdot 2(\hat{y}_2 - y_2) \end{bmatrix} \\ &= x^T \left( \begin{bmatrix} g'(a_1) & g'(a_2) \end{bmatrix} \odot \begin{bmatrix} 2(\hat{y}_1 - y_1) & 2(\hat{y}_2 - y_2) \end{bmatrix} \right) \end{aligned} \tag{25}$$
Note: $\odot$ in the formulas above denotes the elementwise (Hadamard) product, in which the entries of the two vectors (matrices) are multiplied position by position; it is not ordinary matrix multiplication.
Finally, in compact form:

$$\frac{\partial C}{\partial \omega} = x^T \left( \begin{bmatrix} g'(a_1) & g'(a_2) \end{bmatrix} \odot \begin{bmatrix} 2(\hat{y}_1 - y_1) & 2(\hat{y}_2 - y_2) \end{bmatrix} \right) \tag{26}$$
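As a check of equation (26), the sketch below computes the weight gradient twice, once entry by entry from equations (15)–(21) and once with the vectorized expression $x^T \bigl(g'(a) \odot 2(\hat{y} - y)\bigr)$, and confirms the two matrices match (all values are made-up examples):

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

x = np.array([[1.0, 2.0, 3.0]])        # 1x3 example input
w = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])             # example weights
b = np.array([[0.1, -0.2]])            # example biases
y = np.array([[1.0, 0.0]])             # example target

a = x @ w + b                          # 1x2 pre-activations
y_hat = g(a)                           # 1x2 predictions

# entry-by-entry gradients, equations (15)-(21)
elementwise = np.array([[x[0, i] * g_prime(a[0, j]) * 2.0 * (y_hat[0, j] - y[0, j])
                         for j in range(2)] for i in range(3)])

# vectorized gradient, equation (26): x^T times the elementwise product
vectorized = x.T @ np.multiply(g_prime(a), 2.0 * (y_hat - y))
print(np.allclose(elementwise, vectorized))   # True
```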
For the bias $b$, we compute the gradient and organize it in the same way. (There is a small simplification here: when the bias is stored as a 1-D array and several samples are processed at once, the result below also has to be averaged down each column, over the batch, so that its shape matches the bias.)
$$\begin{aligned} \frac{\partial C}{\partial b} &= \begin{bmatrix} \frac{\partial C}{\partial b_1} & \frac{\partial C}{\partial b_2} \end{bmatrix} \\ &= \begin{bmatrix} g'(a_1) \cdot 2(\hat{y}_1 - y_1) & g'(a_2) \cdot 2(\hat{y}_2 - y_2) \end{bmatrix} \\ &= \begin{bmatrix} g'(a_1) & g'(a_2) \end{bmatrix} \odot \begin{bmatrix} 2(\hat{y}_1 - y_1) & 2(\hat{y}_2 - y_2) \end{bmatrix} \end{aligned} \tag{27}$$
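Equation (27) gives the bias gradient for a single sample. When several samples are processed at once and $b$ is stored as a 1-D array, as the code below does, the per-sample rows are averaged down each column so that the result has the same shape as $b$. A small sketch of just that step, with made-up values:

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

a = np.array([[0.5, -1.2],
              [0.3,  0.8]])                                  # example pre-activations, 2 samples
y_hat = g(a)
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])                                   # example targets

per_sample = np.multiply(g_prime(a), 2.0 * (y_hat - y))     # equation (27), one row per sample
db = np.mean(per_sample, axis=0)                            # average over the batch
print(db.shape)                                             # (2,), same shape as b
```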
These are the gradient-descent formulas for a single fully connected layer followed by an activation function.
VI. Code
Here I use two training samples and store the bias as a 1-D array, so when updating the bias we take the mean of the returned result over the batch. See the backward function for details.
```python
import numpy as np

param = {}           # trainable parameters: weight matrix 'w' and bias 'b'
nodes = {}           # intermediate values cached by the forward pass
learning_rate = 0.1


def sigmoid(x):
    # equation (1)
    return 1.0 / (1. + np.exp(-x))


def sigmoid_gradient(x):
    # g'(x) = g(x)(1 - g(x)), equation (1)
    sig = sigmoid(x)
    return sig * (1. - sig)


def cost(y_pred, y):
    # sum of squared errors, equation (13)
    return np.sum((y_pred - y) ** 2)


def cost_gradient(y_pred, y):
    # dC/dy_hat, equation (14)
    return 2 * (y_pred - y)


def forward(x):
    nodes['matmul'] = np.matmul(x, param['w'])
    nodes['bias'] = nodes['matmul'] + param['b']      # pre-activation a
    nodes['sigmoid'] = sigmoid(nodes['bias'])         # prediction y_hat
    return nodes['sigmoid']


def backward(x, y_pred, y):
    # elementwise product of g'(a) and dC/dy_hat, the common factor in (26) and (27)
    matrix = np.multiply(sigmoid_gradient(nodes['bias']), cost_gradient(y_pred, y))
    # bias gradient: average over the batch so it matches the 1-D bias, equation (27)
    matrix2 = np.mean(matrix, 0, keepdims=False)
    # weight gradient x^T (g'(a) elementwise dC/dy_hat), equation (26)
    param['w'] -= learning_rate * np.matmul(np.transpose(x), matrix)
    param['b'] -= learning_rate * matrix2


def setup():
    x = np.array([[1., 2., 3.],
                  [3., 2., 1.]])        # two training samples
    y = np.array([[1., 0.],
                  [0., 1.]])            # their target outputs
    param['w'] = np.array([[.1, .2], [.3, .4], [.5, .6]])
    param['b'] = np.array([0., 0.])
    for i in range(1000):
        y_pred = forward(x)
        backward(x, y_pred, y)
        print("Before gradient step:", y_pred,
              "\nAfter gradient step:", forward(x),
              "\ncost:", cost(forward(x), y))


if __name__ == '__main__':
    setup()
```
The results are shown below. As you can see, the outputs move steadily toward the target values and the cost keeps decreasing, which indicates that our implementation is correct.
Before gradient step: [[0.90024951 0.94267582]
 [0.80218389 0.88079708]]
After gradient step: [[0.87638772 0.93574933]
 [0.74070904 0.87317416]]
cost: 1.4556414717601052
Before gradient step: [[0.87638772 0.93574933]
 [0.74070904 0.87317416]]
After gradient step: [[0.84537106 0.92722992]
 [0.66043307 0.86435062]]
cost: 1.3382380273209387
Before gradient step: [[0.84537106 0.92722992]
 [0.66043307 0.86435062]]
After gradient step: [[0.80943371 0.91658752]
 [0.56909364 0.85403973]]
cost: 1.2216201634346886
Before gradient step: [[0.80943371 0.91658752]
 [0.56909364 0.85403973]]
After gradient step: [[0.77530479 0.90307287]
 [0.48379806 0.84187495]]
cost: 1.1250926413642042
Before gradient step: [[0.77530479 0.90307287]
 [0.48379806 0.84187495]]
After gradient step: [[0.74994151 0.88562481]
 [0.41750738 0.8273968 ]]
cost: 1.050964830757349
Before gradient step: [[0.74994151 0.88562481]
 [0.41750738 0.8273968 ]]
After gradient step: [[0.73518018 0.86276177]
 [0.37075203 0.81005788]]
cost: 0.9880224900744513
Before gradient step: [[0.73518018 0.86276177]
 [0.37075203 0.81005788]]
After gradient step: [[0.7288795 0.83251337]
 [0.33814879 0.78928408]]
cost: 0.9253306342190869
Before gradient step: [[0.7288795 0.83251337]
 [0.33814879 0.78928408]]
After gradient step: [[0.72817698 0.7925729 ]
 [0.31464068 0.76467358]]
cost: 0.8564368364354394
Before gradient step: [[0.72817698 0.7925729 ]
 [0.31464068 0.76467358]]
After gradient step: [[0.73084978 0.74107485]
 [0.29686131 0.73646017]]
cost: 0.7792136510576879
Before gradient step: [[0.73084978 0.74107485]
 [0.29686131 0.73646017]]
After gradient step: [[0.73542993 0.67843692]
 [0.28276129 0.70627592]]
cost: 0.6965017597370333
Before gradient step: [[0.73542993 0.67843692]
 [0.28276129 0.70627592]]
After gradient step: [[0.74100699 0.60952933]
 [0.27110861 0.67770711]]
cost: 0.6159759746541653
Before gradient step: [[0.74100699 0.60952933]
 [0.27110861 0.67770711]]
After gradient step: [[0.7470327 0.54300877]
 [0.2611523 0.65537568]]
cost: 0.5458174231304158
Before gradient step: [[0.7470327 0.54300877]
 [0.2611523 0.65537568]]
After gradient step: [[0.75318337 0.48629069]
 [0.25242337 0.64233397]]
cost: 0.48903961977358096
Before gradient step: [[0.75318337 0.48629069]
 [0.25242337 0.64233397]]
After gradient step: [[0.75927196 0.44162035]
 [0.24462032 0.63846146]]
cost: 0.4435277424325401
Before gradient step: [[0.75927196 0.44162035]
 [0.24462032 0.63846146]]
After gradient step: [[0.76519387 0.40729807]
 [0.23754304 0.64153151]]
cost: 0.4059519984224861
Before gradient step: [[0.76519387 0.40729807]
 [0.23754304 0.64153151]]
After gradient step: [[0.77089406 0.38056044]
 [0.23105412 0.64898998]]
cost: 0.3739098246421919
Before gradient step: [[0.77089406 0.38056044]
 [0.23105412 0.64898998]]
After gradient step: [[0.77634715 0.35906729]
 [0.22505587 0.65883242]]
cost: 0.3459953751213052
Before gradient step: [[0.77634715 0.35906729]
 [0.22505587 0.65883242]]
After gradient step: [[0.78154526 0.34118504]
 [0.21947641 0.66972652]]
cost: 0.3213801718742878
Before gradient step: [[0.78154526 0.34118504]
 [0.21947641 0.66972652]]
After gradient step: [[0.78649079 0.32585178]
 [0.21426107 0.68086606]]
cost: 0.29951983756020983
......
Before gradient step: [[0.97352909 0.02666433]
 [0.02647091 0.97333567]]
After gradient step: [[0.97354315 0.02664997]
 [0.02645685 0.97335003]]
cost: 0.002820371366327034
Before gradient step: [[0.97354315 0.02664997]
 [0.02645685 0.97335003]]
After gradient step: [[0.97355719 0.02663563]
 [0.02644281 0.97336437]]
cost: 0.002817357738697952
Before gradient step: [[0.97355719 0.02663563]
 [0.02644281 0.97336437]]
After gradient step: [[0.97357121 0.02662131]
 [0.02642879 0.97337869]]
cost: 0.002814350458880144
Before gradient step: [[0.97357121 0.02662131]
 [0.02642879 0.97337869]]
After gradient step: [[0.9735852 0.02660701]
 [0.0264148 0.97339299]]
cost: 0.0028113495069739336
Before gradient step: [[0.9735852 0.02660701]
 [0.0264148 0.97339299]]
After gradient step: [[0.97359917 0.02659274]
 [0.02640083 0.97340726]]
cost: 0.002808354863162452
Before gradient step: [[0.97359917 0.02659274]
 [0.02640083 0.97340726]]
After gradient step: [[0.97361312 0.02657849]
 [0.02638688 0.97342151]]
cost: 0.0028053665077110495
Before gradient step: [[0.97361312 0.02657849]
 [0.02638688 0.97342151]]
After gradient step: [[0.97362705 0.02656426]
 [0.02637295 0.97343574]]
cost: 0.002802384420967047
Before gradient step: [[0.97362705 0.02656426]
 [0.02637295 0.97343574]]
After gradient step: [[0.97364096 0.02655005]
 [0.02635904 0.97344995]]
cost: 0.002799408583359122
Before gradient step: [[0.97364096 0.02655005]
 [0.02635904 0.97344995]]
After gradient step: [[0.97365484 0.02653587]
 [0.02634516 0.97346413]]
cost: 0.002796438975397068
Before gradient step: [[0.97365484 0.02653587]
 [0.02634516 0.97346413]]
After gradient step: [[0.97366871 0.02652171]
 [0.02633129 0.97347829]]
cost: 0.002793475577671274
Before gradient step: [[0.97366871 0.02652171]
 [0.02633129 0.97347829]]
After gradient step: [[0.97368255 0.02650757]
 [0.02631745 0.97349243]]
cost: 0.0027905183708523346
Before gradient step: [[0.97368255 0.02650757]
 [0.02631745 0.97349243]]
After gradient step: [[0.97369637 0.02649345]
 [0.02630363 0.97350655]]
cost: 0.002787567335690624
Before gradient step: [[0.97369637 0.02649345]
 [0.02630363 0.97350655]]
After gradient step: [[0.97371017 0.02647935]
 [0.02628983 0.97352065]]
cost: 0.0027846224530159356