Conventions
The neural network used in this section is shown in the figure. In this network:
- There are three layers in total, numbered 0, 1 and 2 (layer 0 is the input layer)
- The input is a vector $x$ with two components, $x_1$ and $x_2$
- $a^{(i)}$ denotes the output vector of layer $i$, and $a_j^{(i)}$ denotes the $j$-th component of that vector; note that $x = a^{(0)}$
- $W^{(i)}$ denotes the edges (i.e. the weights) connecting layer $i-1$ to layer $i$, and $w_{jk}^{(i)}$ is the weight of the edge connecting node $j$ of layer $i$ to node $k$ of layer $i-1$; note that the subscript order $jk$ reads from right (previous layer) to left (current layer)
- $z^{(i)} = W^{(i)} \times a^{(i-1)}$
- $a^{(i)} = \mathrm{sigmoid}(z^{(i)})$, i.e. the activation function used in this network is the sigmoid
- $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
- The loss function is the cross-entropy $J(W, x, y) = -\sum_{i=1}^{2}\left[y \ln(a_i^{(2)}) + (1-y)\ln(1 - a_i^{(2)})\right]$
- For simplicity, bias terms are not considered in the derivation (a small code sketch of these conventions follows this list)
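As a concrete illustration of the conventions above, here is a minimal NumPy sketch of one forward pass; the layer sizes (two nodes per layer) and all weight values are made up for the example:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

# Made-up input and weights; only the shapes and the formulas matter here.
x = np.array([[0.3], [0.7]])              # a^(0) = x, shape (2, 1)
W1 = np.array([[0.1, 0.4],                # W^(1): connects layer 0 to layer 1
               [0.2, 0.5]])
W2 = np.array([[0.3, 0.6],                # W^(2): connects layer 1 to layer 2
               [0.7, 0.1]])

a0 = x
z1 = W1 @ a0        # z^(1) = W^(1) a^(0)
a1 = sigmoid(z1)    # a^(1) = sigmoid(z^(1))
z2 = W2 @ a1        # z^(2) = W^(2) a^(1)
a2 = sigmoid(z2)    # a^(2): the network output
print(a2)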
Derivation
Our goal is to minimize $J(W, x, y)$, where $W$ stands for all of the weights, while $x$ and $y$ come from the training data and are therefore constants. So we need $\frac{\partial J}{\partial W}$; specifically, we need $\frac{\partial J}{\partial W^{(2)}}$ and $\frac{\partial J}{\partial W^{(1)}}$:
$\frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial W^{(2)}}$
$\frac{\partial J}{\partial a^{(2)}} = \left(-y \ln(a^{(2)}) - (1-y)\ln(1-a^{(2)})\right)' = \frac{1-y}{1-a^{(2)}} - \frac{y}{a^{(2)}}$
$\frac{\partial a^{(2)}}{\partial z^{(2)}} = a^{(2)} \cdot (1-a^{(2)})$
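This is the standard derivative of the sigmoid and can be checked directly from its definition:

$$\frac{d}{dz}\,\mathrm{sigmoid}(z) = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \left(1 - \frac{1}{1+e^{-z}}\right) = \mathrm{sigmoid}(z)\,(1 - \mathrm{sigmoid}(z))$$

and since $a^{(2)} = \mathrm{sigmoid}(z^{(2)})$, the expression above follows.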
Combining these two partial derivatives gives
$\frac{\partial J}{\partial z^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} = (1-y)\cdot a^{(2)} - y\cdot(1-a^{(2)})$
Since $y$ can only be 0 or 1: when $y = 0$, the expression above equals
$a^{(2)} = a^{(2)} - 0 = a^{(2)} - y$
and when $y = 1$, it equals
$-(1 - a^{(2)}) = a^{(2)} - 1 = a^{(2)} - y$
So in both cases
$\frac{\partial J}{\partial z^{(2)}} = a^{(2)} - y$
Let $\delta^{(2)} = \frac{\partial J}{\partial z^{(2)}} = a^{(2)} - y$
Since
$z^{(2)} = W^{(2)} \times a^{(1)}$
we have
$\frac{\partial z^{(2)}}{\partial W^{(2)}} = a^{(1)}$, or element-wise, $\frac{\partial z_j^{(2)}}{\partial w_{jk}^{(2)}} = a_k^{(1)}$
Putting these together,
$\frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial W^{(2)}} = \delta^{(2)} \times (a^{(1)})^T$, i.e. $\frac{\partial J}{\partial w_{jk}^{(2)}} = \delta_j^{(2)} a_k^{(1)}$
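A quick way to sanity-check this result is to compare $\delta^{(2)} (a^{(1)})^T$ with a finite-difference estimate of the gradient. The sketch below does that for made-up values of $W^{(2)}$, $a^{(1)}$ and $y$ (none of these names or numbers are from the original text):

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def loss(W2, a1, y):
    # J = -sum_i [ y*ln(a2_i) + (1-y)*ln(1-a2_i) ] with a2 = sigmoid(W2 a1)
    a2 = sigmoid(W2 @ a1)
    return float(-np.sum(y * np.log(a2) + (1 - y) * np.log(1 - a2)))

# Made-up values for the check.
a1 = np.array([[0.4], [0.9]])
W2 = np.array([[0.3, -0.2], [0.5, 0.1]])
y = 1.0

# Analytic gradient: delta2 = a2 - y, dJ/dW2 = delta2 (a1)^T
a2 = sigmoid(W2 @ a1)
dW2 = (a2 - y) @ a1.T

# Finite-difference gradient for comparison.
eps = 1e-6
numeric = np.zeros_like(W2)
for j in range(W2.shape[0]):
    for k in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        numeric[j, k] = (loss(Wp, a1, y) - loss(Wm, a1, y)) / (2 * eps)

print(np.max(np.abs(dW2 - numeric)))  # should be very small, around 1e-9 or less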
Now we compute $\frac{\partial J}{\partial W^{(1)}}$:
$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial W^{(1)}}$
Since
$z^{(2)} = W^{(2)} \times a^{(1)}$
we have
$\frac{\partial z^{(2)}}{\partial a^{(1)}} = W^{(2)}$
Each $a_k^{(1)}$ feeds into every $z_j^{(2)}$, so $\frac{\partial J}{\partial a_k^{(1)}} = \sum_j w_{jk}^{(2)} \delta_j^{(2)}$, which in matrix form uses the transpose of $W^{(2)}$:
$\frac{\partial J}{\partial a^{(1)}} = \frac{\partial J}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial a^{(1)}} = W^{(2)T} \times \delta^{(2)}$
Since
$\frac{\partial a^{(1)}}{\partial z^{(1)}} = a^{(1)} \cdot (1-a^{(1)})$
it follows that
$\frac{\partial J}{\partial z^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} = W^{(2)T} \times \delta^{(2)} \cdot a^{(1)} \cdot (1 - a^{(1)})$, where $\cdot$ denotes element-wise multiplication
Let $\delta^{(1)} = \frac{\partial J}{\partial z^{(1)}} = W^{(2)T} \times \delta^{(2)} \cdot a^{(1)} \cdot (1 - a^{(1)})$
Also, since
$z^{(1)} = W^{(1)} \times a^{(0)}$
we have
$\frac{\partial z^{(1)}}{\partial W^{(1)}} = a^{(0)}$
Putting these together,
$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial W^{(1)}} = \delta^{(1)} \times (a^{(0)})^T$
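Collecting both layers, the whole backward pass for one training example reduces to four lines. Below is a minimal sketch of those formulas, assuming the same made-up 2-2-2 network as in the earlier forward-pass sketch; it is an illustration of the derivation, not the implementation that follows:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

# Made-up input, label and weights (no biases, per the conventions above).
a0 = np.array([[0.3], [0.7]])                  # a^(0) = x
y = 1.0                                        # scalar label, 0 or 1
W1 = np.array([[0.1, 0.4], [0.2, 0.5]])
W2 = np.array([[0.3, 0.6], [0.7, 0.1]])

# Forward pass.
a1 = sigmoid(W1 @ a0)
a2 = sigmoid(W2 @ a1)

# Backward pass, exactly the formulas derived above.
delta2 = a2 - y                                # dJ/dz^(2)
dW2 = delta2 @ a1.T                            # dJ/dW^(2) = delta^(2) (a^(1))^T
delta1 = (W2.T @ delta2) * a1 * (1 - a1)       # dJ/dz^(1)
dW1 = delta1 @ a0.T                            # dJ/dW^(1) = delta^(1) (a^(0))^T
print(dW2, dW1, sep='\n')

In the implementation below these gradients drive a plain gradient-descent update, $W^{(i)} \leftarrow W^{(i)} - \alpha \frac{\partial J}{\partial W^{(i)}}$ with learning rate $\alpha$, averaged over the $m$ training examples.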
Neural Network Implementation
import numpy as np
import matplotlib.pyplot as plt
def init_model(X, Y, layers):
    # layers lists the hidden layer sizes; input/output sizes come from X and Y.
    model = {}
    nodes = []
    nodes.append(X.shape[0])
    for l in layers:
        nodes.append(l)
    nodes.append(Y.shape[0])
    model['depth'] = len(layers) + 1
    for n in range(model['depth']):
        # W(n+1) connects layer n to layer n+1; B(n+1) is the bias of layer n+1.
        model['W' + str(n + 1)] = np.random.rand(nodes[n+1], nodes[n])
        model['B' + str(n + 1)] = np.random.rand(nodes[n+1], 1)
    return model
def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))
def forward_propagation(X, Y, model):
    # Computes Z(i) and A(i) layer by layer and caches them in model for backprop.
    # Y is unused here; it is kept only so the functions share a similar signature.
    result = X.copy()
    model['A0'] = result.copy()
    for i in range(model['depth']):
        W = model['W' + str(i+1)]
        B = model['B' + str(i+1)]
        result = np.dot(W, result) + B
        model['Z' + str(i+1)] = result.copy()
        result = sigmoid(result)
        model['A' + str(i+1)] = result.copy()
    return result
def cross_entropy(Y, result):
    return -Y * np.log(result) - (1 - Y) * np.log(1 - result)

def cost(Y, result):
    # Average cross-entropy over the m training examples (columns of Y).
    return np.sum(cross_entropy(Y, result)) / Y.shape[1]
def backward_propagation(Y, model, learning_rate):
    L = model['depth']
    m = Y.shape[1]
    # Output layer: delta(L) = A(L) - Y
    delta = model['A' + str(L)] - Y
    model['dW' + str(L)] = np.dot(delta, model['A' + str(L-1)].T) / m
    model['dB' + str(L)] = np.sum(delta, axis=1, keepdims=True) / m
    # Hidden layers: delta(l) = W(l+1)^T delta(l+1) * A(l) * (1 - A(l))
    for l in range(L - 1, 0, -1):
        delta = np.dot(model['W' + str(l+1)].T, delta) * model['A' + str(l)] * (1 - model['A' + str(l)])
        model['dW' + str(l)] = np.dot(delta, model['A' + str(l-1)].T) / m
        model['dB' + str(l)] = np.sum(delta, axis=1, keepdims=True) / m
    # Gradient descent step on every layer.
    for l in range(model['depth'], 0, -1):
        model['W' + str(l)] -= learning_rate * model['dW' + str(l)]
        model['B' + str(l)] -= learning_rate * model['dB' + str(l)]
def train(X, Y, model, learning_rate=0.05, times=50, show_cost_history=True):
    cost_history = []
    for t in range(times):
        result = forward_propagation(X, Y, model)
        backward_propagation(Y, model, learning_rate)
        cost_history.append(cost(Y, result))
    if show_cost_history:
        plt.plot(cost_history)
        plt.show()
if __name__ == '__main__':
    # Synthetic data: the label is 1 only when both inputs are below 0.5.
    m = 1000
    X = np.random.rand(2, m)
    Y = np.zeros((1, m))
    for i in range(m):
        if (X[0, i] < 0.5) and (X[1, i] < 0.5):
            Y[0, i] = 1
    # One hidden layer with 2 nodes.
    model = init_model(X, Y, [2])
    train(X, Y, model)
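After training, one might want to check how well the model fits the synthetic data. The following is not part of the original listing; it is a minimal sketch assuming the functions above are in scope:

def accuracy(X, Y, model):
    # Threshold the network output at 0.5 and compare against the labels.
    result = forward_propagation(X, Y, model)
    predictions = (result > 0.5).astype(float)
    return np.mean(predictions == Y)

# Example usage, after train(X, Y, model):
# print('training accuracy:', accuracy(X, Y, model))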