Assumptions
- Cross-entropy loss function
- Sigmoid activation function
- Softmax at the output layer
Network structure
- Fully connected; the output layer uses softmax instead of a sigmoid
Goals
- Derive the four backpropagation equations under the assumptions above
- Implement handwritten-digit recognition in Python (two hidden layers, 192 neurons in the first hidden layer and 30 in the second)
Here, $w^{l}_{ij}$ denotes the weight from the $j$-th neuron in layer $l$ to the $i$-th neuron in layer $l+1$; it can be read as the corresponding element of $(W^{l})^{T}$. $b^{l}_{i}$ denotes the bias of the $i$-th neuron in layer $l+1$ (on the connection from layer $l$ to layer $l+1$).
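In matrix form this convention gives the forward pass below (a restatement of the component-wise equation (9) used later, with $h^{l}$ the activation vector of layer $l$; at the output layer the sigmoid is replaced by softmax):
$$z^{l+1}=(W^{l})^{T}h^{l}+b^{l},\qquad h^{l+1}=\sigma(z^{l+1}),\qquad W^{l}\in R^{d_{l}\times d_{l+1}},\ b^{l}\in R^{d_{l+1}\times1}$$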
Derivation of the BP equations
- Approach: derive the result for a single neuron $j$, then extend it to the whole layer
- Introduce the intermediate variable $\delta^{l}_{j}=\frac{\partial E}{\partial z^{l}_{j}}$, the error of the $j$-th neuron in layer $l$, where $E$ denotes the loss function
BP1
Let the output layer of the network be layer $L$. By the softmax function, the output of the $j$-th neuron in layer $L$ is
$$h^{L}_{j}=\frac{e^{z^{L}_{j}}}{\sum_{k=1}^{d_{L}}e^{z^{L}_{k}}}\qquad(1)$$
Since the loss function is the cross-entropy loss,
$$E=-\sum_{j=1}^{d_{L}}y_{j}\ln h^{L}_{j}\qquad(2)$$
Since the label $y$ is one-hot ($y_{i}=1$ and every other entry is 0),
$$E=-\ln h^{L}_{i}\qquad(3)$$
Further, combining (1) and (3),
$$E=-\ln h^{L}_{i}=-\ln\frac{e^{z^{L}_{i}}}{\sum_{j=1}^{d_{L}}e^{z^{L}_{j}}}=-z^{L}_{i}+\ln\sum_{j=1}^{d_{L}}e^{z^{L}_{j}}\qquad(4)$$
We now compute $\delta^{L}_{j}=\frac{\partial E}{\partial z^{L}_{j}}$ by cases. When $j=i$,
$$\delta^{L}_{i}=\frac{\partial E}{\partial z^{L}_{i}}=-1+h^{L}_{i}\qquad(5)$$
When $j\neq i$,
$$\delta^{L}_{j}=\frac{\partial E}{\partial z^{L}_{j}}=0+h^{L}_{j}\qquad(6)$$
Combining (5) and (6),
$$\delta^{L}_{j}=\frac{\partial E}{\partial z^{L}_{j}}=h^{L}_{j}-y_{j}\qquad(7)$$
Extending to matrix form gives BP1:
$$\delta^{L}=\frac{\partial E}{\partial z^{L}}=h^{L}-y,\qquad\delta^{L}\in R^{d_{L}\times1}\qquad(8)$$
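As a quick sanity check (not part of the derivation), a minimal NumPy sketch that compares the analytic gradient $h^{L}-y$ with a finite-difference gradient of the softmax cross-entropy loss; the class index 3 and the tolerance are arbitrary choices:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

np.random.seed(0)
z = np.random.randn(10)                    # logits z^L
y = np.zeros(10); y[3] = 1.0               # one-hot label with i = 3

loss = lambda v: -np.log(softmax(v)[3])    # cross-entropy with a one-hot label

analytic = softmax(z) - y                  # BP1: delta^L = h^L - y
numeric = np.zeros_like(z)
eps = 1e-6
for k in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numeric[k] = (loss(zp) - loss(zm)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True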
BP2
BP2 is the key step of the whole derivation. It links adjacent layers, i.e. it relates $\delta^{l+1}$ to $\delta^{l}$. Since $\delta^{l+1}$ and $\delta^{l}$ are, by definition, the errors of $E$ with respect to $z^{l+1}$ and $z^{l}$, the problem reduces to finding the relationship between $z^{l+1}$ and $z^{l}$.
As the figure above shows, the $j$-th neuron of layer $l+1$, $z^{l+1}_{j}$, receives weighted inputs from every neuron in layer $l$ (a many-to-one relationship), i.e.
$$z^{l+1}_{j}=\sum_{k=1}^{d_{l}}w^{l}_{jk}h^{l}_{k}+b^{l}_{j}\qquad(9)$$
For the $j$-th neuron in layer $l$, $z^{l}_{j}$ and $h^{l}_{j}$ differ only by a sigmoid:
$$h^{l}_{j}=\sigma(z^{l}_{j})\qquad(10)$$
Combining (9) and (10) gives the relation between $z^{l}$ and $z^{l+1}$:
$$z^{l+1}_{j}=\sum_{k=1}^{d_{l}}w^{l}_{jk}\sigma(z^{l}_{k})+b^{l}_{j}\qquad(11)$$
Note that the $j$-th neuron in layer $l$ sends weighted connections to every neuron in layer $l+1$ (a one-to-many relationship), so
$$\delta^{l}_{j}=\frac{\partial E}{\partial z^{l}_{j}}=\sum_{k=1}^{d_{l+1}}\frac{\partial E}{\partial z^{l+1}_{k}}\frac{\partial z^{l+1}_{k}}{\partial z^{l}_{j}}=\sum_{k=1}^{d_{l+1}}\delta^{l+1}_{k}w^{l}_{kj}\sigma'(z^{l}_{j})\qquad(12)$$
Extending to matrix form, with $\delta^{l+1}\in R^{d_{l+1}\times1}$, $W^{l}\in R^{d_{l}\times d_{l+1}}$, and $z^{l}\in R^{d_{l}\times1}$, gives BP2:
$$\delta^{l}=\frac{\partial E}{\partial z^{l}}=\sigma'(z^{l})\odot(W^{l}\delta^{l+1}),\qquad\delta^{l}\in R^{d_{l}\times1}\qquad(13)$$
($\odot$ denotes the Hadamard product, i.e. element-wise multiplication.)
BP2 is the trickiest equation and the indices are easy to mix up; the key is to understand the many-to-many relationships between layers.
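A minimal NumPy sketch of BP2 under the same convention ($z^{l+1}=(W^{l})^{T}\sigma(z^{l})+b^{l}$), checked against a finite-difference gradient. The downstream loss is taken to be the toy choice $E=\sum_{j}z^{l+1}_{j}$ so that $\delta^{l+1}$ is a vector of ones; the layer sizes 4 and 3 are arbitrary:

import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

np.random.seed(1)
d_l, d_lp1 = 4, 3
W = np.random.randn(d_l, d_lp1)        # W^l, shape (d_l, d_{l+1})
b = np.random.randn(d_lp1, 1)          # b^l
z_l = np.random.randn(d_l, 1)          # pre-activation of layer l

# Toy downstream loss E = sum(z^{l+1}), so delta^{l+1} is a vector of ones
E = lambda z: (W.T @ sigmoid(z) + b).sum()

delta_lp1 = np.ones((d_lp1, 1))
sig = sigmoid(z_l)
delta_l = sig * (1 - sig) * (W @ delta_lp1)        # BP2: sigma'(z^l) ⊙ (W^l delta^{l+1})

numeric = np.zeros((d_l, 1))
eps = 1e-6
for k in range(d_l):
    dz = np.zeros((d_l, 1)); dz[k] = eps
    numeric[k] = (E(z_l + dz) - E(z_l - dz)) / (2 * eps)
print(np.allclose(delta_l, numeric))               # expected: True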
BP3
With BP2 in hand, we can derive $\frac{\partial E}{\partial b^{l-1}_{j}}$ (it is written as $\frac{\partial E}{\partial b^{l-1}_{j}}$ rather than $\frac{\partial E}{\partial b^{l}_{j}}$ only to keep the indexing consistent with BP2); these are the parameters that backpropagation actually updates.
Starting from equation (9) above,
$$z^{l+1}_{j}=\sum_{k=1}^{d_{l}}w^{l}_{jk}h^{l}_{k}+b^{l}_{j}\qquad(9)$$
a simple index shift gives
$$z^{l}_{j}=\sum_{k=1}^{d_{l-1}}w^{l-1}_{jk}h^{l-1}_{k}+b^{l-1}_{j}\qquad(14)$$
Hence
$$\frac{\partial E}{\partial b^{l-1}_{j}}=\frac{\partial E}{\partial z^{l}_{j}}\frac{\partial z^{l}_{j}}{\partial b^{l-1}_{j}}=\delta^{l}_{j}$$
Extending to matrix form gives BP3:
$$\frac{\partial E}{\partial b^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial b^{l-1}}=\delta^{l}$$
BP4
BP4 derives $\frac{\partial E}{\partial w^{l-1}_{jk}}$ from equation (14):
$$z^{l}_{j}=\sum_{k=1}^{d_{l-1}}w^{l-1}_{jk}h^{l-1}_{k}+b^{l-1}_{j}\qquad(14)$$
For each individual parameter $w^{l-1}_{jk}$,
$$\frac{\partial E}{\partial w^{l-1}_{jk}}=\frac{\partial E}{\partial z^{l}_{j}}\frac{\partial z^{l}_{j}}{\partial w^{l-1}_{jk}}=\delta^{l}_{j}h^{l-1}_{k}$$
Extending to matrix form, with $\delta^{l}\in R^{d_{l}\times1}$ and $h^{l-1}\in R^{d_{l-1}\times1}$, gives BP4:
$$\frac{\partial E}{\partial W^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l-1}}=h^{l-1}(\delta^{l})^{T},\qquad\frac{\partial E}{\partial W^{l-1}}\in R^{d_{l-1}\times d_{l}}$$
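A small NumPy sketch (shapes only, under the column-vector convention above) showing that the BP3/BP4 gradients have exactly the shapes of $b^{l-1}$ and $W^{l-1}$; the layer widths 5 and 4 are arbitrary examples:

import numpy as np

d_lm1, d_l = 5, 4                      # example layer widths
h_prev = np.random.randn(d_lm1, 1)     # h^{l-1}
delta_l = np.random.randn(d_l, 1)      # delta^l, obtained from BP1/BP2

grad_b = delta_l                       # BP3: dE/db^{l-1} = delta^l
grad_W = h_prev @ delta_l.T            # BP4: dE/dW^{l-1} = h^{l-1} (delta^l)^T
print(grad_b.shape, grad_W.shape)      # (4, 1) (5, 4), matching b^{l-1} and W^{l-1}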
The BP equations
- BP1: $\delta^{L}=\frac{\partial E}{\partial z^{L}}=h^{L}-y,\quad\delta^{L}\in R^{d_{L}\times1}$
- BP2: $\delta^{l}=\frac{\partial E}{\partial z^{l}}=\sigma'(z^{l})\odot(W^{l}\delta^{l+1}),\quad\delta^{l}\in R^{d_{l}\times1}$
- BP3: $\frac{\partial E}{\partial b^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial b^{l-1}}=\delta^{l}$
- BP4: $\frac{\partial E}{\partial W^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l-1}}=h^{l-1}(\delta^{l})^{T},\quad\frac{\partial E}{\partial W^{l-1}}\in R^{d_{l-1}\times d_{l}}$
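The parameters are then updated by gradient descent with learning rate $\eta$, which is exactly the update the code below implements:
$$W^{l-1}\leftarrow W^{l-1}-\eta\frac{\partial E}{\partial W^{l-1}},\qquad b^{l-1}\leftarrow b^{l-1}-\eta\frac{\partial E}{\partial b^{l-1}}$$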
Questions
- If the output layer uses a sigmoid and the loss is mean squared error, how do the BP equations change?
  Answer: only BP1 changes (see the sketch after this list).
- If ReLU is used as the activation function instead, how do the BP equations change?
  Answer: only BP2 changes.
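For reference, a sketch of the two modified equations, assuming the MSE loss is written as $E=\frac{1}{2}\|h^{L}-y\|^{2}$ (the factor $\frac{1}{2}$ is a common convention, not stated above; without it the gradient picks up a factor of 2):
$$\delta^{L}=\frac{\partial E}{\partial z^{L}}=(h^{L}-y)\odot\sigma'(z^{L})\quad\text{(BP1 with sigmoid output and MSE)}$$
$$\delta^{l}=\mathbf{1}[z^{l}>0]\odot(W^{l}\delta^{l+1})\quad\text{(BP2 with ReLU, where }\mathbf{1}[\cdot]\text{ is the indicator function)}$$
All other equations keep their form.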
Code implementation
- Python implementation of handwritten-digit recognition (two hidden layers, 192 neurons in the first hidden layer and 30 in the second)
- The input image has $28\times28=784$ pixels
import _pickle as cPickle
import gzip
import numpy as np


def softmax(x, axis=0):
    # Maximum along the given axis (here x is a column vector)
    row_max = x.max(axis=axis)
    # Subtract the maximum before exponentiating, otherwise exp(x)
    # can overflow and produce inf
    row_max = row_max.reshape(-1, 1)
    x = x - row_max
    # Exponentiate and normalize
    x_exp = np.exp(x)
    x_sum = np.sum(x_exp, axis=axis, keepdims=True)
    s = x_exp / x_sum
    return s


def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


def sigmoid_derivation(x):
    result = sigmoid(x)
    return result * (1 - result)
class NeuralNetwork:
    """Initialize the network"""
    def __init__(self, input_layer=784, hidden_layer1=192, hidden_layer2=30, output_layer=10,
                 learning_rate=0.1):
        """Initialize the network structure"""
        # Number of input-layer nodes
        self.input_layer = input_layer
        # Number of nodes in the first hidden layer
        self.hidden_layer1 = hidden_layer1
        # Number of nodes in the second hidden layer
        self.hidden_layer2 = hidden_layer2
        # Number of output-layer nodes
        self.output_layer = output_layer
        # Activation function and its derivative
        self.activation = sigmoid
        self.activation_derivation = sigmoid_derivation
        # Learning rate
        self.learning_rate = learning_rate
        """Initialize the weight and bias matrices"""
        # Input layer to first hidden layer
        self.W1 = np.random.randn(self.input_layer, self.hidden_layer1)
        self.b1 = np.random.randn(self.hidden_layer1, 1)
        # First hidden layer to second hidden layer
        self.W2 = np.random.randn(self.hidden_layer1, self.hidden_layer2)
        self.b2 = np.random.randn(self.hidden_layer2, 1)
        # Second hidden layer to output layer
        self.W3 = np.random.randn(self.hidden_layer2, self.output_layer)
        self.b3 = np.random.randn(self.output_layer, 1)

    """Forward propagation"""
    def forward(self, x):
        z2 = np.dot(self.W1.T, x) + self.b1    # z^2 = (W^1)^T x + b^1
        h2 = self.activation(z2)
        z3 = np.dot(self.W2.T, h2) + self.b2
        h3 = self.activation(z3)
        z4 = np.dot(self.W3.T, h3) + self.b3
        out = softmax(z4)                      # softmax output layer
        return z2, h2, z3, h3, z4, out
"""训练网络"""
def train(self, data, epochs=1):
X, Y = data
count = len(Y)
# 将Y转换为one-hot
new_Y = np.zeros((count, 10))
# 训练
for epoch in range(epochs):
for i in range(count):
new_Y[i][Y[i]] = 1.0
# 将X,new_Y转化成二维矩阵
x = np.array(X[i], ndmin=2).T
target = np.array(new_Y[i], ndmin=2).T
"""前向传播"""
z2, h2, z3, h3, z4, out = self.forward(x)
"""计算损失"""
loss = (-np.log(out) * target).sum()
print('loss:' + str(loss))
# if loss < 1e-3:
# break
"""反向传播,根据BP方程更新W和b"""
# BP1
delta_z4 = out - target
# 更新第二个隐层到输出层的W,b z4 = W3.T @ h3 + b3
delta_W3 = np.dot(h3, delta_z4.T)
self.W3 -= self.learning_rate * delta_W3
delta_b3 = delta_z4
self.b3 -= self.learning_rate * delta_b3
# 更新第一个隐层到第二个隐层的W,b z3 = W2.T @ h2 + b2, h3 = sigmoid(z3)
delta_h3 = np.dot(self.W3, delta_z4)
delta_z3 = np.multiply(self.activation_derivation(z3), delta_h3)
delta_W2 = np.dot(h2, delta_z3.T)
self.W2 -= self.learning_rate * delta_W2
delta_b2 = delta_z3
self.b2 -= self.learning_rate * delta_b2
# 更新输入层到第一个隐层的W,b z2 = W1.T @ x + b1, h2 = sigmoid(z2)
delta_h2 = np.dot(self.W2, delta_z3)
delta_z2 = np.multiply(self.activation_derivation(z2), delta_h2)
delta_W1 = np.dot(x, delta_z2.T)
self.W1 -= self.learning_rate * delta_W1
delta_b1 = delta_z2
self.b1 -= self.learning_rate * delta_b1
"""测试"""
def test(self, data):
X, Y = data
count = len(Y)
# 预测
res = 0
for i in range(0, count):
x = np.array(X[i], ndmin=2).T
z2, h2, z3, h3, z4, out = self.forward(x)
# 取out的最大值下标作为预测结果
predict = np.argmax(out)
# print(predict, Y[i])
if int(predict) == int(Y[i]):
res += 1
rating = res / count * 100
print("correct rating: %.2f" % rating + '%')

if __name__ == "__main__":
    # Instantiate the neural network
    net = NeuralNetwork()
    # Load the data
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f, encoding='latin1')
    f.close()
    net.train(training_data)
    net.test(test_data)
Logistic regression
- Handwritten-digit recognition with 10 one-vs-rest classifiers
import _pickle as cPickle
import gzip
import numpy as np
import torch
from torch import nn

nets = []


class LogisticRegression(nn.Module):
    def __init__(self, input_layer=784, hidden_layer1=192, hidden_layer2=30, output_layer=1):
        super(LogisticRegression, self).__init__()
        self.fc1 = nn.Linear(input_layer, hidden_layer1)
        self.fc2 = nn.Linear(hidden_layer1, hidden_layer2)
        self.fc3 = nn.Linear(hidden_layer2, output_layer)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        z2 = self.fc1(x)
        h2 = self.activation(z2)
        z3 = self.fc2(h2)
        h3 = self.activation(z3)
        z4 = self.fc3(h3)
        out = self.activation(z4)
        return out

def train(data, epochs=1):
    X, Y = data
    count = len(Y)
    for i in range(10):
        net = LogisticRegression()
        # Define the loss function and the optimizer
        criterion = nn.BCELoss()
        optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
        # Binary labels for the one-vs-rest classifier of digit i
        tmp_Y = np.where(Y == i, np.ones_like(Y), np.zeros_like(Y))
        for epoch in range(epochs):
            for j in range(count):
                x = torch.from_numpy(X[j])
                y = torch.from_numpy(tmp_Y[j:j + 1].astype(np.float32))
                out = net(x)
                loss = criterion(out, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            print("loss:{:.4f}".format(loss))
        nets.append(net)

def test(data):
    X, Y = data
    count = len(Y)
    res = 0
    with torch.no_grad():
        for n in range(count):
            out = 0
            index = 0
            print(Y[n])
            # Take the classifier with the largest output as the prediction
            for i in range(len(nets)):
                x = torch.from_numpy(X[n])
                x_out = nets[i](x)
                print('{} {}'.format(i, x_out.item()))
                if x_out > out:
                    out = x_out
                    index = i
            if index == Y[n]:
                res += 1
    print(res / count)

if __name__ == "__main__":
    # The 10 classifiers are instantiated inside train()
    # Load the data
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f, encoding='latin1')
    f.close()
    train(training_data)
    test(test_data)