Machine Learning (3): Backpropagation Derivation & Neural Networks

This post walks through the backpropagation algorithm for neural networks: it derives the weight-update rule mathematically, covering every step from the loss function to the gradient computation, and then applies the result in a small neural network implementation.


Backpropagation Derivation & Neural Networks

Notation

The neural network used in this section is shown in the figure below.
[Figure: the neural network]
where:

  • There are three layers in total, numbered 0, 1, 2 (layer 0 is the input layer).
  • The input is a vector $x$ with two components, $x_1$ and $x_2$.
  • $a^{(i)}$ denotes the output vector of layer $i$, and $a_j^{(i)}$ denotes the $j$-th component of that vector. Note that $x = a^{(0)}$.
  • $W^{(i)}$ denotes the weights (edges) connecting layer $i-1$ to layer $i$; $w_{jk}^{(i)}$ is the weight of the edge between node $j$ in layer $i$ and node $k$ in layer $i-1$. Note that the index pair $jk$ is read from right to left: $k$ indexes the source node in layer $i-1$ and $j$ the destination node in layer $i$.
  • $z^{(i)} = W^{(i)} \times a^{(i-1)}$
  • $a^{(i)} = \mathrm{sigmoid}(z^{(i)})$, i.e. the activation function used throughout this network is the sigmoid.
  • $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
  • The loss function is the cross-entropy $J(W, x, y) = -\sum_{i=1}^{2}\left[y \ln(a_i^{(2)}) + (1-y)\ln(1 - a_i^{(2)})\right]$
  • For simplicity, bias terms are ignored in the derivation (the implementation at the end does include them). A small numeric sketch of these definitions follows this list.
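To make these definitions concrete, here is a minimal NumPy sketch of one forward pass and the loss for this 2-2-2 network; the weight values and the input are made up purely for illustration:

import numpy as np

def sigmoid(x):
	return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values; in practice the weights are learned.
x = np.array([[0.3], [0.7]])              # a^(0): input column vector (2 x 1)
W1 = np.array([[0.1, 0.4],
               [0.2, 0.5]])               # W^(1): connects layer 0 -> layer 1
W2 = np.array([[0.3, 0.6],
               [0.7, 0.1]])               # W^(2): connects layer 1 -> layer 2
y = 1                                     # target label (0 or 1)

z1 = W1 @ x                               # z^(1) = W^(1) a^(0)
a1 = sigmoid(z1)                          # a^(1)
z2 = W2 @ a1                              # z^(2) = W^(2) a^(1)
a2 = sigmoid(z2)                          # a^(2): network output (2 x 1)

# Cross-entropy loss summed over the two output units
J = -np.sum(y * np.log(a2) + (1 - y) * np.log(1 - a2))
print(a2.ravel(), J)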

Derivation

Our goal is to minimize $J(W, x, y)$. Here $W$ stands for all the weights, while $x$ and $y$ come from the training data and are therefore constants, so we need $\frac{\partial J}{\partial W}$; concretely, we need $\frac{\partial J}{\partial W^{(2)}}$ and $\frac{\partial J}{\partial W^{(1)}}$:

$$\frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial W^{(2)}}$$

$$\frac{\partial J}{\partial a^{(2)}} = \left(-y\ln(a^{(2)}) - (1-y)\ln(1-a^{(2)})\right)' = -\frac{y}{a^{(2)}} + \frac{1-y}{1 - a^{(2)}}$$

$$\frac{\partial a^{(2)}}{\partial z^{(2)}} = a^{(2)} \cdot (1-a^{(2)})$$
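This is just the standard derivative of the sigmoid; writing it out once from the definition:

$$\sigma'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(z)\,(1-\sigma(z))$$

and since $a^{(2)} = \sigma(z^{(2)})$, this gives the $a^{(2)}(1-a^{(2)})$ factor above.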

Multiplying the expressions for $\frac{\partial J}{\partial a^{(2)}}$ and $\frac{\partial a^{(2)}}{\partial z^{(2)}}$ gives

$$\frac{\partial J}{\partial z^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} = -y \cdot (1-a^{(2)}) + (1-y)\cdot a^{(2)}$$

Since $y$ can only take the value 0 or 1: when $y = 0$, the expression above equals

$$a^{(2)} = a^{(2)} - 0 = a^{(2)} - y$$

and when $y = 1$, it equals

$$-(1 - a^{(2)}) = a^{(2)} - 1 = a^{(2)} - y$$

So in both cases

$$\frac{\partial J}{\partial z^{(2)}} = a^{(2)} - y$$

We call this quantity the error term of the output layer:

$$\delta^{(2)} = \frac{\partial J}{\partial z^{(2)}} = a^{(2)} - y$$

Since

$$z^{(2)} = W^{(2)} \times a^{(1)}$$

we have

$$\frac{\partial z^{(2)}}{\partial W^{(2)}} = a^{(1)}$$

Putting it all together,

$$\frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial W^{(2)}} = \delta^{(2)} \times a^{(1)T}$$

where the transpose on $a^{(1)}$ makes the result a matrix of the same shape as $W^{(2)}$ ($\delta^{(2)}$ is a column vector, $a^{(1)T}$ a row vector).
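As a quick sanity check, the analytic gradient $\delta^{(2)} \times a^{(1)T}$ can be compared against a finite-difference estimate of $\frac{\partial J}{\partial W^{(2)}}$. The sketch below reuses the made-up values from the earlier example (the loss helper exists only for this check):

import numpy as np

def sigmoid(x):
	return 1.0 / (1.0 + np.exp(-x))

def loss(W1, W2, x, y):
	# Forward pass followed by the cross-entropy loss from the Notation section.
	a1 = sigmoid(W1 @ x)
	a2 = sigmoid(W2 @ a1)
	return -np.sum(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# Same hypothetical values as before.
x = np.array([[0.3], [0.7]])
W1 = np.array([[0.1, 0.4], [0.2, 0.5]])
W2 = np.array([[0.3, 0.6], [0.7, 0.1]])
y = 1

a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
analytic = (a2 - y) @ a1.T                # delta^(2) x a^(1)T

numeric = np.zeros_like(W2)
eps = 1e-6
for j in range(W2.shape[0]):
	for k in range(W2.shape[1]):
		Wp = W2.copy(); Wp[j, k] += eps
		Wm = W2.copy(); Wm[j, k] -= eps
		numeric[j, k] = (loss(W1, Wp, x, y) - loss(W1, Wm, x, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very close to zero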


Now we compute $\frac{\partial J}{\partial W^{(1)}}$:

$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial W^{(1)}}$$

Since

$$z^{(2)} = W^{(2)} \times a^{(1)}$$

we have

$$\frac{\partial z^{(2)}}{\partial a^{(1)}} = W^{(2)}$$

and therefore

$$\frac{\partial J}{\partial a^{(1)}} = \frac{\partial J}{\partial z^{(2)}} \times \frac{\partial z^{(2)}}{\partial a^{(1)}} = W^{(2)T} \times \delta^{(2)}$$

(the transpose appears so that the shapes match: $W^{(2)T}$ maps the output-layer error back onto the hidden layer).

Since

$$\frac{\partial a^{(1)}}{\partial z^{(1)}} = a^{(1)} \cdot (1-a^{(1)})$$

(where $\cdot$ denotes element-wise multiplication), we have

$$\frac{\partial J}{\partial z^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} = \left(W^{(2)T} \times \delta^{(2)}\right) \cdot a^{(1)} \cdot (1 - a^{(1)})$$

$$\delta^{(1)} = \frac{\partial J}{\partial z^{(1)}} = \left(W^{(2)T} \times \delta^{(2)}\right) \cdot a^{(1)} \cdot (1 - a^{(1)})$$

Also, since

$$z^{(1)} = W^{(1)} \times a^{(0)}$$

we have

$$\frac{\partial z^{(1)}}{\partial W^{(1)}} = a^{(0)}$$

Putting it all together,

$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial a^{(1)}} \times \frac{\partial a^{(1)}}{\partial z^{(1)}} \times \frac{\partial z^{(1)}}{\partial W^{(1)}} = \delta^{(1)} \times a^{(0)T}$$
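The same pattern generalizes to a network with any number of layers, and this is exactly what the implementation below relies on: the error of a layer is obtained from the error of the layer above it, and each weight gradient is the outer product of that error with the previous layer's activation:

$$\delta^{(L)} = a^{(L)} - y, \qquad \delta^{(l)} = \left(W^{(l+1)T} \times \delta^{(l+1)}\right) \cdot a^{(l)} \cdot (1 - a^{(l)}), \qquad \frac{\partial J}{\partial W^{(l)}} = \delta^{(l)} \times a^{(l-1)T}$$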

Neural Network Implementation

import numpy as np
import matplotlib.pyplot as plt 

def init_model(X, Y, layers):
	# Build a dict holding the weight matrix W and bias vector B of every layer.
	# `layers` lists the hidden-layer sizes; input/output sizes come from X and Y.
	model = {}
	nodes = []
	nodes.append(X.shape[0])
	for l in layers:
		nodes.append(l)
	nodes.append(Y.shape[0])
	model['depth'] = len(layers) + 1
	for n in range(model['depth']):
		# W^(n+1) has shape (size of layer n+1, size of layer n), cf. z = W a
		model['W' + str(n + 1)] = np.random.rand(nodes[n+1], nodes[n])
		model['B' + str(n + 1)] = np.random.rand(nodes[n+1], 1)
	return model

def sigmoid(x):
	return 1.0 / (1 + np.exp(-x))

def forward_propagation(X, model):
	# Forward pass; every Z and A is cached in the model dict for backpropagation.
	result = X.copy()
	model['A0'] = result.copy()
	for i in range(model['depth']):
		W = model['W' + str(i+1)]
		B = model['B' + str(i+1)]
		result = np.dot(W, result) + B    # Z^(i+1) = W^(i+1) A^(i) + B^(i+1)
		model['Z' + str(i+1)] = result.copy()
		result = sigmoid(result)          # A^(i+1) = sigmoid(Z^(i+1))
		model['A' + str(i+1)] = result.copy()
	return result

def cross_entropy(Y, result):
	# Element-wise cross-entropy between the labels Y and the network output.
	return -Y * np.log(result) - (1 - Y) * np.log(1 - result)

def cost(Y, result):
	# Total cross-entropy averaged over the m training examples.
	return np.sum(cross_entropy(Y, result)) / Y.shape[1]

def backward_propagation(Y, model, learning_rate):
	L = model['depth']
	m = Y.shape[1]
	# Output layer: delta^(L) = A^(L) - Y
	delta = model['A' + str(L)] - Y
	model['dW' + str(L)] = np.dot(delta, model['A' + str(L-1)].T) / m
	model['dB' + str(L)] = np.sum(delta, axis=1, keepdims=True) / m
	for l in range(L - 1, 0, -1):
		# delta^(l) = (W^(l+1)^T delta^(l+1)) * A^(l) * (1 - A^(l))
		delta = np.dot(model['W' + str(l+1)].T, delta) * model['A' + str(l)] * (1 - model['A' + str(l)])
		model['dW' + str(l)] = np.dot(delta, model['A' + str(l-1)].T) / m
		model['dB' + str(l)] = np.sum(delta, axis=1, keepdims=True) / m

	# Gradient-descent step on every layer.
	for l in range(model['depth'], 0, -1):
		model['W' + str(l)] -= learning_rate * model['dW' + str(l)]
		model['B' + str(l)] -= learning_rate * model['dB' + str(l)]

def train(X, Y, model, learning_rate = 0.05, times = 50, show_cost_history = True):
	# Run `times` rounds of forward + backward propagation and record the cost.
	cost_history = []
	for t in range(times):
		result = forward_propagation(X, model)
		backward_propagation(Y, model, learning_rate)
		cost_history.append(cost(Y, result))
	if show_cost_history:
		plt.plot(cost_history)
		plt.show()

if __name__ == '__main__':
	# Toy dataset: the label is 1 iff both features are below 0.5.
	m = 1000
	X = np.random.rand(2, m)
	Y = np.zeros((1, m))
	for i in range(m):
		if (X[0, i] < 0.5) and (X[1, i] < 0.5):
			Y[0, i] = 1

	# One hidden layer with 2 nodes, mirroring the network from the derivation.
	model = init_model(X, Y, [2])
	train(X, Y, model)
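Once training has finished, the same forward_propagation function can be reused for prediction. A minimal sketch, assuming the code above has just run so that X, Y and model are still in scope (the predict helper and the 0.5 threshold are illustrative, not part of the original code):

def predict(X, model, threshold = 0.5):
	# Run a forward pass and threshold the sigmoid outputs to get 0/1 labels.
	probabilities = forward_propagation(X, model)
	return (probabilities > threshold).astype(int)

# Evaluate on the training set itself (this toy example has no separate test split).
predictions = predict(X, model)
accuracy = np.mean(predictions == Y)
print('training accuracy:', accuracy)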


