Hard to believe that autumn recruiting is right around the corner, I already have two algorithm internships behind me, and only now have I truly understood backpropagation. This world really is one giant ctbz…
First, the chain rule. Suppose $f(x)$ and $g(x)$ are differentiable and $h(x)=f(g(x))$. Writing $u=g(x)$, so that $h(x)=f(u)$, we have
$$\frac{\partial h(x)}{\partial x}=\frac{\partial f(u)}{\partial u}\,\frac{\partial g(x)}{\partial x}.$$
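As a quick one-variable sanity check (my own toy example, not part of the network below): take $f(u)=\sin u$ and $g(x)=x^2$, so $h(x)=\sin(x^2)$; then
$$\frac{\partial h(x)}{\partial x}=\cos(u)\cdot 2x=2x\cos(x^2),$$
which agrees with differentiating $h$ directly.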
So the derivative of a composite function can be computed by differentiating each piece and multiplying the results. The basic building block of a neural network is a linear layer followed by an activation function. For an input $x$, a two-layer network is
$$z^1=W^1x+b^1,\quad a^1=\sigma(z^1)$$
$$z^2=W^2a^1+b^2,\quad a^2=\sigma(z^2)$$
The final output is $y^{pred}=a^2$. Take the loss to be MSE, $L=\frac12(y^{pred}-y)^2$, where $y$ is the label, so $\frac{\partial L}{\partial y^{pred}}=y^{pred}-y$.
The parameters are initialized randomly and updated by gradient descent, $W^2=W^2-\alpha\frac{\partial L}{\partial W^2}$, where
$$\frac{\partial L}{\partial W^2}=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial W^2}=(y^{pred}-y)\,\sigma'(z^2)\,a^1,$$
and the remaining parameters are updated the same way, propagating the error backwards layer by layer:
$$\frac{\partial L}{\partial b^2}=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial b^2}=(y^{pred}-y)\,\sigma'(z^2)$$
$$\frac{\partial L}{\partial b^1}=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial a^1}\frac{\partial a^1}{\partial z^1}\frac{\partial z^1}{\partial b^1}=(y^{pred}-y)\,\sigma'(z^2)\,W^2\,\sigma'(z^1)$$
$$\frac{\partial L}{\partial W^1}=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial a^1}\frac{\partial a^1}{\partial z^1}\frac{\partial z^1}{\partial W^1}=(y^{pred}-y)\,\sigma'(z^2)\,W^2\,\sigma'(z^1)\,x$$
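As a sanity check on these scalar formulas, here is a minimal sketch (my own, separate from the implementation below; all parameter values are made up) that compares the hand-derived $\partial L/\partial W^1$ against a central finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, b1, W2, b2, x, y):
    # Scalar two-layer network with MSE loss, matching the formulas above.
    z1 = W1 * x + b1; a1 = sigmoid(z1)
    z2 = W2 * a1 + b2; a2 = sigmoid(z2)
    return 0.5 * (a2 - y) ** 2

def analytic_grad_W1(W1, b1, W2, b2, x, y):
    z1 = W1 * x + b1; a1 = sigmoid(z1)
    z2 = W2 * a1 + b2; a2 = sigmoid(z2)
    # (y_pred - y) * sigma'(z2) * W2 * sigma'(z1) * x, using sigma'(z) = a(1 - a)
    return (a2 - y) * a2 * (1 - a2) * W2 * a1 * (1 - a1) * x

W1, b1, W2, b2, x, y = 0.3, -0.1, 0.7, 0.2, 1.5, 1.0   # arbitrary values
eps = 1e-6
numeric = (loss(W1 + eps, b1, W2, b2, x, y) - loss(W1 - eps, b1, W2, b2, x, y)) / (2 * eps)
print(analytic_grad_W1(W1, b1, W2, b2, x, y), numeric)  # the two numbers should agree closely
```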
Implementation code:
```python
import numpy as np


class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        # Initialize weights
        self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size)
        # Initialize the biases
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        # x is already sigmoid(z), so sigma'(z) = x * (1 - x)
        return x * (1 - x)

    def feedforward(self, X):
        # Input to hidden
        self.hidden_activation = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = self.sigmoid(self.hidden_activation)
        # Hidden to output
        self.output_activation = np.dot(self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_activation)
        return self.predicted_output

    def backward(self, X, y, learning_rate):
        # Compute the output layer error; note the sign is (y - y_pred),
        # so the "+=" updates below are gradient descent on the MSE loss.
        output_error = y - self.predicted_output
        output_delta = output_error * self.sigmoid_derivative(self.predicted_output)
        # Compute the hidden layer error
        hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
        hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)
        # Update weights and biases
        self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) * learning_rate
        self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
        self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

    def train(self, X, y, epochs, learning_rate):
        for epoch in range(epochs):
            output = self.feedforward(X)
            self.backward(X, y, learning_rate)
            if epoch % 4000 == 0:
                loss = np.mean(np.square(y - output))
                print(f"Epoch {epoch}, Loss: {loss}")


# XOR training data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)

# Test the trained model
output = nn.feedforward(X)
print(output)
```
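One thing to note when mapping this code back onto the formulas: the code stores inputs as rows of `X` and the weights with shape `(input_size, hidden_size)`, so it computes `z = X @ W + b`; its weight matrices are therefore the transposes of the $W^1, W^2$ used in the column-vector derivation below. After training on XOR, the printed outputs should be close to $[0, 1, 1, 0]$, although without a fixed random seed the exact numbers vary from run to run and the network can occasionally get stuck.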
Where the transposes come from in the vector/matrix form of the derivatives
By default all vectors are column vectors. The input is $x \in \mathbb{R}^{d_{\text{in}}}$, the first-layer weights are $W^1 \in \mathbb{R}^{d_1 \times d_{\text{in}}}$, the second-layer weights are $W^2 \in \mathbb{R}^{d_2 \times d_1}$, and the biases are $b^1 \in \mathbb{R}^{d_1}$ and $b^2 \in \mathbb{R}^{d_2}$. The nonlinearity acts element-wise; the linear layers are matrix-vector products.
Forward pass:
$$z^1=W^1x+b^1,\quad a^1=\sigma(z^1)$$
$$z^2=W^2a^1+b^2,\quad a^2=\sigma(z^2)$$
so $z^1 \in \mathbb{R}^{d_1}$, $a^1 \in \mathbb{R}^{d_1}$, $z^2 \in \mathbb{R}^{d_2}$, $a^2 \in \mathbb{R}^{d_2}$.
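A minimal NumPy sketch of this forward pass under the column-vector convention (the sizes $d_{\text{in}}=3$, $d_1=4$, $d_2=2$ are arbitrary, just to make the shapes concrete):

```python
import numpy as np

d_in, d1, d2 = 3, 4, 2                      # arbitrary sizes for illustration
x = np.random.randn(d_in, 1)                # column-vector input
W1, b1 = np.random.randn(d1, d_in), np.random.randn(d1, 1)
W2, b2 = np.random.randn(d2, d1), np.random.randn(d2, 1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z1 = W1 @ x + b1; a1 = sigmoid(z1)          # shape (d1, 1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)         # shape (d2, 1)
print(z1.shape, a1.shape, z2.shape, a2.shape)   # (4, 1) (4, 1) (2, 1) (2, 1)
```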
For the backward pass, $y^{pred}=a^2$; take the loss to be MSE, $L=\frac12\lVert y^{pred}-y\rVert^2$, where $y$ is the label, so $\frac{\partial L}{\partial y^{pred}}=y^{pred}-y$. The update is $W^2=W^2-\alpha\frac{\partial L}{\partial W^2}$, so the gradient must have the same shape as $W^2$, namely $\mathbb{R}^{d_2 \times d_1}$.
$$\begin{align*}
\frac{\partial L}{\partial W^2}&=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial W^2}\\
&=\left[(y^{pred}-y) \odot \sigma'(z^2)\right] (a^1)^T \in \mathbb{R}^{d_2 \times 1} \times \mathbb{R}^{1 \times d_1}
\end{align*}$$
$$\begin{align*}
\frac{\partial L}{\partial b^2}&=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial b^2}\\
&=\left[(y^{pred}-y) \odot \sigma'(z^2)\right] \in \mathbb{R}^{d_2}
\end{align*}$$
$$\begin{align*}
\frac{\partial L}{\partial b^1}&=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial a^1}\frac{\partial a^1}{\partial z^1}\frac{\partial z^1}{\partial b^1}\\
&=\left[(W^2)^T \left[(y^{pred}-y) \odot \sigma'(z^2)\right]\right] \odot \sigma'(z^1)\\
&\in \mathbb{R}^{d_1 \times d_2} \times \mathbb{R}^{d_2 \times 1}
\end{align*}$$
$$\begin{align*}
\frac{\partial L}{\partial W^1}&=\frac{\partial L}{\partial y^{pred}}\frac{\partial y^{pred}}{\partial z^2}\frac{\partial z^2}{\partial a^1}\frac{\partial a^1}{\partial z^1}\frac{\partial z^1}{\partial W^1}\\
&=\left[\left[(W^2)^T \left[(y^{pred}-y) \odot \sigma'(z^2)\right]\right] \odot \sigma'(z^1)\right] x^T\\
&\in \mathbb{R}^{d_1 \times d_2} \times \mathbb{R}^{d_2 \times 1} \times \mathbb{R}^{1 \times d_{\text{in}}}
\end{align*}$$
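To convince myself the transposes are in the right places, a minimal sketch (my own; the dimensions, seed, and values are arbitrary) that checks the matrix-form $\partial L/\partial W^1$ against a finite-difference estimate on one entry:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(W1, b1, W2, b2, x, y):
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

d_in, d1, d2 = 3, 4, 2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d1, d_in)), rng.normal(size=(d1, 1))
W2, b2 = rng.normal(size=(d2, d1)), rng.normal(size=(d2, 1))
x, y = rng.normal(size=(d_in, 1)), rng.normal(size=(d2, 1))

# Analytic gradient: [[ (W2)^T ((y_pred - y) ⊙ σ'(z2)) ] ⊙ σ'(z1) ] x^T
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
delta2 = (a2 - y) * a2 * (1 - a2)            # (d2, 1)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # (d1, 1)
grad_W1 = delta1 @ x.T                       # (d1, d_in)

# Central finite-difference check on entry (i, j) of W1
i, j, eps = 1, 2, 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps; Wm[i, j] -= eps
numeric = (loss(Wp, b1, W2, b2, x, y) - loss(Wm, b1, W2, b2, x, y)) / (2 * eps)
print(grad_W1[i, j], numeric)                # should agree to several decimal places
```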