A Derivation of Backpropagation in Matrix Form (repost)

This article explains in detail how to derive the backpropagation algorithm in matrix form, one of the key steps in training deep neural networks. It first outlines the basic principles of backpropagation, then walks through the forward and backward passes of an example neural network with two hidden layers, and also discusses how the weight updates are computed.


Backpropagation is an algorithm used to train neural networks, together with an optimization routine such as gradient descent. Gradient descent requires the gradient of the loss function with respect to all the weights in the network in order to perform a weight update that minimizes the loss. Backpropagation computes these gradients in a systematic way. Backpropagation along with gradient descent is arguably the single most important algorithm for training deep neural networks and could be said to be the driving force behind the recent emergence of deep learning.

Any layer of a neural network can be considered as an affine transformation followed by the application of a non-linear function. A vector is received as input and multiplied with a matrix to produce an output, to which a bias vector may be added before passing the result through an activation function such as the sigmoid.

Input = x
Output = f(Wx + b)
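As a concrete illustration, here is a minimal NumPy sketch of a single layer of this form; the layer sizes, variable names and sigmoid activation are my own choices, not something specified in the article:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, x, b, f):
    """One layer: affine transformation Wx + b followed by an activation f."""
    return f(W @ x + b)

# Hypothetical sizes: a 4-dimensional input mapped to 3 output units.
W = np.random.randn(3, 4)
b = np.zeros((3, 1))
x = np.random.randn(4, 1)

output = layer_forward(W, x, b, sigmoid)   # shape (3, 1)
```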

Consider a neural network with two hidden layers like this one. It has no bias units. We derive the forward and backward pass equations in their matrix form.

[Figure: Neural Network]

The forward propagation equations are as follows:

Input = x_0
Hidden Layer 1 output = x_1 = f_1(W_1 x_0)
Hidden Layer 2 output = x_2 = f_2(W_2 x_1)
Output = x_3 = f_3(W_3 x_2)
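As a sketch, the forward pass above translates almost line by line into NumPy. The hidden and output sizes (5, 3 and 2) are taken from the dimension checks later in the article; the input size and the choice of sigmoid activations are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: input -> hidden 1 -> hidden 2 -> output.
n0, n1, n2, n3 = 4, 5, 3, 2        # n0 is an arbitrary choice
W1 = np.random.randn(n1, n0)
W2 = np.random.randn(n2, n1)
W3 = np.random.randn(n3, n2)

x0 = np.random.randn(n0, 1)        # input
x1 = sigmoid(W1 @ x0)              # hidden layer 1 output
x2 = sigmoid(W2 @ x1)              # hidden layer 2 output
x3 = sigmoid(W3 @ x2)              # network output
```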

To train this neural network, you could use either Batch gradient descent or Stochastic gradient descent. Stochastic gradient descent uses a single instance of data to perform weight updates, whereas Batch gradient descent uses a complete batch of data.

For simplicity, let's assume this is a multiple regression problem.

Stochastic update loss function: E = ½ ‖z − t‖_2^2

Batch update loss function: E = ½ Σ_{i ∈ Batch} ‖z_i − t_i‖_2^2

Here t is the ground truth for that instance and z is the output of the network (z = x_3 in our example).

We will only consider the stochastic update loss function. All the results hold for the batch version as well.
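In code, the stochastic loss for one instance is a one-liner; the values below are made up for illustration, with x3 standing in for the network output z:

```python
import numpy as np

x3 = np.array([[0.7], [0.2]])     # hypothetical network output (2x1)
t = np.array([[1.0], [0.0]])      # ground truth for this instance (2x1)

E = 0.5 * np.sum((x3 - t) ** 2)   # E = 1/2 * ||x3 - t||_2^2
```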

Let us look at the loss function from a different perspective. Given an input x_0, the output x_3 is determined by W_1, W_2 and W_3. So the only tuneable parameters in E are W_1, W_2 and W_3. To reduce the value of the error function, we have to change these weights in the negative direction of the gradient of the loss function with respect to these weights.

w = w − α_w ∂E/∂w    for all the weights w

Here α_w is a scalar for this particular weight, called the learning rate. Its value is decided by the optimization technique used. I highly recommend reading An overview of gradient descent optimization algorithms for more information about various gradient descent techniques and learning rates.

Backpropagation equations can be derived by repeatedly applying the chain rule. First we derive these for the weights in W_3:

∂E/∂W_3 = (x_3 − t) ∂x_3/∂W_3
        = [(x_3 − t) ∘ f_3'(W_3 x_2)] ∂(W_3 x_2)/∂W_3
        = [(x_3 − t) ∘ f_3'(W_3 x_2)] x_2^T

Let δ_3 = (x_3 − t) ∘ f_3'(W_3 x_2). Then

∂E/∂W_3 = δ_3 x_2^T

Here ∘ is the Hadamard (element-wise) product. Let's sanity check this by looking at the dimensionalities. ∂E/∂W_3 must have the same dimensions as W_3. W_3's dimensions are 2×3. The dimensions of (x_3 − t) are 2×1 and f_3'(W_3 x_2) is also 2×1, so δ_3 is also 2×1. x_2 is 3×1, so the dimensions of δ_3 x_2^T are 2×3, which is the same as W_3.
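The same sanity check can be run directly in NumPy. This is a sketch with random values and a sigmoid chosen as an example of f_3; only the shapes matter here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

W3 = np.random.randn(2, 3)                   # 2x3, as in the text
x2 = np.random.randn(3, 1)                   # 3x1
t = np.random.randn(2, 1)                    # 2x1 target

x3 = sigmoid(W3 @ x2)                        # 2x1 output
delta3 = (x3 - t) * sigmoid_prime(W3 @ x2)   # Hadamard product, 2x1
grad_W3 = delta3 @ x2.T                      # outer product, 2x3

assert delta3.shape == (2, 1)
assert grad_W3.shape == W3.shape             # same shape as W3, as required
```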

Now for the weights in W_2:

∂E/∂W_2 = (x_3 − t) ∂x_3/∂W_2
        = [(x_3 − t) ∘ f_3'(W_3 x_2)] ∂(W_3 x_2)/∂W_2
        = δ_3 ∂(W_3 x_2)/∂W_2
        = W_3^T δ_3 ∂x_2/∂W_2
        = [W_3^T δ_3 ∘ f_2'(W_2 x_1)] ∂(W_2 x_1)/∂W_2
        = δ_2 x_1^T

where δ_2 = W_3^T δ_3 ∘ f_2'(W_2 x_1).

Let's sanity check this too. W_2's dimensions are 3×5. δ_3 is 2×1 and W_3 is 2×3, so W_3^T δ_3 is 3×1. f_2'(W_2 x_1) is 3×1, so δ_2 is also 3×1. x_1 is 5×1, so δ_2 x_1^T is 3×5. So this checks out to be the same.

Similarly for W_1:

∂E/∂W_1 = [W_2^T δ_2 ∘ f_1'(W_1 x_0)] x_0^T = δ_1 x_0^T

We can observe a recursive pattern emerging in the backpropagation equations. The Forward and Backward passes can be summarized as below:

The neural network has L layers. x_0 is the input vector, x_L is the output vector and t is the truth vector. The weight matrices are W_1, W_2, ..., W_L and the activation functions are f_1, f_2, ..., f_L.

Forward Pass:

x_i = f_i(W_i x_{i−1})
E = ½ ‖x_L − t‖_2^2

Backward Pass:

δ_L = (x_L − t) ∘ f_L'(W_L x_{L−1})
δ_i = W_{i+1}^T δ_{i+1} ∘ f_i'(W_i x_{i−1})

Weight Update:

∂E/∂W_i = δ_i x_{i−1}^T
W_i = W_i − α_{W_i} ∘ ∂E/∂W_i
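Putting the three boxes together, a minimal NumPy sketch of one stochastic training step could look like the following. Sigmoid activations in every layer and a single scalar learning rate are simplifying assumptions on my part (the article allows a per-weight α):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def train_step(weights, x0, t, alpha=0.1):
    """One stochastic gradient descent step for an L-layer network without biases."""
    # Forward pass: x_i = f_i(W_i x_{i-1}); keep the pre-activations for f'.
    xs, zs = [x0], []
    for W in weights:
        zs.append(W @ xs[-1])
        xs.append(sigmoid(zs[-1]))

    # Backward pass: delta_L = (x_L - t) o f_L', then recurse backwards.
    delta = (xs[-1] - t) * sigmoid_prime(zs[-1])
    grads = [None] * len(weights)
    grads[-1] = delta @ xs[-2].T
    for i in range(len(weights) - 2, -1, -1):
        delta = (weights[i + 1].T @ delta) * sigmoid_prime(zs[i])
        grads[i] = delta @ xs[i].T

    # Weight update: W_i = W_i - alpha * dE/dW_i.
    return [W - alpha * g for W, g in zip(weights, grads)]

# Hypothetical sizes matching the example network: input 4, hidden 5 and 3, output 2.
sizes = [4, 5, 3, 2]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
x0, t = np.random.randn(4, 1), np.random.randn(2, 1)
weights = train_step(weights, x0, t)
```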

The equations for backpropagation, represented using matrices, have two advantages.

One could easily convert these equations to code using either NumPy in Python or Matlab. It is much closer to the way neural networks are implemented in libraries. Using matrix operations speeds up the implementation, as one can use high-performance matrix primitives from BLAS. GPUs are also well suited for these computations, since matrix operations parallelize naturally.

The matrix version of backpropagation is intuitive to derive and easy to remember, as it avoids the confusing and cluttered derivations involving summations and multiple subscripts.
