Explanation and Derivation of the Backpropagation (BP) Algorithm
Given the neural network structure (shown in the figure), with the following known conditions:
- $\mathbf{a}^{(j)} = f\left(\mathbf{z}^{(j)}\right)$
- $\mathbf{z}^{(j)} = \mathbf{W}^{(j)}\mathbf{a}^{(j-1)} + \mathbf{b}^{(j)}$, where $\theta^{(j)} = \left\{\mathbf{W}^{(j)}, \mathbf{b}^{(j)}\right\}$
For the figure above, if we want $\frac{\partial l}{\partial \theta^{(j)}}$, we can link $l$ and $\theta^{(j)}$ through $\mathbf{z}^{(j)}$: $\frac{\partial l}{\partial \theta^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j)}} \cdot \frac{\partial \mathbf{z}^{(j)}}{\partial \theta^{(j)}}$. The link between $l$ and $\mathbf{z}^{(j)}$ can in turn be established through $\mathbf{z}^{(j+1)}$: $\frac{\partial l}{\partial \mathbf{z}^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}} \cdot \frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{z}^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}} \cdot \frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{a}^{(j)}} \cdot \frac{\partial \mathbf{a}^{(j)}}{\partial \mathbf{z}^{(j)}}$. Combining these, we get $\frac{\partial l}{\partial \theta^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}} \cdot \frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{a}^{(j)}} \cdot \frac{\partial \mathbf{a}^{(j)}}{\partial \mathbf{z}^{(j)}} \cdot \frac{\partial \mathbf{z}^{(j)}}{\partial \theta^{(j)}}$ (the chain rule), and this differentiation is iterated backward layer by layer.
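The chain rule above can be checked numerically on a tiny scalar network. This is only a sketch; the two-layer sizes, tanh activation, squared loss, and all parameter values are assumptions for illustration, not from the text:

```python
import numpy as np

# Tiny scalar network: z1 = w1*x + b1, a1 = tanh(z1), z2 = w2*a1 + b2, l = z2^2
w1, b1, w2, b2, x = 0.5, 0.1, -0.3, 0.2, 1.5

def loss(w1):
    z1 = w1 * x + b1
    a1 = np.tanh(z1)
    z2 = w2 * a1 + b2
    return z2 ** 2

# Analytic gradient via the chain rule:
# dl/dw1 = dl/dz2 * dz2/da1 * da1/dz1 * dz1/dw1
z1 = w1 * x + b1
a1 = np.tanh(z1)
z2 = w2 * a1 + b2
dl_dw1 = 2 * z2 * w2 * (1 - a1 ** 2) * x

# Verify against a central finite difference
eps = 1e-6
num = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
assert abs(dl_dw1 - num) < 1e-8
```

Each factor in `dl_dw1` is one link of the chain in the formula above.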
Now let us look carefully at the following expression from above: $\frac{\partial l}{\partial \mathbf{z}^{(j)}} = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}} \cdot \frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{a}^{(j)}} \cdot \frac{\partial \mathbf{a}^{(j)}}{\partial \mathbf{z}^{(j)}}$
Here, $\frac{\partial \mathbf{z}^{(j+1)}}{\partial \mathbf{a}^{(j)}} = \mathbf{W}^{(j+1)}$ and $\frac{\partial \mathbf{a}^{(j)}}{\partial \mathbf{z}^{(j)}} = f'\left(\mathbf{z}^{(j)}\right)$. Substituting these two identities into the expression above yields a new formula:
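A reconstruction of that formula (the original rendering is missing here), written in the document's informal notation after the two substitutions:

```latex
\frac{\partial l}{\partial \mathbf{z}^{(j)}}
  = \frac{\partial l}{\partial \mathbf{z}^{(j+1)}}
    \cdot \mathbf{W}^{(j+1)}
    \cdot f'\!\left(\mathbf{z}^{(j)}\right)
```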
So what do $\frac{\partial l}{\partial \mathbf{W}^{(j)}}$ and $\frac{\partial l}{\partial \mathbf{b}^{(j)}}$ look like?
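Anticipating the result (the worked example that follows derives it concretely): since $\mathbf{z}^{(j)} = \mathbf{W}^{(j)}\mathbf{a}^{(j-1)} + \mathbf{b}^{(j)}$, and writing $\boldsymbol{\delta}^{(j)} = \frac{\partial l}{\partial \mathbf{z}^{(j)}}$, one obtains in vectorized form:

```latex
\frac{\partial l}{\partial \mathbf{W}^{(j)}}
  = \boldsymbol{\delta}^{(j)} \left(\mathbf{a}^{(j-1)}\right)^{\top},
\qquad
\frac{\partial l}{\partial \mathbf{b}^{(j)}}
  = \boldsymbol{\delta}^{(j)}
```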
Now let us analyze the backpropagation process for a somewhat more complex network structure:
Given the conditions:
- $l = l(h)$
- $h = f\left(w_{1,1}^{(3)} a_1^{(2)} + w_{2,1}^{(3)} a_2^{(2)}\right) = f\left(w_{1,1}^{(3)} f(z_1^{(2)}) + w_{2,1}^{(3)} f(z_2^{(2)})\right) = f\left(w_{1,1}^{(3)} f\big(w_{1,1}^{(2)} f(z_1^{(1)})\big) + w_{2,1}^{(3)} f\big(w_{2,1}^{(2)} f(z_1^{(1)})\big)\right)$
Now let $g_1(z_1^{(1)}) = w_{1,1}^{(3)} f\big(w_{1,1}^{(2)} f(z_1^{(1)})\big)$ and $g_2(z_1^{(1)}) = w_{2,1}^{(3)} f\big(w_{2,1}^{(2)} f(z_1^{(1)})\big)$, and rewrite the expression for $h$ above:
- $h = f\left(g_1(z_1^{(1)}) + g_2(z_1^{(1)})\right)$
Next, we compute $\frac{\partial h}{\partial z_1^{(1)}}$ and continue simplifying:
- $\frac{\partial h}{\partial z_1^{(1)}} = \frac{\partial h}{\partial g_1} \cdot \frac{\partial g_1}{\partial z_1^{(1)}} + \frac{\partial h}{\partial g_2} \cdot \frac{\partial g_2}{\partial z_1^{(1)}} = \frac{\partial h}{\partial z_1^{(2)}} w_{1,1}^{(2)} f'(z_1^{(1)}) + \frac{\partial h}{\partial z_2^{(2)}} w_{2,1}^{(2)} f'(z_1^{(1)}) = \left[\frac{\partial h}{\partial z_1^{(2)}} w_{1,1}^{(2)} + \frac{\partial h}{\partial z_2^{(2)}} w_{2,1}^{(2)}\right] f'(z_1^{(1)})$
- Defining $\delta_i^{(j)} = \frac{\partial h}{\partial z_i^{(j)}}$, this gives the recurrence relation: $\delta_1^{(1)} = \left[\delta_1^{(2)} w_{1,1}^{(2)} + \delta_2^{(2)} w_{2,1}^{(2)}\right] f'(z_1^{(1)})$
Finally, using the expression above, we obtain $\frac{\partial h}{\partial w_1^{(1)}}$ and $\frac{\partial h}{\partial b_1^{(1)}}$ as follows:
- $\frac{\partial h}{\partial w_1^{(1)}} = \frac{\partial h}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_1^{(1)}} = \delta_1^{(1)} a^{(0)} = \delta_1^{(1)} x_1$
- $\frac{\partial h}{\partial b_1^{(1)}} = \frac{\partial h}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial b_1^{(1)}} = \delta_1^{(1)}$
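The two gradients above can be sanity-checked numerically on the 1-2-1 example network. A minimal sketch, assuming sigmoid activations and hypothetical parameter values (neither is fixed by the text):

```python
import numpy as np

def f(z):                   # sigmoid, an assumed choice; the text keeps f generic
    return 1.0 / (1.0 + np.exp(-z))

def df(z):                  # sigmoid derivative f'(z) = f(z)(1 - f(z))
    s = f(z)
    return s * (1.0 - s)

# Hypothetical parameters for the 1-2-1 network of the example
x1 = 0.7
w1_1, b1_1   = 0.4, 0.1     # layer 1: z1_1 = w1_1 * x1 + b1_1
w11_2, w21_2 = 0.8, -0.5    # layer 2: z1_2 = w11_2*f(z1_1), z2_2 = w21_2*f(z1_1)
w11_3, w21_3 = 0.3, 0.9     # layer 3: h = f(w11_3*a1_2 + w21_3*a2_2)

def forward(w, b):
    z1 = w * x1 + b
    z1_2, z2_2 = w11_2 * f(z1), w21_2 * f(z1)
    return f(w11_3 * f(z1_2) + w21_3 * f(z2_2))

# Backward pass using the delta recurrence derived above
z1_1 = w1_1 * x1 + b1_1
z1_2, z2_2 = w11_2 * f(z1_1), w21_2 * f(z1_1)
z_3 = w11_3 * f(z1_2) + w21_3 * f(z2_2)
delta1_2 = df(z_3) * w11_3 * df(z1_2)      # delta at layer-2 neuron 1
delta2_2 = df(z_3) * w21_3 * df(z2_2)      # delta at layer-2 neuron 2
delta1_1 = (delta1_2 * w11_2 + delta2_2 * w21_2) * df(z1_1)

dh_dw = delta1_1 * x1       # dh/dw1_1 = delta1_1 * x1
dh_db = delta1_1            # dh/db1_1 = delta1_1

# Finite-difference check of both gradients
eps = 1e-6
num_w = (forward(w1_1 + eps, b1_1) - forward(w1_1 - eps, b1_1)) / (2 * eps)
num_b = (forward(w1_1, b1_1 + eps) - forward(w1_1, b1_1 - eps)) / (2 * eps)
assert abs(dh_dw - num_w) < 1e-8 and abs(dh_db - num_b) < 1e-8
```

The backward pass touches each layer once, exactly mirroring the recurrence $\delta_1^{(1)} = [\delta_1^{(2)} w_{1,1}^{(2)} + \delta_2^{(2)} w_{2,1}^{(2)}] f'(z_1^{(1)})$.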
By generalizing the relation between $\delta^{(j)}$ and $\delta^{(j+1)}$, we arrive at a particularly important formula, indeed the single most important BP formula:
- $\delta_i^{(j)} = f'\left(z_i^{(j)}\right) \cdot \left[\sum_{k=1}^{N_{j+1}} w_{k,i}^{(j+1)} \delta_k^{(j+1)}\right]$
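In vectorized form, this formula is an elementwise product with a matrix-vector product: $\delta^{(j)} = f'(z^{(j)}) \odot \big(W^{(j+1)\top} \delta^{(j+1)}\big)$. A minimal sketch comparing the explicit summation with the vectorized form (the layer sizes, sigmoid $f$, and random values are assumptions):

```python
import numpy as np

def dsigmoid(z):                            # f'(z) for sigmoid f
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n_j, n_j1 = 4, 3                            # neurons in layers j and j+1
W_next = rng.normal(size=(n_j1, n_j))       # W^(j+1): maps a^(j) -> z^(j+1)
z_j = rng.normal(size=n_j)                  # pre-activations of layer j
delta_next = rng.normal(size=n_j1)          # delta^(j+1), assumed already known

# Elementwise loop, mirroring the summation over k in the formula
delta_loop = np.array([
    dsigmoid(z_j[i]) * sum(W_next[k, i] * delta_next[k] for k in range(n_j1))
    for i in range(n_j)
])

# Equivalent vectorized form
delta_vec = dsigmoid(z_j) * (W_next.T @ delta_next)

assert np.allclose(delta_loop, delta_vec)
```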
As illustrated in the figure:
Here, $w_{k,i}^{(j+1)}$ is taken directly from the stored weight values, $\delta_k^{(j+1)}$ is obtained by backpropagation from $\delta^{(j+2)}$, and $f'(z_i^{(j)})$ is computed by plugging $z_i^{(j)}$ into the derivative of layer $j$'s activation function. Below are several common activation functions and their derivative formulas:
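The list of activation functions referenced here did not survive extraction; as a sketch, three standard choices and their derivatives (a common selection, not necessarily the author's original list):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):                 # sigma'(z) = sigma(z)(1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def dtanh(z):                    # tanh'(z) = 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(z, 0.0)

def drelu(z):                    # ReLU'(z): 1 for z > 0, else 0 (0 at z = 0 by convention)
    return (z > 0).astype(float)

# Finite-difference spot check of each derivative (points chosen away from 0)
z = np.array([-1.5, -0.2, 0.3, 2.0])
eps = 1e-6
for fn, dfn in [(sigmoid, dsigmoid), (np.tanh, dtanh), (relu, drelu)]:
    num = (fn(z + eps) - fn(z - eps)) / (2 * eps)
    assert np.allclose(dfn(z), num, atol=1e-6)
```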
But why do we need the BP algorithm in the first place?
Explanation:
- Without BP, computing the gradient for any one layer would require rerunning the full chain of derivatives through all the layers after it, and this would be repeated for the parameters of every neuron at every position: the amount of computation explodes!
- With BP, however, we only need to compute each layer's error term $\delta^{(j+1)}$ once, working backward from the output; we cache that $\delta^{(j+1)}$, use it to compute the previous layer's $\delta^{(j)}$, and iterate layer by layer.
- In essence, BP is dynamic programming. The core idea: save previously computed results so later computations can reuse them, discover the recurrence relation between them, and thereby drastically reduce the computational cost.
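This dynamic-programming view can be sketched as a single backward sweep that caches each layer's $\delta$ and reuses it for the layer before. The sigmoid activations, squared-error loss, and layer widths below are all assumptions for illustration:

```python
import numpy as np

def f(z):                                         # sigmoid (assumed activation)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [3, 4, 4, 2]                              # hypothetical layer widths
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]
x = rng.normal(size=sizes[0])
y = rng.normal(size=sizes[-1])

# Forward pass: cache every activation (the "results saved for later")
a, zs, activs = x, [], [x]
for W, b in zip(Ws, bs):
    z = W @ a + b
    a = f(z)
    zs.append(z)
    activs.append(a)

# Backward pass: one delta per layer, each step reusing the cached next delta
delta = (a - y) * a * (1 - a)                     # output delta for l = 0.5*||a-y||^2
grads_W, grads_b = [], []
for j in reversed(range(len(Ws))):
    grads_W.append(np.outer(delta, activs[j]))    # dl/dW^(j) = delta^(j) a^(j-1)^T
    grads_b.append(delta)                         # dl/db^(j) = delta^(j)
    if j > 0:
        s = f(zs[j - 1])
        delta = (Ws[j].T @ delta) * s * (1 - s)   # the BP recurrence, reusing delta
grads_W.reverse(); grads_b.reverse()

# Spot-check one weight entry against a finite difference
def loss(Ws):
    a = x
    for W, b in zip(Ws, bs):
        a = f(W @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][0, 0] += eps
Wm = [W.copy() for W in Ws]; Wm[0][0, 0] -= eps
num = (loss(Wp) - loss(Wm)) / (2 * eps)
assert abs(grads_W[0][0, 0] - num) < 1e-7
```

Each layer's $\delta$ is computed exactly once and reused, so the whole gradient costs about the same as one extra forward pass, instead of one full chain of derivatives per parameter.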