- In this post we walk through the derivative computations and backpropagation for softmax, softmax + cross-entropy, and a three-layer fully connected network.
Softmax
- Definition: $S(a_i) = \frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}}$
- Derivative computation (writing $S_i$ for $S(a_i)$):
$$
\text{if } i = k:\quad
\frac{\partial S_i}{\partial a_k}
= \frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}} + \frac{e^{a_i}\cdot(-1)\cdot e^{a_i}}{\left(\sum_{j=1}^N e^{a_j}\right)^2}
= \frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}}\left(1-\frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}}\right)
= S_i(1-S_i)
$$

$$
\text{if } i \neq k:\quad
\frac{\partial S_i}{\partial a_k}
= \frac{e^{a_i}\cdot(-1)\cdot e^{a_k}}{\left(\sum_{j=1}^N e^{a_j}\right)^2}
= -S_i S_k
$$
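The two cases combine into a single Jacobian entry $\frac{\partial S_i}{\partial a_k} = S_i(\delta_{ik} - S_k)$. A minimal NumPy sketch (function names are my own) that checks this analytic Jacobian against central finite differences:

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    # J[i, k] = dS_i/da_k = S_i * (delta_ik - S_k)
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)

a = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(a)

# Central-difference check, one input dimension at a time.
eps = 1e-6
J_num = np.zeros_like(J)
for k in range(len(a)):
    d = np.zeros_like(a)
    d[k] = eps
    J_num[:, k] = (softmax(a + d) - softmax(a - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-8))
```

Note that the Jacobian is symmetric here, since $S_i(\delta_{ik} - S_k) = S_k(\delta_{ki} - S_i)$.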
Softmax+CrossEntropy
- Definition: $L_{ce} = \sum_{i=1}^N -y_i \log(S_i)$, where $y$ is the one-hot ground-truth label.
- Derivative computation:
$$
\frac{\partial L_{ce}}{\partial a_k}
= \sum_{i=1}^N -y_i \frac{\partial \log(S_i)}{\partial a_k}
= -y_k \frac{1}{S_k}\, S_k (1-S_k) - \sum_{i \neq k} y_i \frac{1}{S_i} (-S_i S_k)
$$

$$
= -y_k (1-S_k) + \sum_{i \neq k} y_i S_k
= -y_k + S_k \sum_{i=1}^N y_i
= S_k - y_k
$$

where the last step uses $\sum_{i=1}^N y_i = 1$.

- Interpretation: for the softmax + cross-entropy combination, the gradient with respect to the $k$-th pre-softmax input is simply the difference between the $k$-th softmax output and the ground truth.
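A quick numerical sanity check of the $S_k - y_k$ result (helper names are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))      # shift by max for numerical stability
    return e / e.sum()

def cross_entropy_loss(a, y):
    # L_ce = -sum_i y_i * log(S_i), applied on top of softmax
    return -np.sum(y * np.log(softmax(a)))

a = np.array([1.0, 2.0, 0.5])      # pre-softmax logits
y = np.array([0.0, 1.0, 0.0])      # one-hot ground truth

grad = softmax(a) - y              # analytic gradient: dL/da_k = S_k - y_k

# Central-difference check of every component.
eps = 1e-6
grad_num = np.array([
    (cross_entropy_loss(a + eps * np.eye(3)[k], y)
     - cross_entropy_loss(a - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(grad, grad_num, atol=1e-8))
```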
Derivatives for a Three-Layer Fully Connected Network
- The network consists of an input layer $x$ with $d$ dimensions, a hidden layer with $h$ units, and an output layer $z$.
- We first define the following notation:
  - $L = \frac{1}{2}\,|y - z|^2$, where $y$ is the target output and $z$ the network output
  - $z_k = f\left(\sum_{j=1}^h w_{kj} y_j\right)$, where $f$ is the nonlinear activation function, e.g. the sigmoid
  - $net_k = \sum_{j=1}^h w_{kj} y_j$
  - $y_j = f\left(\sum_{i=1}^d w_{ji} x_i\right)$, the activation of the $j$-th hidden unit
  - $net_j = \sum_{i=1}^d w_{ji} x_i$
  - $x_i$ represents the input value on the $i$-th dimension
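Under these definitions the forward pass can be sketched as follows, assuming $f$ is the sigmoid and using illustrative layer sizes and random weights:

```python
import numpy as np

def sigmoid(x):
    # Elementwise logistic function, used here as the nonlinearity f
    return 1.0 / (1.0 + np.exp(-x))

d, h, c = 4, 3, 2                  # input, hidden, output sizes (illustrative)
rng = np.random.default_rng(0)
W_ji = rng.normal(size=(h, d))     # hidden-layer weights w_ji
W_kj = rng.normal(size=(c, h))     # output-layer weights w_kj

x = rng.normal(size=d)             # input vector x_i
net_j = W_ji @ x                   # net_j = sum_i w_ji x_i
y_hid = sigmoid(net_j)             # hidden activations y_j
net_k = W_kj @ y_hid               # net_k = sum_j w_kj y_j
z = sigmoid(net_k)                 # outputs z_k
```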
- Next we compute the derivatives with respect to $w_{kj}$ and $w_{ji}$ in turn.
- Derivation for $w_{kj}$:
$$
\frac{\partial L}{\partial w_{kj}}
= (y_k - z_k)\cdot(-1)\cdot\frac{\partial z_k}{\partial w_{kj}}
= -(y_k - z_k)\,\frac{\partial z_k}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}}
= -(y_k - z_k)\,f'(net_k)\,y_j
$$

$$
\therefore\quad
w_{kj} \leftarrow w_{kj} - \alpha\,\frac{\partial L}{\partial w_{kj}}
= w_{kj} + \alpha\,(y_k - z_k)\,f'(net_k)\,y_j
$$

- Derivation for $w_{ji}$:
$$
\frac{\partial L}{\partial w_{ji}}
= \frac{\partial L}{\partial y_j}\,\frac{\partial y_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}
$$

$$
\frac{\partial L}{\partial y_j}
= \sum_{k} \frac{\partial L}{\partial net_k}\,\frac{\partial net_k}{\partial y_j}
= \sum_{k} -(y_k - z_k)\,f'(net_k)\,w_{kj}
$$

$$
\therefore\quad
\frac{\partial L}{\partial w_{ji}}
= \left(\sum_{k} -(y_k - z_k)\,f'(net_k)\,w_{kj}\right) f'(net_j)\,x_i
$$

where the sum over $k$ runs over all output units.

- At this point we have the gradients of all parameters in both layers, and backpropagation applies these updates layer by layer, from the output back to the input.
- As for $f'(net_k)$: when $f$ is the sigmoid, the computation parallels the $i = k$ case of the softmax gradient above, giving $f'(net_k) = f(net_k)\,(1 - f(net_k)) = z_k(1 - z_k)$.
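Putting the two gradient formulas together, here is a NumPy sketch of the full backward pass (sigmoid activations assumed, function and variable names are my own), verified against finite differences on one entry of each weight matrix:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W_kj, W_ji, x, y):
    # Forward pass: y_hid = f(W_ji x), z = f(W_kj y_hid), L = 1/2 |y - z|^2
    y_hid = sigmoid(W_ji @ x)
    z = sigmoid(W_kj @ y_hid)
    return 0.5 * np.sum((y - z) ** 2), y_hid, z

rng = np.random.default_rng(1)
d, h, c = 4, 3, 2                      # illustrative layer sizes
W_ji = rng.normal(size=(h, d))
W_kj = rng.normal(size=(c, h))
x = rng.normal(size=d)
y = rng.normal(size=c)                 # target

L, y_hid, z = forward(W_kj, W_ji, x, y)

# dL/dw_kj = -(y_k - z_k) * f'(net_k) * y_j, with f'(net_k) = z_k (1 - z_k)
delta_k = -(y - z) * z * (1 - z)
grad_W_kj = np.outer(delta_k, y_hid)

# dL/dw_ji = (sum_k delta_k * w_kj) * f'(net_j) * x_i
delta_j = (W_kj.T @ delta_k) * y_hid * (1 - y_hid)
grad_W_ji = np.outer(delta_j, x)

# Finite-difference check on one entry of each weight matrix.
eps = 1e-6
Wp, Wm = W_kj.copy(), W_kj.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num_kj = (forward(Wp, W_ji, x, y)[0] - forward(Wm, W_ji, x, y)[0]) / (2 * eps)

Wp, Wm = W_ji.copy(), W_ji.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num_ji = (forward(W_kj, Wp, x, y)[0] - forward(W_kj, Wm, x, y)[0]) / (2 * eps)

print(np.isclose(grad_W_kj[0, 0], num_kj), np.isclose(grad_W_ji[0, 0], num_ji))
```

Here `delta_k` and `delta_j` are the per-layer error terms that backpropagation passes from the output layer back to the hidden layer; each weight gradient is just the outer product of its layer's error term with that layer's input.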