- In this post we walk through the derivative computations and backpropagation for softmax, softmax + cross entropy, and a three-layer fully connected network.
Softmax
- Definition: $S(a_i) = \frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}}$
- Derivative computation (let $S_i$ denote $S(a_i)$):
$$
\text{if } i = k:\quad
\frac{\partial S_i}{\partial a_k}
= \frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}} - \frac{e^{a_i} \cdot e^{a_i}}{\left(\sum_{j=1}^N e^{a_j}\right)^2}
= \frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}} \left(1 - \frac{e^{a_i}}{\sum_{j=1}^N e^{a_j}}\right)
= S_i(1 - S_i)
$$

$$
\text{if } i \neq k:\quad
\frac{\partial S_i}{\partial a_k}
= \frac{-\,e^{a_i} \cdot e^{a_k}}{\left(\sum_{j=1}^N e^{a_j}\right)^2}
= -S_i S_k
$$
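The two cases above together give the softmax Jacobian $J_{ik} = S_i(\delta_{ik} - S_k)$. A minimal NumPy sketch (function names are my own) that encodes both cases and can be checked against finite differences:

```python
import numpy as np

def softmax(a):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(a - a.max())
    return e / e.sum()

def softmax_jacobian(a):
    # J[i, k] = dS_i/da_k = S_i * (delta_ik - S_k):
    # S_i * (1 - S_i) on the diagonal, -S_i * S_k off the diagonal.
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)
```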
Softmax+CrossEntropy
- Definition: $L_{ce} = \sum_{i=1}^N -y_i \log(S_i)$
- Derivative computation:
$$
\frac{\partial L_{ce}}{\partial a_k}
= \sum_{i=1}^N -y_i \frac{\partial \log(S_i)}{\partial a_k}
= -y_k \cdot \frac{1}{S_k} \cdot S_k(1 - S_k) - \sum_{i \neq k} y_i \cdot \frac{1}{S_i} \cdot (-S_i S_k)
$$
$$
= -y_k(1 - S_k) + \sum_{i \neq k} y_i S_k
= -y_k + \sum_{i=1}^N y_i S_k
= S_k - y_k
$$

where the last step uses $\sum_{i=1}^N y_i = 1$ (e.g. a one-hot label).

- Interpretation: when softmax is combined with cross entropy, the gradient with respect to the k-th pre-softmax input is simply the difference between the k-th softmax output and the ground truth.
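A quick numerical check of the closed-form gradient $\partial L_{ce}/\partial a_k = S_k - y_k$ (a sketch; the helper names are my own):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def cross_entropy_loss(a, y):
    # L_ce = -sum_i y_i * log(S_i)
    return -np.sum(y * np.log(softmax(a)))

def cross_entropy_grad(a, y):
    # Closed form from the derivation above: dL/da_k = S_k - y_k
    return softmax(a) - y
```

Compared against central finite differences of `cross_entropy_loss`, the closed form matches to numerical precision.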
Derivatives for a three-layer fully connected network
- The network structure: an input layer $x$, a hidden layer $y$, and an output layer $z$, fully connected (figure omitted).
- We first define the following notation:
- $L = \frac{1}{2}|y - z|^2$, where $y$ is the target (distinct from the hidden activations $y_j$ below) and $z$ is the network output
- $z_k = f(\sum_{j=1}^h w_{kj} y_j)$, where $f$ is the non-linear activation function, e.g. the sigmoid
- $net_k = \sum_{j=1}^h w_{kj} y_j$
- $y_j = f(\sum_{i=1}^d w_{ji} x_i)$
- $net_j = \sum_{i=1}^d w_{ji} x_i$
- $x_i$ represents the input value on the i-th dimension.
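The notation above can be mirrored directly in code. A forward-pass sketch under the assumption that $f$ is the sigmoid (the layer sizes and variable names are arbitrary examples):

```python
import numpy as np

def f(net):
    # Sigmoid nonlinearity; its derivative is f(net) * (1 - f(net)).
    return 1.0 / (1.0 + np.exp(-net))

d, h, c = 4, 3, 2                   # input, hidden, output sizes (example values)
rng = np.random.default_rng(0)
x = rng.normal(size=d)              # x_i: input vector
W_ji = rng.normal(size=(h, d))      # w_ji: input -> hidden weights
W_kj = rng.normal(size=(c, h))      # w_kj: hidden -> output weights

net_j = W_ji @ x                    # net_j = sum_i w_ji * x_i
y_hid = f(net_j)                    # y_j = f(net_j)
net_k = W_kj @ y_hid                # net_k = sum_j w_kj * y_j
z = f(net_k)                        # z_k = f(net_k)
```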
- Next we compute the derivatives with respect to $w_{kj}$ and $w_{ji}$ separately.
- Derivation for $w_{kj}$:
$$
\frac{\partial L}{\partial w_{kj}}
= (y_k - z_k) \cdot (-1) \cdot \frac{\partial z_k}{\partial w_{kj}}
= -(y_k - z_k) \cdot \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}}
= -(y_k - z_k) \, f'(net_k) \, y_j
$$
$$
\therefore\ w_{kj} \leftarrow w_{kj} - \alpha \Delta w_{kj} = w_{kj} + \alpha \, (y_k - z_k) \, f'(net_k) \, y_j
$$

- Derivation for $w_{ji}$:
$$
\frac{\partial L}{\partial w_{ji}}
= \frac{\partial L}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}
$$
$$
\frac{\partial L}{\partial y_j}
= \sum_{k} \frac{\partial L}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j}
= \sum_{k} -(y_k - z_k) \, f'(net_k) \, w_{kj}
$$
$$
\therefore\
\frac{\partial L}{\partial w_{ji}}
= \Big(\sum_{k} -(y_k - z_k) \, f'(net_k) \, w_{kj}\Big) \cdot f'(net_j) \cdot x_i
$$

where the sum over $k$ runs over all output units.

- At this point we have computed the gradients of every parameter in these two layers, so we can update them all via backpropagation.
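Putting both gradients together, here is a backward-pass sketch (sigmoid $f$ assumed; function and variable names are my own), which can be verified against numerical differentiation of the loss:

```python
import numpy as np

def f(net):
    # Sigmoid activation.
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W_ji, W_kj):
    y_hid = f(W_ji @ x)          # y_j = f(net_j)
    z = f(W_kj @ y_hid)          # z_k = f(net_k)
    return y_hid, z

def loss(x, y_tgt, W_ji, W_kj):
    _, z = forward(x, W_ji, W_kj)
    return 0.5 * np.sum((y_tgt - z) ** 2)   # L = 1/2 * |y - z|^2

def gradients(x, y_tgt, W_ji, W_kj):
    y_hid, z = forward(x, W_ji, W_kj)
    # dL/dnet_k = -(y_k - z_k) * f'(net_k), with f'(net) = f(net) * (1 - f(net))
    delta_k = -(y_tgt - z) * z * (1 - z)
    grad_W_kj = np.outer(delta_k, y_hid)            # dL/dw_kj = delta_k * y_j
    # dL/dnet_j = (sum_k delta_k * w_kj) * f'(net_j)
    delta_j = (W_kj.T @ delta_k) * y_hid * (1 - y_hid)
    grad_W_ji = np.outer(delta_j, x)                # dL/dw_ji = delta_j * x_i
    return grad_W_kj, grad_W_ji
```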
- As for computing $f'(net_k)$: if $f$ is the sigmoid (whose derivative matches the $i = k$ case of the softmax gradient above), then $f'(net_k) = f(net_k)\,(1 - f(net_k)) = z_k(1 - z_k)$.
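A quick check of this sigmoid-derivative identity against a central finite difference (a sketch, sigmoid assumed):

```python
import numpy as np

def f(net):
    # Sigmoid activation.
    return 1.0 / (1.0 + np.exp(-net))

net = np.linspace(-3.0, 3.0, 7)
analytic = f(net) * (1 - f(net))                    # f'(net) = f(net) * (1 - f(net))
numeric = (f(net + 1e-6) - f(net - 1e-6)) / 2e-6    # central finite difference
```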