From the blog
Let’s walk through a very simple handwritten derivation of the backpropagation formulas.
Definitions
First, let's define some variables and notation:
- W(l): the weight matrix of layer l, with L denoting the last layer.
- a(l): the activated input to layer l, i.e. the activated output of layer l-1.
- dF(l): the derivative of the activation function of layer l.
- dLoss: the error at the last layer's output (the derivative of the loss with respect to the network output).
- dW(l): the gradient of the loss with respect to W(l).
Gradient of the weights in layer L (the last layer):
dW(L) = dLoss * a(L)
Gradient of the weights in layer L-1:
dW(L-1) = dLoss * W(L) * dF(L-1) * a(L-1)
Gradient of the weights in layer L-2:
dW(L-2) = dLoss * W(L) * dF(L-1) * W(L-1) * dF(L-2) * a(L-2)
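These products can be checked numerically. Below is a minimal NumPy sketch; the layer sizes, the sigmoid activations, and the squared-error loss are my own assumptions, and the last layer is taken as linear so that dLoss is simply the output error. It computes dW for the last three layers as exactly the chains of W, dF, and a written above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A tiny 3-layer MLP: input -> layer L-2 -> layer L-1 -> layer L (linear output)
sizes = [4, 5, 5, 1]
W = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]

# Forward pass, keeping each layer's input a and pre-activation z
a0 = rng.normal(size=(4, 1))            # a(L-2): input to layer L-2
z1 = W[0] @ a0;  a1 = sigmoid(z1)       # a(L-1): input to layer L-1
z2 = W[1] @ a1;  a2 = sigmoid(z2)       # a(L):   input to layer L
z3 = W[2] @ a2;  a3 = z3                # linear output of layer L

y = np.array([[1.0]])
dLoss = a3 - y                          # derivative of 0.5*(a3 - y)^2 w.r.t. the output

# Gradients, written as the same products as in the text:
dW_L   = dLoss @ a2.T                              # dLoss * a(L)
delta2 = (W[2].T @ dLoss) * dsigmoid(z2)           # dLoss * W(L) * dF(L-1)
dW_Lm1 = delta2 @ a1.T                             # ...   * a(L-1)
delta1 = (W[1].T @ delta2) * dsigmoid(z1)          # ...   * W(L-1) * dF(L-2)
dW_Lm2 = delta1 @ a0.T                             # ...   * a(L-2)
```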
Summary
So, as we can see, the gradient of any trainable weight depends only on the weights (W), the derivatives of the activation function (dF), and the activated values (a) of the layers above it, multiplied together layer by layer.
Relation to gradient vanishing and exploding
Gradient exploding
The gradient explodes when the training variables, the derivatives of the activation function, or the activated values are consistently larger than 1, because their product grows exponentially with depth.
Gradient vanishing
The gradient vanishes when the training variables, the derivatives of the activation function, or the activated values are consistently smaller than 1, because their product shrinks exponentially with depth; the toy example below illustrates both cases.
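A toy illustration with made-up scalar factors (not a real network): raising a per-layer factor slightly above or below 1 to the power of the depth makes the product blow up or shrink to almost nothing.

```python
depth = 50
per_layer_large = 1.5   # |W * dF| slightly above 1 at every layer
per_layer_small = 0.5   # |W * dF| slightly below 1 at every layer

print(per_layer_large ** depth)   # ~6.4e8   -> the gradient explodes
print(per_layer_small ** depth)   # ~8.9e-16 -> the gradient vanishes
```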
How to prevent gradient vanishing or exploding
From the view of the training variables
To keep the training variables in a proper range, we should use a good weight initialization technique, such as Xavier initialization.
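A minimal sketch of the uniform variant of Xavier (Glorot) initialization; the fan-in/fan-out sizes below are arbitrary examples.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng()):
    """Xavier/Glorot uniform init: keeps the variance of activations and
    gradients roughly constant from layer to layer."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = xavier_uniform(256, 128)
print(W.std())   # close to sqrt(2 / (fan_in + fan_out)) ≈ 0.072
```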
From the view of the derivative of the activation function
To keep the derivative of the activation function in a proper range, we should use a non-saturating activation function, such as ReLU, instead of sigmoid.
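To see why saturation matters: the derivative of sigmoid is at most 0.25 and falls toward 0 for large inputs, while the derivative of ReLU stays exactly 1 for every positive input. A small NumPy sketch with an arbitrary input grid:

```python
import numpy as np

z = np.linspace(-6, 6, 7)

sigmoid = 1.0 / (1.0 + np.exp(-z))
dsigmoid = sigmoid * (1.0 - sigmoid)   # saturates: max 0.25, nearly 0 for large |z|
drelu = (z > 0).astype(float)          # non-saturating: 1 for every positive z

print(dsigmoid.round(4))   # approx [0.0025 0.0177 0.105 0.25 0.105 0.0177 0.0025]
print(drelu)               # [0. 0. 0. 0. 1. 1. 1.]
```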
From the view of the activated values
To keep the activated values in a proper range, we should use batch normalization to make them zero-centered with unit variance.
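A minimal sketch of the normalization step of batch norm; the learnable scale/shift parameters and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch to zero mean and unit variance.
    Real batch norm also applies a learnable gamma/beta and keeps running
    statistics for inference; both are omitted in this sketch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=10.0, size=(32, 8))
x_hat = batch_norm(x)
print(x_hat.mean(axis=0).round(6))   # ~0 for every feature
print(x_hat.std(axis=0).round(3))    # ~1 for every feature
```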
From the view of model structure
To further strengthen the gradient flowing to the shallow layers, we should build our network with residual blocks.
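A minimal NumPy sketch of a residual block. Because the output is x + F(x), the Jacobian with respect to x contains an identity term, so some gradient always reaches the shallower layers even when the gradient through F is small. The two-layer form of F below is just an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x), with F a small two-layer transformation.
    The skip connection adds an identity path for the gradient."""
    return x + W2 @ relu(W1 @ x)

d = 16
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
x = rng.normal(size=(d, 1))

y = residual_block(x, W1, W2)
# dy/dx = I + dF/dx: even if dF/dx is near zero, the gradient to x stays near identity.
```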