Setting up your Machine Learning Application
Train/Dev/Test Sets
- Train set: train candidate models.
- Development (dev) set: tune hyperparameters and compare different algorithms.
- Test set: get an unbiased estimate of the final model's performance.
- Make sure the dev and test sets come from the same distribution (see the split sketch below).
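A minimal sketch of such a split, assuming the data lives in numpy arrays `X` and `y` (the names and the 1%/1% fractions are illustrative); a single shuffle before splitting is what keeps the dev and test sets on the same distribution:

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle once, then carve dev and test sets off the same distribution."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)                 # one shuffle => same distribution
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return (X[train_idx], y[train_idx],
            X[dev_idx], y[dev_idx],
            X[test_idx], y[test_idx])
```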
Bias/Variance
- Basic Recipe: high bias (poor training performance) → try a bigger network or train longer; high variance (poor dev performance) → get more data or add regularization.
Regularizing your Neural Network
L2-Regularization
$$J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\| w^{[l]} \right\|_{F}^{2}$$

where $\left\| w^{[l]} \right\|_{F}^{2}$ is the squared Frobenius norm, i.e. the sum of squares of all entries of $w^{[l]}$.
$$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}$$
$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha\,(\text{from backprop})$$

The $\left(1 - \frac{\alpha \lambda}{m}\right)$ factor shrinks the weights a little at every step, which is why L2 regularization is also called "weight decay".
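A minimal numpy sketch of the three formulas above, assuming `weights` is a list of the $W^{[l]}$ matrices and that `dW_backprop` comes from your own backward pass (both names are illustrative):

```python
import numpy as np

def l2_cost_term(weights, lambd, m):
    """lambda/(2m) * sum over layers of ||W[l]||_F^2."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_gradient(dW_backprop, W, lambd, m):
    """dW[l] = (from backprop) + (lambda/m) * W[l]."""
    return dW_backprop + (lambd / m) * W

def update_with_weight_decay(W, dW, alpha):
    """W[l] := W[l] - alpha * dW[l]; the (1 - alpha*lambda/m) shrink factor
    is already folded into dW via l2_gradient, hence 'weight decay'."""
    return W - alpha * dW
```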
Why does Regularization reduce overfitting?
- Adding the regularization term $\frac{\lambda}{2m} \sum_{l=1}^{L} \left\| w^{[l]} \right\|_{F}^{2}$ penalizes large weight matrices, pushing the entries of $W$ reasonably close to zero. Many hidden units then have little effect, so the network behaves like a much simpler, smaller one that is less prone to overfitting.
- With the tanh activation, regularization keeps the parameters relatively small, so $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ stays small and $g(z) = \tanh(z)$ operates in its roughly linear regime. Every layer then computes an approximately linear function, which cannot fit an overly complicated decision boundary.
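A quick numeric check of that claim (the values are just illustrative): for small $z$, $\tanh(z) \approx z$, its first-order Taylor approximation, so a layer with small weights behaves almost linearly.

```python
import numpy as np

z = np.array([0.01, 0.1, 0.5, 2.0])
print(np.tanh(z))      # ~[0.0100, 0.0997, 0.4621, 0.9640]
print(np.tanh(z) - z)  # tiny for small z, far from 0 once z leaves the linear regime
```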
Dropout Regularization
When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of any other specific neuron, because that neuron might be shut down at any time.
- Inverted dropout (a runnable sketch follows this list):
$keep\_prob = 0.7$

$d = np.random.rand(a^{[l]}.shape[0], a^{[l]}.shape[1]) <= keep\_prob$

$a^{[l]} = a^{[l]} * d$

$a^{[l]} = a^{[l]} \,/\, keep\_prob$
- Intuition: the network can't rely on any single feature, so it has to spread out the weights (an effect similar to shrinking the weights).
- We don't use dropout at test time; thanks to the division by $keep\_prob$ during training, the expected activations already match and no extra scaling is needed.
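A runnable sketch of the inverted-dropout step for one layer's activations `a` (the function name is illustrative); `<` and `<=` are interchangeable here because the draw is continuous uniform:

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.7, seed=None):
    """Training-time inverted dropout for an activation matrix a."""
    rng = np.random.default_rng(seed)
    d = rng.random(a.shape) < keep_prob  # keep mask: True with probability keep_prob
    a = a * d                            # shut down the dropped neurons
    a = a / keep_prob                    # scale up so E[a] is unchanged
    return a, d                          # cache d to apply the same mask in backprop

# The backward pass mirrors it: da = (da * d) / keep_prob
```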
Setting up your Optimization Problem
Normalizing Inputs
Normalizing the inputs makes the cost function more symmetric (less elongated), so gradient descent can take larger steps and converges faster.
- Zero-mean normalization
$\mu = \frac{1}{m}\sum_{i=1}^{m} x_{i}$

$\sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}(x_{i}-\mu)^{2}$

$x_{i} := \frac{x_{i} - \mu}{\sigma}$
This makes each feature of the training set have mean 0 and variance 1. Use the same $\mu$ and $\sigma$, computed on the training set, to normalize the dev and test sets.
- Min-max normalization
$x_{i} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$
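A sketch of both normalizations, assuming `X_train` and `X_test` are `(m, n)` arrays (names illustrative); note that the statistics come from the training set only:

```python
import numpy as np

def zero_mean_normalize(X_train, X_test):
    """Standardize each feature; reuse training statistics on the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

def min_max_normalize(X_train, X_test):
    """Rescale each feature to [0, 1] using training-set min and max."""
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    return ((X_train - x_min) / (x_max - x_min),
            (X_test - x_min) / (x_max - x_min))
```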
Initializing your Weights Randomly
If your weights $W$ are too large or too small, they can cause exploding or vanishing gradients. To mitigate this, scale the random initialization by the size of the previous layer:
$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{2}{n^{[l-1]}})$ --for ReLU ("He initialization")

or

$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{1}{n^{[l-1]}})$ --for tanh ("Xavier initialization")

or

$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{2}{n^{[l-1]} + n^{[l]}})$ --another Xavier/Glorot variant

(Note `np.random.randn`, a zero-mean Gaussian, not `np.random.rand`, which is uniform and all-positive; $n^{[l]}$ denotes the number of units in layer $l$.)
⨂ Why? These scalings keep the variance of $z^{[l]} = \sum_{i=1}^{n^{[l-1]}} w_{i} a_{i}$ roughly constant from layer to layer: $\mathrm{Var}(z) \approx n^{[l-1]}\,\mathrm{Var}(w)\,\mathrm{Var}(a)$, so setting $\mathrm{Var}(w) = \frac{1}{n^{[l-1]}}$ (or $\frac{2}{n^{[l-1]}}$ for ReLU, which zeroes out about half of its inputs) prevents activations and gradients from growing or shrinking exponentially with depth.
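A sketch of all three scalings, with `layer_dims = [n0, n1, ..., nL]` giving the layer sizes (the function and variant names are illustrative):

```python
import numpy as np

def initialize_weights(layer_dims, variant="he"):
    """Scaled random init: 'he' for ReLU, 'xavier' for tanh, else Glorot."""
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_cur = layer_dims[l - 1], layer_dims[l]
        if variant == "he":        # sqrt(2 / n[l-1])
            scale = np.sqrt(2.0 / n_prev)
        elif variant == "xavier":  # sqrt(1 / n[l-1])
            scale = np.sqrt(1.0 / n_prev)
        else:                      # sqrt(2 / (n[l-1] + n[l]))
            scale = np.sqrt(2.0 / (n_prev + n_cur))
        params["W" + str(l)] = np.random.randn(n_cur, n_prev) * scale
        params["b" + str(l)] = np.zeros((n_cur, 1))
    return params
```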
Gradient Checking
Backpropagation computes the gradients ∂ J ∂ θ \frac{\partial J}{\partial \theta} ∂θ∂J, where θ \theta θ denotes the parameters of the model. J J J is computed using forward propagation and your loss function.
Because forward propagation is relatively easy to implement, you’re confident you got that right, and so you’re almost 100% sure that you’re computing the cost J J J correctly. Thus, you can use your code for computing J J J to verify the code for computing ∂ J ∂ θ \frac{\partial J}{\partial \theta} ∂θ∂J.
Let’s look back at the definition of a derivative (or gradient):
$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$
We know the following:
- ∂ J ∂ θ \frac{\partial J}{\partial \theta} ∂θ∂J is what you want to make sure you’re computing correctly.
- You can compute J ( θ + ε ) J(\theta + \varepsilon) J(θ+ε) and J ( θ − ε ) J(\theta - \varepsilon) J(θ−ε) (in the case that θ \theta θ is a real number), since you’re confident your implementation for J J J is correct.
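A minimal sketch of the resulting check, assuming `J` is a callable taking a flat parameter vector `theta` and `grad` is the backprop gradient you want to verify (all names illustrative):

```python
import numpy as np

def gradient_check(J, theta, grad, epsilon=1e-7):
    """Compare grad against the centered difference from equation (1)."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon   # J(theta + eps) along one coordinate
        theta_minus[i] -= epsilon  # J(theta - eps) along the same coordinate
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    # Relative difference: around 1e-7 is great; above 1e-3 usually means a bug.
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator
```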