Setting up your Machine Learning Application
Train/Dev/Test Sets
- Train set: train candidate models.
- Development (dev) set: tune hyperparameters and compare different algorithms.
- Test set: get an unbiased estimate of the final model's performance.
- Make sure the dev and test sets come from the same distribution (see the split sketch below).
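A minimal sketch of such a split, assuming the data lives in numpy arrays `X` and `y` (the names and the 1%/1% fractions are illustrative); a single shuffle before splitting is what keeps the dev and test sets on the same distribution:

```python
import numpy as np

def train_dev_test_split(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle once, then carve dev and test sets off the same distribution."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)                 # one shuffle => same distribution
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return (X[train_idx], y[train_idx],
            X[dev_idx], y[dev_idx],
            X[test_idx], y[test_idx])
```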
Bias/Variance
- Basic Recipe: high bias (poor training performance) → try a bigger network or train longer; high variance (poor dev performance) → get more data or add regularization.
Regularizing your Neural Network
L2-Regularization
$$J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\| w^{[l]} \right\|_{F}^{2}$$

where $\left\| w^{[l]} \right\|_{F}^{2}$ is the squared Frobenius norm, i.e. the sum of squares of all entries of $w^{[l]}$.
$$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}$$
$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right) W^{[l]} - \alpha\,(\text{from backprop})$$

The $\left(1 - \frac{\alpha \lambda}{m}\right)$ factor shrinks the weights a little at every step, which is why L2 regularization is also called "weight decay".
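A minimal numpy sketch of the three formulas above, assuming `weights` is a list of the $W^{[l]}$ matrices and that `dW_backprop` comes from your own backward pass (both names are illustrative):

```python
import numpy as np

def l2_cost_term(weights, lambd, m):
    """lambda/(2m) * sum over layers of ||W[l]||_F^2."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_gradient(dW_backprop, W, lambd, m):
    """dW[l] = (from backprop) + (lambda/m) * W[l]."""
    return dW_backprop + (lambd / m) * W

def update_with_weight_decay(W, dW, alpha):
    """W[l] := W[l] - alpha * dW[l]; the (1 - alpha*lambda/m) shrink factor
    is already folded into dW via l2_gradient, hence 'weight decay'."""
    return W - alpha * dW
```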
Why does Regularization reduce overfitting?
- Adding the regularization term $\frac{\lambda}{2m} \sum_{l=1}^{L} \left\| w^{[l]} \right\|_{F}^{2}$ penalizes large weight matrices, pushing the entries of $W$ reasonably close to zero. Many hidden units then have little effect, so the network behaves like a much simpler, smaller one that is less prone to overfitting.
- With the tanh activation, regularization keeps the parameters relatively small, so $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ stays small and $g(z) = \tanh(z)$ operates in its roughly linear regime. Every layer then computes an approximately linear function, which cannot fit an overly complicated decision boundary.
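A quick numeric check of that claim (the values are just illustrative): for small $z$, $\tanh(z) \approx z$, its first-order Taylor approximation, so a layer with small weights behaves almost linearly.

```python
import numpy as np

z = np.array([0.01, 0.1, 0.5, 2.0])
print(np.tanh(z))      # ~[0.0100, 0.0997, 0.4621, 0.9640]
print(np.tanh(z) - z)  # tiny for small z, far from 0 once z leaves the linear regime
```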
Dropout Regularization
When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of any other specific neuron, because that neuron might be shut down at any time.
- Inverted dropout (a runnable sketch follows this list):
$keep\_prob = 0.7$

$d = np.random.rand(a^{[l]}.shape[0], a^{[l]}.shape[1]) <= keep\_prob$

$a^{[l]} = a^{[l]} * d$

$a^{[l]} = a^{[l]} \,/\, keep\_prob$
- Intuition: the network can't rely on any single feature, so it has to spread out the weights (an effect similar to shrinking the weights).
- We don't use dropout at test time; thanks to the division by $keep\_prob$ during training, the expected activations already match and no extra scaling is needed.
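A runnable sketch of the inverted-dropout step for one layer's activations `a` (the function name is illustrative); `<` and `<=` are interchangeable here because the draw is continuous uniform:

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.7, seed=None):
    """Training-time inverted dropout for an activation matrix a."""
    rng = np.random.default_rng(seed)
    d = rng.random(a.shape) < keep_prob  # keep mask: True with probability keep_prob
    a = a * d                            # shut down the dropped neurons
    a = a / keep_prob                    # scale up so E[a] is unchanged
    return a, d                          # cache d to apply the same mask in backprop

# The backward pass mirrors it: da = (da * d) / keep_prob
```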
Setting up your Optimization Problem
Normalizing Inputs
Normalizing the inputs makes the cost function more symmetric (less elongated), so gradient descent can take larger steps and converges faster.
- Zero-mean normalization
$\mu = \frac{1}{m}\sum_{i=1}^{m} x_{i}$

$\sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}(x_{i}-\mu)^{2}$

$x_{i} := \frac{x_{i} - \mu}{\sigma}$
This makes each feature of the training set have mean 0 and variance 1. Use the same $\mu$ and $\sigma$, computed on the training set, to normalize the dev and test sets.
- Min-max normalization
$x_{i} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$
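A sketch of both normalizations, assuming `X_train` and `X_test` are `(m, n)` arrays (names illustrative); note that the statistics come from the training set only:

```python
import numpy as np

def zero_mean_normalize(X_train, X_test):
    """Standardize each feature; reuse training statistics on the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

def min_max_normalize(X_train, X_test):
    """Rescale each feature to [0, 1] using training-set min and max."""
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    return ((X_train - x_min) / (x_max - x_min),
            (X_test - x_min) / (x_max - x_min))
```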
Initializing your Weights Randomly
If your weights $W$ are too large or too small, they can cause exploding or vanishing gradients. To mitigate this, scale the random initialization by the size of the previous layer:
$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{2}{n^{[l-1]}})$ --for ReLU ("He initialization")

or

$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{1}{n^{[l-1]}})$ --for tanh ("Xavier initialization")

or

$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{2}{n^{[l-1]} + n^{[l]}})$ --another Xavier/Glorot variant

(Note `np.random.randn`, a zero-mean Gaussian, not `np.random.rand`, which is uniform and all-positive; $n^{[l]}$ denotes the number of units in layer $l$.)
⨂ Why? These scalings keep the variance of $z^{[l]} = \sum_{i=1}^{n^{[l-1]}} w_{i} a_{i}$ roughly constant from layer to layer: $\mathrm{Var}(z) \approx n^{[l-1]}\,\mathrm{Var}(w)\,\mathrm{Var}(a)$, so setting $\mathrm{Var}(w) = \frac{1}{n^{[l-1]}}$ (or $\frac{2}{n^{[l-1]}}$ for ReLU, which zeroes out about half of its inputs) prevents activations and gradients from growing or shrinking exponentially with depth.
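A sketch of all three scalings, with `layer_dims = [n0, n1, ..., nL]` giving the layer sizes (the function and variant names are illustrative):

```python
import numpy as np

def initialize_weights(layer_dims, variant="he"):
    """Scaled random init: 'he' for ReLU, 'xavier' for tanh, else Glorot."""
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_cur = layer_dims[l - 1], layer_dims[l]
        if variant == "he":        # sqrt(2 / n[l-1])
            scale = np.sqrt(2.0 / n_prev)
        elif variant == "xavier":  # sqrt(1 / n[l-1])
            scale = np.sqrt(1.0 / n_prev)
        else:                      # sqrt(2 / (n[l-1] + n[l]))
            scale = np.sqrt(2.0 / (n_prev + n_cur))
        params["W" + str(l)] = np.random.randn(n_cur, n_prev) * scale
        params["b" + str(l)] = np.zeros((n_cur, 1))
    return params
```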
Gradient Checking
Backpropagation computes the gradients ∂ J ∂ θ \frac{\partial J}{\partial \theta} ∂θ∂J, where θ \theta θ denotes the parameters of the model. J J J is computed using forward propagation and your loss function.
Because forward propagation is relatively easy to implement, you’re confident you got that right, and so you’re almost 100% sure that you’re computing the cost J J J correctly. Thus, you can use your code for computing J J J to verify the code for computing ∂ J ∂ θ \frac{\partial J}{\partial \theta} ∂θ∂J.
Let’s look back at the definition of a derivative (or gradient):
$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$
We know the following:
- ∂ J ∂ θ \frac{\partial J}{\partial \theta} ∂θ∂J is what you want to make sure you’re computing correctly.
- You can compute J ( θ + ε ) J(\theta + \varepsilon) J(θ+ε) and J ( θ − ε ) J(\theta - \varepsilon) J(θ−ε) (in the case that θ \theta θ is a real number), since you’re confident your implementation for J J J is correct.
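A minimal sketch of the resulting check, assuming `J` is a callable taking a flat parameter vector `theta` and `grad` is the backprop gradient you want to verify (all names illustrative):

```python
import numpy as np

def gradient_check(J, theta, grad, epsilon=1e-7):
    """Compare grad against the centered difference from equation (1)."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon   # J(theta + eps) along one coordinate
        theta_minus[i] -= epsilon  # J(theta - eps) along the same coordinate
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    # Relative difference: around 1e-7 is great; above 1e-3 usually means a bug.
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator
```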