Class2-Week1-Improving Deep Neural Networks

Setting up your Machine Learning Application

Train/Dev/Test Sets

  1. Training set: used to train the models.
  2. Development (dev) set: used to tune hyperparameters and compare different algorithms/models.
  3. Test set: used to get an unbiased estimate of the final model's performance.
  • Make sure the dev and test sets come from the same distribution.

Bias/Variance

High bias means the model underfits: the error is high even on the training set. High variance means the model overfits: the training error is low but the dev-set error is much higher. Comparing training-set error with dev-set error tells you which problem you have.

  • Basic Recipe:
    1. High bias (poor performance on the training set)? Try a bigger network, train longer, or try a different architecture.
    2. High variance (much worse performance on the dev set than on the training set)? Get more data, add regularization (L2, dropout), or try a different architecture.

Regularizing Your Neural Network

L2-Regularization

$J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L}\left\| w^{[l]} \right\|_{2}^{2}$

$dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m}W^{[l]}$

$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right)W^{[l]} - \alpha\,(\text{from backprop})$, which is why L2 regularization is also called "weight decay".
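As a minimal numpy sketch (assuming the parameters are stored in a dictionary like {"W1": ..., "b1": ...} and using `lambd` for $\lambda$, both illustrative choices rather than the course's exact code), the regularized cost and the weight-decay update could look like this:

```python
import numpy as np

def compute_cost_with_l2(cross_entropy_cost, parameters, lambd, m):
    """Add the L2 penalty (lambd / (2m)) * sum_l ||W[l]||_2^2 to the unregularized cost."""
    L = len(parameters) // 2                     # parameters = {"W1": ..., "b1": ..., "W2": ..., ...}
    l2_penalty = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return cross_entropy_cost + (lambd / (2 * m)) * l2_penalty

def update_weights_with_decay(W, dW_from_backprop, lambd, m, learning_rate):
    """One gradient step including the extra (lambd / m) * W term, i.e. weight decay."""
    dW = dW_from_backprop + (lambd / m) * W
    return W - learning_rate * dW                # equals (1 - learning_rate*lambd/m) * W - learning_rate * dW_from_backprop
```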

Why does regularization reduce overfitting?

  1. The regularization term $\frac{\lambda}{2m}\sum_{l=1}^{L}\left\| w^{[l]} \right\|_{2}^{2}$ penalizes large weights, pushing the weight matrices $W$ closer to zero. Intuitively, this is like reducing the influence of many hidden units, leaving a much simpler (effectively smaller) network that is less able to overfit.

  2. With the tanh activation function, regularization keeps the weights, and therefore $z = Wa + b$, relatively small, so $g(z)$ stays in the region where tanh is roughly linear. Every layer then behaves almost linearly, and a nearly linear network cannot fit very complicated decision boundaries, which reduces overfitting.

Dropout Regularization

When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

  1. Inverted dropout

$keep\_prob = 0.7$

$d^{[l]} = np.random.rand(a^{[l]}.shape[0], a^{[l]}.shape[1]) <= keep\_prob$

$a^{[l]} = a^{[l]} * d^{[l]}$

$a^{[l]} = a^{[l]} / keep\_prob$

  • Intuition: the network can't rely on any single feature, so it has to spread out (and hence shrink) the weights, which has an effect similar to L2 regularization.
  • We don't use dropout at test time; the division by $keep\_prob$ during training keeps the expected activations unchanged, so no extra scaling is needed when testing (see the sketch below).
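Putting the training-time steps above together, a minimal numpy sketch of inverted dropout for a single layer (the function and variable names are illustrative, not the course's exact code):

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob=0.7):
    """Apply inverted dropout to the activations a of one layer during training."""
    d = np.random.rand(a.shape[0], a.shape[1]) <= keep_prob  # mask: keep each unit with probability keep_prob
    a = a * d                                                 # shut the dropped units down
    a = a / keep_prob                                         # scale up so the expected value of a is unchanged
    return a, d                                               # keep the mask d so backprop can reuse it

# usage (training time only)
np.random.seed(1)
a3 = np.random.randn(5, 10)                  # activations of some layer, shape (units, examples)
a3, d3 = inverted_dropout_forward(a3, keep_prob=0.7)
```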

Setting up your Optimization Problem

Normalizing Inputs

Normalizing the inputs makes the cost function less elongated and more symmetric, so gradient descent can use a larger learning rate and converges faster.

  • Zero-mean normalization
    $\mu = \frac{1}{m}\sum_{i=1}^{m}x_{i}$

$\sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}(x_{i}-\mu)^{2}$

$x_{i} = \frac{x_{i} - \mu}{\sigma}$

This normalization gives the data set zero mean and unit variance. Use the same $\mu$ and $\sigma$, computed on the training set, to normalize the dev and test sets (a short code sketch follows after this list).

  • Min-max normalization
    $x_{i} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$
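A short numpy sketch of zero-mean normalization that reuses the training-set $\mu$ and $\sigma$ on other splits (the function names and the (features, m) data layout are assumptions for illustration):

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mu and sigma on the training set; X has shape (features, m)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set mu and sigma to any split (train, dev, or test)."""
    return (X - mu) / sigma

X_train = np.random.randn(3, 100) * 5.0 + 2.0    # toy data with non-zero mean and large variance
mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)     # now mean ~ 0 and variance ~ 1 per feature
```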

Initializing your Weights Randomly

If your weights $W$ are initialized too large or too small, the activations and gradients can explode or vanish as they propagate through a deep network. To mitigate this, scale the random initialization by the size of the previous layer:

$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{2}{n^{[l-1]}})$ -- ReLU ("He initialization")

or

$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{1}{n^{[l-1]}})$ -- tanh ("Xavier initialization")

or

$W^{[l]} = np.random.randn(n^{[l]}, n^{[l-1]}) * np.sqrt(\frac{2}{n^{[l-1]} + n^{[l]}})$

where $n^{[l]}$ is the number of units in layer $l$ and np.random.randn draws from a standard normal distribution.

Why these scalings work: if the $n^{[l-1]}$ inputs to a unit have roughly unit variance and each weight has variance $\frac{1}{n^{[l-1]}}$ (or $\frac{2}{n^{[l-1]}}$ for ReLU, which zeroes out about half of its inputs), then $z = \sum_{i} w_{i} x_{i}$ also has roughly unit variance. Activations and gradients therefore stay on a similar scale from layer to layer instead of exploding or vanishing.
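A brief sketch of parameter initialization with the ReLU (He) scaling above, assuming `layer_dims` lists the number of units per layer (the helper name and layout are illustrative):

```python
import numpy as np

def initialize_parameters_he(layer_dims):
    """He initialization: W[l] ~ N(0, 2 / n[l-1]) and b[l] = 0."""
    parameters = {}
    L = len(layer_dims)                          # e.g. layer_dims = [n_x, 10, 5, 1]
    for l in range(1, L):
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

parameters = initialize_parameters_he([4, 10, 5, 1])   # W1: (10, 4), W2: (5, 10), W3: (1, 5)
```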

Gradient Checking

Backpropagation computes the gradients $\frac{\partial J}{\partial \theta}$, where $\theta$ denotes the parameters of the model. $J$ is computed using forward propagation and your loss function.

Because forward propagation is relatively easy to implement, you're confident you got that right, and so you're almost 100% sure that you're computing the cost $J$ correctly. Thus, you can use your code for computing $J$ to verify the code for computing $\frac{\partial J}{\partial \theta}$.

Let’s look back at the definition of a derivative (or gradient):
$$\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$

We know the following:

  • $\frac{\partial J}{\partial \theta}$ is what you want to make sure you're computing correctly.
  • You can compute $J(\theta + \varepsilon)$ and $J(\theta - \varepsilon)$ (in the case that $\theta$ is a real number), since you're confident your implementation for $J$ is correct.
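Based on equation (1), a minimal gradient-checking sketch for a single scalar parameter $\theta$; the `J` and `dJ` callables stand in for your own cost and gradient code, and the toy example is purely illustrative:

```python
def gradient_check_scalar(J, dJ, theta, epsilon=1e-7):
    """Compare the analytic derivative dJ(theta) with the centered difference from equation (1)."""
    grad_approx = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)
    grad = dJ(theta)
    # relative difference; values around 1e-7 or smaller usually mean the gradient code is correct
    return abs(grad - grad_approx) / (abs(grad) + abs(grad_approx))

# usage with a toy cost J(theta) = theta * x for a fixed input x
x = 3.0
difference = gradient_check_scalar(lambda t: t * x, lambda t: x, theta=4.0)
print(difference)   # tiny (dominated by floating-point rounding), so the analytic gradient x is correct
```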