Neural Network Representation
Like the logistic regression architecture, you can form a neural network simply by stacking many small sigmoid units, as shown in the figure below:
Compute a Neural Network’s Output
First, we set:
$$W^{[1]} = \begin{bmatrix} \cdots & w_{1}^{[1]} & \cdots \\ \cdots & w_{2}^{[1]} & \cdots \\ \cdots & w_{3}^{[1]} & \cdots \end{bmatrix}$$
$$b^{[1]} = \begin{bmatrix} b_{1}^{[1]} \\ b_{2}^{[1]} \\ b_{3}^{[1]} \end{bmatrix}$$
$$X = A^{[0]} = \begin{bmatrix} \vdots & \vdots & \vdots \\ a^{[0](1)} & a^{[0](2)} & a^{[0](3)} \\ \vdots & \vdots & \vdots \end{bmatrix}$$
- Horizontally, the matrices $A$/$Z$ go over the different training examples.
- Vertically, the indices in the matrices $A$/$Z$ go over the different hidden units of one layer.
$$Z^{[1]} = W^{[1]}X + b^{[1]}$$
$$A^{[1]} = g^{[1]}(Z^{[1]})$$
$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$
$$A^{[2]} = g^{[2]}(Z^{[2]})$$
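As a concrete illustration, here is a minimal vectorized forward pass for the two-layer network above, written in NumPy. The concrete sizes (2 inputs, 3 hidden units, 1 output, 5 examples) and the choice of tanh/sigmoid for $g^{[1]}$/$g^{[2]}$ are assumptions made only for this sketch.

```python
import numpy as np

np.random.seed(0)
n_x, n_h, n_y, m = 2, 3, 1, 5          # assumed sizes: 2 inputs, 3 hidden units, 1 output, 5 examples

X = np.random.randn(n_x, m)            # A^[0]: each column is one training example
W1 = np.random.randn(n_h, n_x) * 0.01  # W^[1] has shape (n^[1], n^[0])
b1 = np.zeros((n_h, 1))                # b^[1] has shape (n^[1], 1) and broadcasts over the m columns
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))

Z1 = W1 @ X + b1                       # Z^[1] = W^[1] X + b^[1]
A1 = np.tanh(Z1)                       # A^[1] = g^[1](Z^[1]), assuming g^[1] = tanh
Z2 = W2 @ A1 + b2                      # Z^[2] = W^[2] A^[1] + b^[2]
A2 = 1 / (1 + np.exp(-Z2))             # A^[2] = g^[2](Z^[2]), assuming g^[2] = sigmoid

print(A2.shape)                        # (1, 5): one prediction per training example
```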
Activation Function
Why do you Need Non-linear Activation Functions?
A linear activation function makes the neural network output just a linear function of its input, no matter how many layers the network contains. Sometimes, however, a linear activation is still useful in the output layer (e.g., for regression) or when compressing neural network models.
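A quick numerical check of that claim (the layer sizes and random weights below are arbitrary choices for the demonstration): composing two layers whose activation is the identity collapses into a single linear map.

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(4, 1)                          # an arbitrary input vector
W1, b1 = np.random.randn(3, 4), np.random.randn(3, 1)
W2, b2 = np.random.randn(2, 3), np.random.randn(2, 1)

# Two "layers" with a linear (identity) activation ...
a2 = W2 @ (W1 @ x + b1) + b2

# ... equal a single linear layer with W' = W2 W1 and b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
print(np.allclose(a2, W_prime @ x + b_prime))      # True: the extra layer adds no expressive power
```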
Activation Function Plots
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

x = np.arange(-10, 10, 0.001)
y1 = 1 / (1 + np.exp(-x))                                  # sigmoid
y2 = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # tanh
y3 = np.maximum(0, x)                                      # ReLU
y4 = np.maximum(0.1 * x, x)                                # Leaky ReLU (slope 0.1 for x < 0)

plt.rcParams['figure.dpi'] = 120
plt.figure(figsize=(8, 6))
plt.subplots_adjust(wspace=0.5, hspace=0.5)

plt.subplot(221)
plt.plot(x, y1, label="Sigmoid")
plt.grid(color="gray", linestyle="--")
plt.title(r"$y=\frac{1}{1+e^{-x}}$")
plt.legend()

plt.subplot(222)
plt.plot(x, y2, label="Tanh")
plt.grid(color="gray", linestyle="--")
plt.title(r"$y=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$")
plt.legend()

plt.subplot(223)
plt.plot(x, y3, label="ReLU")
plt.grid(color="gray", linestyle="--")
plt.title(r"$y=\max(0, x)$")
plt.legend()

plt.subplot(224)
plt.plot(x, y4, label="Leaky ReLU")
plt.grid(color="gray", linestyle="--")
plt.title(r"$y=\max(0.1x, x)$")
plt.legend()
plt.show()
Derivatives of Activation Functions
x = np.arange(-10, 10, 0.001)
dy1 = y1 * (1 - y1)            # sigmoid'(x), reusing y1 from the previous cell
dy2 = 1 - y2 ** 2              # tanh'(x), reusing y2 from the previous cell
dy3 = np.where(x > 0, 1, 0)    # ReLU'
dy4 = np.where(x > 0, 1, 0.1)  # Leaky ReLU'

plt.rcParams['figure.dpi'] = 120
plt.figure(figsize=(8, 6))
plt.subplots_adjust(wspace=0.5, hspace=0.5)

plt.subplot(221)
plt.plot(x, dy1, label="Sigmoid")
plt.grid(color="gray", linestyle="--")
plt.title(r"$dy=\sigma(x)(1-\sigma(x))$")
plt.legend()

plt.subplot(222)
plt.plot(x, dy2, label="Tanh")
plt.grid(color="gray", linestyle="--")
plt.title(r"$dy=1-tanh^{2}(x)$")
plt.legend()

plt.subplot(223)
plt.plot(x, dy3, label="ReLU")
plt.grid(color="gray", linestyle="--")
plt.title(r"$dy=1$ ($x>0$), $0$ ($x\leq 0$)")
plt.legend()

plt.subplot(224)
plt.plot(x, dy4, label="Leaky ReLU")
plt.grid(color="gray", linestyle="--")
plt.title(r"$dy=1$ ($x>0$), $0.1$ ($x\leq 0$)")
plt.legend()
plt.show()
Different Choices
Recap:
Suppose we denote the first layer's parameters by $W_{1}, b_{1}$, the second layer's by $W_{2}, b_{2}$, and so on. Let $a_{i}$ be a layer's output, and $z_{i}$ the value before it passes through the activation function $g$.
According to the above, the output of the first layer is $a_{1} = g(W_{1}x + b_{1})$, the output of the second layer is $a_{2} = g(W_{2}g(W_{1}x + b_{1}) + b_{2})$, and the output of the third layer is $a_{3} = g(W_{3}g(W_{2}g(W_{1}x + b_{1}) + b_{2}) + b_{3})$.
Denote the final loss by $L$. Let's start from the third layer and use the backpropagation algorithm to derive the gradient of the loss with respect to the parameters $W_{1}$ and see what happens.
Just for simplicity, I’m going to skip over the derivation, and the result is:
$$\frac{\partial L}{\partial W_{1}} = \frac{\partial L}{\partial a_{3}} \frac{\partial a_{3}}{\partial z_{3}} W_{3} \frac{\partial a_{2}}{\partial z_{2}} W_{2} \frac{\partial a_{1}}{\partial z_{1}} \frac{\partial z_{1}}{\partial W_{1}}$$
The derivatives $\frac{\partial a_{3}}{\partial z_{3}}$, $\frac{\partial a_{2}}{\partial z_{2}}$ and $\frac{\partial a_{1}}{\partial z_{1}}$ are always the chief culprits of the vanishing gradient problem.
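To see why, note that each factor $\frac{\partial a_{i}}{\partial z_{i}} = g'(z_{i})$ is at most 0.25 when $g$ is the sigmoid, so the product of these factors shrinks rapidly with depth. A rough numerical sketch (the depth of 10 and the randomly sampled pre-activations are arbitrary assumptions):

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)            # never exceeds 0.25 (the value at z = 0)

np.random.seed(2)
z = np.random.randn(10)           # pretend pre-activations of 10 stacked sigmoid layers
factors = sigmoid_grad(z)         # one da_i/dz_i factor per layer in the chain rule
print(np.prod(factors))           # a tiny number: the gradient reaching W_1 nearly vanishes
                                  # even before multiplying by the W_i terms
```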
So choosing a proper activation function is important:
- Sigmoid: rarely used except in the output layer of a binary classification problem, because its output lies between 0 and 1.
- Tanh: usually works better than sigmoid for hidden units, because the mean of its output is closer to zero, so it centers the data better for the next layer.
- ReLU: the most commonly used default choice for hidden units.
- Leaky ReLU: often works a bit better than ReLU, since its gradient is nonzero for negative inputs.
TODO: (improve later)
Backpropagation of Neural Network
We can get the derivatives below (a NumPy sketch of the vectorized column follows the table):
| single example | m examples (vectorized) |
|---|---|
| $dz^{[2]} = a^{[2]} - y$ | $dZ^{[2]} = A^{[2]} - Y$ |
| $dW^{[2]} = dz^{[2]}a^{[1]T}$ | $dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}$ |
| $db^{[2]} = dz^{[2]}$ | $db^{[2]} = \frac{1}{m}np.sum(dZ^{[2]}, axis=1, keepdims=True)$ |
| $da^{[1]} = W^{[2]T}dz^{[2]}$ | $dA^{[1]} = W^{[2]T}dZ^{[2]}$ |
| $dz^{[1]} = W^{[2]T}dz^{[2]} * g'^{[1]}(z^{[1]})$ | $dZ^{[1]} = W^{[2]T}dZ^{[2]} * g'^{[1]}(Z^{[1]})$ |
| $dW^{[1]} = dz^{[1]}a^{[0]T}$ | $dW^{[1]} = \frac{1}{m}dZ^{[1]}A^{[0]T}$ |
| $db^{[1]} = dz^{[1]}$ | $db^{[1]} = \frac{1}{m}np.sum(dZ^{[1]}, axis=1, keepdims=True)$ |
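As a minimal sketch, the "m examples" column could be implemented in NumPy as follows; the function name `backward` is hypothetical, and it assumes the two-layer network used throughout this post with $g^{[1]} = \tanh$ (so $g'^{[1]}(Z^{[1]}) = 1 - A^{[1]2}$) and a sigmoid output.

```python
import numpy as np

def backward(X, Y, W2, A1, A2):
    """Gradients for a 2-layer network, following the 'm examples' column of the table."""
    m = X.shape[1]
    dZ2 = A2 - Y                                      # dZ^[2] = A^[2] - Y
    dW2 = (dZ2 @ A1.T) / m                            # dW^[2] = (1/m) dZ^[2] A^[1]T
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m      # db^[2]
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                # dZ^[1], using g'^[1](Z^[1]) = 1 - tanh^2 = 1 - A^[1]2
    dW1 = (dZ1 @ X.T) / m                             # dW^[1] = (1/m) dZ^[1] A^[0]T
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m      # db^[1]
    return dW1, db1, dW2, db2
```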
Random Initialization
Reminder: The general methodology to build a Neural Network is to:
1. Define the neural network structure (# of input units, # of hidden units, etc.).
2. Initialize the model’s parameters
3. Loop:
- Implement forward propagation
- Compute loss
- Implement backward propagation to get the gradients
- Update parameters (gradient descent); an end-to-end sketch of this loop follows the list
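Here is a self-contained toy run of that methodology, reusing the two-layer forward pass and the gradient formulas from above; the dataset, layer sizes, learning rate, and iteration count are all arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(3)

# 1. Define the structure (arbitrary sizes for this toy example)
n_x, n_h, n_y, m = 2, 4, 1, 200
X = np.random.randn(n_x, m)
Y = (X[0:1] * X[1:2] > 0).astype(float)   # a toy, XOR-like labelling of the inputs

# 2. Initialize the parameters randomly (see the next section)
W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))

lr = 1.0                                   # learning rate, arbitrary choice
for i in range(5000):
    # 3a. Forward propagation
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    # 3b. Cross-entropy loss
    loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    # 3c. Backward propagation (formulas from the table above)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # 3d. Gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)   # final training loss, well below the ~0.69 of random guessing for this toy data
```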
Why do we need to initialize the parameters randomly?
If we set $W^{[1]}$ to np.zeros(($n^{[1]}$, $n^{[0]}$)), then $a^{[1]}_{1}$ will equal $a^{[1]}_{2}$, and $dz^{[1]}_{1}$ will also equal $dz^{[1]}_{2}$ in backpropagation. So even after multiple iterations of gradient descent, each neuron in the layer computes the same thing as the other neurons: the hidden units in the same layer remain completely identical no matter how many times you update the parameters.
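A quick numerical check of this symmetry argument; to make the updates visibly nonzero, the weights below are initialized to a constant 0.5 rather than exactly zero (an assumption made only for the demo), and the rows of $W^{[1]}$ still never differentiate:

```python
import numpy as np

np.random.seed(4)
n0, n1, m = 3, 4, 50
X = np.random.randn(n0, m)
Y = np.random.randint(0, 2, (1, m)).astype(float)

# Symmetric initialization: every weight starts at the same constant value
W1 = np.full((n1, n0), 0.5); b1 = np.zeros((n1, 1))
W2 = np.full((1, n1), 0.5);  b2 = np.zeros((1, 1))

for _ in range(100):                                 # a few gradient-descent steps
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    W2 -= 0.1 * dZ2 @ A1.T / m;  b2 -= 0.1 * np.sum(dZ2, axis=1, keepdims=True) / m
    W1 -= 0.1 * dZ1 @ X.T / m;   b1 -= 0.1 * np.sum(dZ1, axis=1, keepdims=True) / m

print(W1)   # the weights have moved, but every row of W^[1] is still identical
```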
To avoid the problem above, we should initialize the parameters randomly. So we can set:
$$W^{[1]} = np.random.randn(n^{[1]}, n^{[0]})$$
$$b^{[1]} = np.zeros((n^{[1]}, 1))$$
We usually prefer to initialize the weights to very small random values: when we use a sigmoid or tanh activation function, small weights keep the pre-activations in the region where the slope is large, which gives faster learning. So:
$$W^{[1]} = np.random.randn(n^{[1]}, n^{[0]}) * 0.01$$
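Wrapping this up as a small code sketch: a hypothetical `initialize_parameters` helper for the two-layer network used in this post (the 0.01 scale follows the convention above; the layer sizes are parameters):

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Random initialization for a 2-layer network: small random weights break symmetry."""
    W1 = np.random.randn(n_h, n_x) * 0.01   # small values keep tanh/sigmoid in their steep region
    b1 = np.zeros((n_h, 1))                 # biases can safely start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

params = initialize_parameters(n_x=2, n_h=4, n_y=1)
print({k: v.shape for k, v in params.items()})   # W1 (4, 2), b1 (4, 1), W2 (1, 4), b2 (1, 1)
```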