Neural Networks and CNNs: An Introduction

License: CC 4.0 BY-SA

Tags: #deep-learning #pytorch #machine-learning #neural-networks

Part of the column: Machine learning in Action


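This article covers the activation functions used in deep learning, including Sigmoid, tanh, ReLU and its variants such as Leaky ReLU, PReLU, ELU, and Maxout, and discusses their characteristics, advantages, and disadvantages in fully connected and convolutional layers. It also explains how convolutional neural networks (CNNs) work, including key operations such as convolution, pooling, and padding, and their application in computer vision.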

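1. Activation functions

1.1. Commonly used in fully connected layers

1.1.1. Sigmoid function

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Sigmoid has two problems during gradient descent:
Saturated neurons can kill off the gradients, so the gradient stops propagating.
Sigmoid outputs are not zero-centered.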

1.1.2. tanh function

$f(x) = \tanh(x)$

Squashes numbers to the range [-1, 1]
Zero-centered
Still kills gradients when saturated

[Figure: graph of tanh]
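The saturation and zero-centering points above can be checked numerically. This is a minimal sketch; the helper names `sigmoid_grad` and `tanh_grad` are illustrative and not from the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); its maximum is 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; its maximum is 1 at x = 0
    return 1.0 - np.tanh(x) ** 2

z = np.array([-10.0, 0.0, 10.0])
print("sigmoid      :", sigmoid(z))       # all outputs are positive (not zero-centered)
print("sigmoid grad :", sigmoid_grad(z))  # nearly zero at both ends (saturation)
print("tanh         :", np.tanh(z))       # symmetric around zero
print("tanh grad    :", tanh_grad(z))     # also nearly zero at both ends
```

Both functions have almost no gradient for large |z|, which is why a saturated unit passes very little signal backwards.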

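1.2. Commonly used in convolutional layers

ReLU function (Rectified Linear Unit)
$f(x) = \max(0, x)$
Converges quickly, and the gradient is simple to compute.
However, it still kills off half of the gradients: the gradient is zero for every negative input.
Be careful with your learning rate.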

1.3. Other activation functions

1.3.1. Dead ReLU and active ReLU

1.3.2. Leaky ReLU

$f(x) = \max(0.01x, x)$

Does not saturate
Computationally efficient
Converges much faster than sigmoid/tanh in practice
Will not kill off gradients
It can equivalently be written piecewise: $f(x) = \begin{cases} \alpha x, & x < 0 \\ x, & x \geq 0 \end{cases}$ where $\alpha$ is a small number (0.01 above).

1.3.3. Parametric Rectifier (PReLU)

$f(x) = \max(\alpha x, x)$, where $\alpha$ is learned during training rather than fixed.

1.3.4. Exponential Linear Units (ELU)

$f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \leq 0 \end{cases}$

All the benefits of ReLU
Closer to zero-mean outputs
Compared with Leaky ReLU, the negative saturation regime adds some robustness to noise
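The ReLU variants above differ only in how they treat negative inputs. A minimal NumPy sketch, assuming `alpha = 0.01` for Leaky ReLU and `alpha = 1.0` for ELU (these defaults are illustrative choices):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x); with a learned alpha this is the PReLU form
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # Clamp the exponent's argument so exp() is never evaluated on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # negatives become 0
print(leaky_relu(x))  # negatives are scaled by alpha
print(elu(x))         # negatives saturate smoothly towards -alpha
```

For very negative inputs ELU flattens out at `-alpha`, which is the negative saturation regime mentioned above, while Leaky ReLU keeps decreasing linearly.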

1.3.5. Maxout

$f(x) = \max(W_{1}^{T}x + b_{1},\; W_{2}^{T}x + b_{2})$

ReLU and Leaky ReLU are special cases of Maxout.
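To see why ReLU and Leaky ReLU are special cases, here is a small sketch. It assumes both linear pieces of Maxout receive the same input `x`; the function name `maxout` and the random test data are illustrative.

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Element-wise maximum of two linear maps of the same input
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # 4 samples, 3 features
W = rng.normal(size=(3, 2))
b = rng.normal(size=2)
z = x @ W + b

# ReLU: the first piece is identically zero
relu_out = np.maximum(0.0, z)
maxout_as_relu = maxout(x, np.zeros_like(W), np.zeros_like(b), W, b)

# Leaky ReLU: the first piece is the second piece scaled by alpha
alpha = 0.01
leaky_out = np.maximum(alpha * z, z)
maxout_as_leaky = maxout(x, alpha * W, alpha * b, W, b)

print(np.allclose(relu_out, maxout_as_relu), np.allclose(leaky_out, maxout_as_leaky))
```

The price of this generality is that Maxout doubles the number of parameters per unit.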

2. CNN

2.1. Computer vision

A CNN compares an image piece by piece. These small pieces are called features.

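2.2. Convolution

Convolution is the operation of taking the inner product (element-wise multiplication followed by a sum) between the image (the data in each data window) and a filter matrix (a fixed set of weights: because each neuron's weights are fixed, it can be viewed as a constant filter).

In the figure below, the left part is the original input data, the middle part is the filter, and the right part is the two-dimensional output.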

$(I * K)(i, j) = \sum_{m}\sum_{n} I(m, n)\, K(i-m, j-n)$
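The double sum can be implemented directly. This is a minimal sketch, assuming zero-based indices and treating kernel entries outside the kernel's extent as zero (a "full" convolution); the name `conv2d_full` is illustrative.

```python
import numpy as np

def conv2d_full(I, K):
    """Evaluate sum_m sum_n I[m, n] * K[i - m, j - n] for every output index (i, j)."""
    H, W = I.shape
    kh, kw = K.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(H):
                for n in range(W):
                    # Only terms whose kernel index falls inside K contribute
                    if 0 <= i - m < kh and 0 <= j - n < kw:
                        out[i, j] += I[m, n] * K[i - m, j - n]
    return out

I = np.array([[1.0, 2.0],
              [3.0, 4.0]])
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])
print(conv2d_full(I, K))
```

Note that the indices `i - m, j - n` flip the kernel. Deep learning libraries typically slide the kernel without flipping it (cross-correlation); because the kernel values are learned, the distinction rarely matters in practice.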

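One operation (one layer) uses multiple convolution kernels, producing several feature maps at that scale.
Multiple layers extract different features at different scales.

Because of the first point above, even if the first layer's input image has only one channel, the inputs of all later layers are multi-channel. Our convolution kernels are therefore multi-channel as well. That is, both the input image and the kernel gain a channel dimension, and the convolution operation in a convolution layer becomes:
$(I * K_{c})(i, j) = \sum_{c}\sum_{m}\sum_{n} I(m, n, c)\, K_{c}(i-m, j-n, c)$
This concludes the feedforward calculation of convolution and its corresponding formula.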

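The most important question in the convolution operation is how to determine the values of the convolution kernel.
Backpropagation (BP) tells us how to optimize the kernel values with supervised learning, so that we can find the convolution kernel (feature) that performs best on the task at hand.
The class that implements our convolution layer should therefore also include a backward method that computes the derivatives for backpropagation.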
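A minimal sketch of such a class follows. It is deliberately restricted: single channel, stride 1, no padding, no bias, and it slides the kernel without flipping it. The class name `Conv2D` and its interface are hypothetical, not the author's implementation.

```python
import numpy as np

class Conv2D:
    """Single-channel convolution layer with forward and backward passes."""

    def __init__(self, kernel):
        self.w = np.asarray(kernel, dtype=float)

    def forward(self, x):
        self.x = np.asarray(x, dtype=float)  # cached for the backward pass
        kh, kw = self.w.shape
        oh, ow = self.x.shape[0] - kh + 1, self.x.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(self.x[i:i + kh, j:j + kw] * self.w)
        return out

    def backward(self, dout):
        """Given dL/d(out), return (dL/dx, dL/dw)."""
        kh, kw = self.w.shape
        dx = np.zeros_like(self.x)
        dw = np.zeros_like(self.w)
        for i in range(dout.shape[0]):
            for j in range(dout.shape[1]):
                # Each output element depends on one window of x and on all of w
                dw += dout[i, j] * self.x[i:i + kh, j:j + kw]
                dx[i:i + kh, j:j + kw] += dout[i, j] * self.w
        return dx, dw

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))
layer = Conv2D(rng.normal(size=(3, 3)))
out = layer.forward(x)
dx, dw = layer.backward(np.ones_like(out))  # gradient of L = out.sum()
print(out.shape, dx.shape, dw.shape)
```

A gradient-descent step would then update the kernel with `layer.w -= learning_rate * dw`, and `dx` is what gets passed back to the previous layer.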

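2.3. Convolution on images

The input is the data in a region of a given size (width*height); taking its inner product with the filter yields new two-dimensional data.

Basically, the left is the image input and the middle is the filter. Different filters produce different outputs, such as color intensity or contours. In other words, to extract different features from an image, use different filters, each extracting the specific information you want: color intensity or contours.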

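2.4. Animated convolution (GIF)

In a CNN, after each filter computation the data window moves on, until all of the data has been covered.

depth: the number of neurons, which determines the depth of the output and equals the number of filters.
stride: the step size of the sliding window, which determines how many steps it takes to reach the edge.
zero-padding: adding several rings of zeros around the border so that the window, moving from its initial position in units of the stride, lands exactly on the end position. In plain terms, it makes the total length divisible by the stride.

[GIF: animated convolution]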

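In the figure below, the parameters are:

depth = 2
stride = 2
zero-padding = 1

The two filters are then each slid across the array to compute the convolution.
The left is the input (7*7*3: 7*7 is the image's width and height in pixels, and 3 is the R, G, B color channels).
The middle shows the 2 filters.
The right is the result.
As the window on the left slides, the filter is convolved with different local data.
The data window slides, so the input it sees keeps changing, but the filter never changes: this is the parameter (weight) sharing mechanism of CNNs.
If the input is an m*m matrix, the filter is n*n, and the stride is k, the output has side length (m-n)/k + 1, which must be an integer; otherwise the filter won't fit.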
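The setup above can be sketched in NumPy. Two assumptions are made here because the text does not state them: the filters are 3*3, and the 7*7*3 volume already includes the one-pixel zero border. The function names and the random integer data are illustrative.

```python
import numpy as np

def conv_output_size(m, n, k):
    """Side length (m - n) / k + 1 for an m*m input, n*n filter, and stride k."""
    if (m - n) % k != 0:
        raise ValueError("filter does not fit: (m - n) must be divisible by the stride")
    return (m - n) // k + 1

def conv_forward(x, filters, stride):
    """x: (H, W, C); filters: (F, n, n, C). Returns an array of shape (H_out, W_out, F)."""
    H, W, C = x.shape
    F, n, _, _ = filters.shape
    oh = conv_output_size(H, n, stride)
    ow = conv_output_size(W, n, stride)
    out = np.zeros((oh, ow, F))
    for f in range(F):
        for i in range(oh):
            for j in range(ow):
                r, c = i * stride, j * stride
                # Weight sharing: the same filters[f] is applied at every window position
                out[i, j, f] = np.sum(x[r:r + n, c:c + n, :] * filters[f])
    return out

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(7, 7, 3)).astype(float)             # padded input volume
filters = rng.integers(-1, 2, size=(2, 3, 3, 3)).astype(float)   # depth = 2
out = conv_forward(x, filters, stride=2)
print(out.shape)  # (3, 3, 2)
```

With stride 3 the same input would raise an error, since (7 - 3) / 3 + 1 is not an integer.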

2.5. Pooling layer

Pooling basically takes the average or the maximum over a region.

The figure below shows max pooling, which is more commonly used than average pooling.

[Figure: max pooling]

Makes the representation smaller and more manageable
Operates over each activation map independently
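A minimal max pooling sketch, assuming a 2*2 window with stride 2 (common defaults, not stated in the text) and an input laid out as (height, width, channels):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """x: (H, W, C). Each channel (activation map) is pooled independently."""
    H, W, C = x.shape
    oh = (H - size) // stride + 1
    ow = (W - size) // stride + 1
    out = np.zeros((oh, ow, C))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            # Maximum over the spatial window, taken separately for each channel
            out[i, j] = x[r:r + size, c:c + size].max(axis=(0, 1))
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)[:, :, None]  # a single channel
pooled = max_pool2d(x)
print(pooled[:, :, 0])  # rows: [6, 8] and [3, 4]
```

The 4*4 map shrinks to 2*2, and no parameters are learned in this layer.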

2.6. Padding in practice

In practice, we usually add a padding border to the input matrix, and normally the border is filled with zeros.
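Zero padding is a one-liner with `np.pad`. This small sketch pads a 3*3 input with one ring of zeros; the 3*3 filter size in the last lines is an assumption used to show the effect on the output size.

```python
import numpy as np

x = np.arange(1, 10, dtype=float).reshape(3, 3)

# One ring of zeros on every side
padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (5, 5)

# With a 3*3 filter and stride 1, the padded input yields an output
# of the same size as the unpadded input: (5 - 3) / 1 + 1 = 3
filter_size, stride = 3, 1
out_size = (padded.shape[0] - filter_size) // stride + 1
print(out_size)  # 3
```

Without padding, the same filter would shrink the 3*3 input to a single value.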

3. Pre-work and the corresponding process

3.1. Weight initialisation

import numpy as np

W = 0.01 * np.random.randn(D, H)  # D inputs, H hidden units
# This works for small networks, but causes problems in deeper networks.

We should not initialise all weights to zero, because we want our neurons to do different things.
There is another method of initialising weights that has proven practical: calibrating the variances with $\frac{1}{\sqrt{n}}$.

w = np.random.randn(n) / np.sqrt(n)  # n is the number of inputs
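The difference between the two schemes shows up when activations are pushed through several layers. The following is a rough experiment, assuming a 10-layer tanh network with 500 units per layer; these sizes, the seed, and the function names are arbitrary illustrative choices.

```python
import numpy as np

def small_init(rng, n_in, n_out):
    return 0.01 * rng.normal(size=(n_in, n_out))

def scaled_init(rng, n_in, n_out):
    # Variance calibrated with 1 / sqrt(n), where n is the number of inputs
    return rng.normal(size=(n_in, n_out)) / np.sqrt(n_in)

def activation_stds(init, n_layers=10, width=500, seed=0):
    """Standard deviation of the activations after each tanh layer."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(1000, width))
    stds = []
    for _ in range(n_layers):
        h = np.tanh(h @ init(rng, width, width))
        stds.append(h.std())
    return stds

small_stds = activation_stds(small_init)
scaled_stds = activation_stds(scaled_init)
print("0.01 * randn     :", small_stds[-1])
print("randn / sqrt(n)  :", scaled_stds[-1])
```

With the small constant scale the activations collapse towards zero in the deeper layers, so the gradients on the weights vanish as well; with the calibrated scale they stay at a usable magnitude.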

3.2. Batch normalization

Input:
$x: N \times D$
Learnable params:
$\gamma, \beta: D$
Intermediates:
$\mu, \sigma: D$
$\hat{x}: N \times D$
Output:
$y: N \times D$
Update:
$\mu_{j} = \frac{1}{N}\sum_{i=1}^{N}x_{i, j}$
$\sigma_{j}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{i, j} - \mu_{j})^2$
$\hat{x}_{i, j} = \frac{x_{i, j} - \mu_{j}}{\sqrt{\sigma_{j}^2 + \varepsilon}}$
$y_{i, j} = \gamma_{j}\, \hat{x}_{i, j} + \beta_{j}$
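The four update equations map directly onto NumPy. This sketch covers only the training-time forward pass; the value `eps=1e-5` and the function name are assumptions, and the running statistics used at test time are omitted.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D); gamma, beta: (D,). Returns y of shape (N, D)."""
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance (1/N normalisation)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalise each feature
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 4))    # features with mean ~5 and std ~3
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))        # approximately 0 and 1 per feature
```

With `gamma = 1` and `beta = 0` the output is simply the normalised input; during training the network learns whatever scale and shift suit the next layer.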

3.3. Learning rate

3.4. Hyperparameter search

grid search
random search
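The two strategies can be contrasted on a toy problem. In this sketch `validation_score` is a hypothetical stand-in for "train a model and measure validation performance"; its peak location, the search ranges, and the seed are arbitrary assumptions.

```python
import itertools
import math
import random

def validation_score(learning_rate, reg):
    """Hypothetical score that peaks at learning_rate = 1e-3, reg = 1e-4."""
    return -((math.log10(learning_rate) + 3) ** 2 + (math.log10(reg) + 4) ** 2)

# Grid search: evaluate every combination of a few hand-picked values
grid = list(itertools.product([1e-4, 1e-3, 1e-2], [1e-5, 1e-4, 1e-3]))
best_grid = max(grid, key=lambda p: validation_score(*p))

# Random search: the same number of trials, sampled log-uniformly from the same ranges
rng = random.Random(0)
random_trials = [(10 ** rng.uniform(-4, -2), 10 ** rng.uniform(-5, -3)) for _ in grid]
best_random = max(random_trials, key=lambda p: validation_score(*p))

print("grid  :", best_grid)
print("random:", best_random)
```

The grid tries only three distinct values per hyperparameter in nine trials, while random search tries nine; this is the argument usually given for random search when some hyperparameters matter much more than others.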

3.5. Optimization

3.5.1. SGD

$x_{t+1}= x_{t}\, -\,\alpha\nabla f(x_{t})$

while True:
    dx = compute_gradient(x)
    x -= learning_rate * dx  # step against the gradient
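The loop above leaves `compute_gradient` undefined. A runnable sketch, assuming the toy objective f(x) = 0.5 * ||x||^2 (whose gradient is x itself) and a fixed number of iterations instead of `while True`:

```python
import numpy as np

def compute_gradient(x):
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2 (illustrative assumption)."""
    return x

x = np.array([3.0, -2.0])
learning_rate = 0.1
for _ in range(100):
    dx = compute_gradient(x)
    x -= learning_rate * dx  # move against the gradient

print(x)  # close to the minimum at the origin
```

Each step multiplies x by (1 - learning_rate), so the iterate shrinks geometrically towards the minimum.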

3.5.2. SGD + Momentum

$v_{t+1}=\rho v_{t} \, +\, \nabla f(x_{t})$
$x_{t+1} = x_{t}\, -\, \alpha v_{t+1}$

Build up "velocity" as a running mean of gradients.
Rho gives "friction"; typically rho = 0.9 or 0.99.

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx          # accumulate velocity
    x -= learning_rate * vx     # step along the negative velocity

3.5.3. Nesterov Momentum

$v_{t+1}= \rho v_{t} - \alpha \nabla f(x_{t} + \rho v_{t})$
$x_{t+1} = x_{t} \, +\, v_{t+1}$
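The two update equations translate into the following loop. This is a sketch on an assumed toy objective f(x) = 0.5 * (x_0^2 + 10 x_1^2); the function name `grad_f` and the values of `rho` and `learning_rate` are illustrative.

```python
import numpy as np

def grad_f(x):
    """Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)."""
    return np.array([1.0, 10.0]) * x

x = np.array([3.0, -2.0])
v = np.zeros_like(x)
rho, learning_rate = 0.9, 0.01
for _ in range(500):
    # Evaluate the gradient at the look-ahead point x + rho * v, not at x
    v = rho * v - learning_rate * grad_f(x + rho * v)
    x = x + v

print(x)  # close to the minimum at the origin
```

The only difference from plain momentum is where the gradient is evaluated: at the point the velocity is about to carry us to, rather than at the current position.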

3.5.4. AdaGrad

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx  # running sum of squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.
Not so commonly used in practice.

3.5.5. RMSProp

grad_squared = 0
while True:
    dx = compute_gradient(x)
    # leaky running average instead of AdaGrad's ever-growing sum
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

SGD + Momentum can do better than RMSProp, which in turn behaves a little differently from, and can do better than, plain SGD.
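The practical difference between AdaGrad and RMSProp lies in the accumulator. This sketch feeds both a constant gradient, an artificial assumption made only to isolate that effect; `decay_rate = 0.99` is likewise an assumed value.

```python
import numpy as np

decay_rate = 0.99
dx = np.array([1.0, 10.0])        # constant gradient, for illustration only
adagrad_sq = np.zeros_like(dx)
rmsprop_sq = np.zeros_like(dx)
for _ in range(1000):
    adagrad_sq += dx * dx                                             # sum keeps growing
    rmsprop_sq = decay_rate * rmsprop_sq + (1 - decay_rate) * dx * dx  # leaky average

# Size of the update per unit learning rate
adagrad_step = dx / (np.sqrt(adagrad_sq) + 1e-7)
rmsprop_step = dx / (np.sqrt(rmsprop_sq) + 1e-7)
print("AdaGrad:", adagrad_step)   # about 1 / sqrt(1000) in both dimensions
print("RMSProp:", rmsprop_step)   # about 1 in both dimensions
```

Both methods equalise the step across dimensions with very different gradient scales, but AdaGrad's step keeps shrinking as the sum grows, while RMSProp's step stabilises.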

3.5.6. Adam

first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):  # start at 1 so the bias correction never divides by zero
    dx = compute_gradient(x)
    # Momentum
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    # Bias correction
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    # AdaGrad / RMSProp; 1e-7 avoids division by zero
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

Because second_moment is initialised to zero, it is still very close to zero after the first update.
The bias correction compensates for the fact that the first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a good starting point for many models.
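The effect of the bias correction is easiest to see on the very first step. This is a sketch, assuming a toy quadratic gradient; the function `adam` and its `bias_correction` switch are illustrative, not a library API.

```python
import numpy as np

def adam(compute_gradient, x, num_iterations, learning_rate=1e-3,
         beta1=0.9, beta2=0.999, eps=1e-7, bias_correction=True):
    x = np.array(x, dtype=float)
    first_moment = np.zeros_like(x)
    second_moment = np.zeros_like(x)
    for t in range(1, num_iterations + 1):
        dx = compute_gradient(x)
        first_moment = beta1 * first_moment + (1 - beta1) * dx
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
        if bias_correction:
            m = first_moment / (1 - beta1 ** t)
            v = second_moment / (1 - beta2 ** t)
        else:
            m, v = first_moment, second_moment
        x -= learning_rate * m / (np.sqrt(v) + eps)
    return x

def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)."""
    return np.array([1.0, 10.0]) * x

x0 = np.array([3.0, -2.0])
step_corrected = np.abs(adam(grad, x0, 1) - x0)
step_raw = np.abs(adam(grad, x0, 1, bias_correction=False) - x0)
print("first step with correction   :", step_corrected)  # about learning_rate
print("first step without correction:", step_raw)        # about 3.16 * learning_rate
```

With the correction, the first step has magnitude close to `learning_rate` in every dimension regardless of the gradient scale. Without it, the near-zero second moment inflates the first step by a factor of 0.1 / sqrt(0.001).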