2021-01-25 AlexNet_dropout time cost - CSDN Blog

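This post walks through what was new in AlexNet, including ReLU activations, multi-GPU training, and LRN normalization. It shows how a deep network, through depth and a set of tricks, substantially improved ImageNet performance, and how data augmentation, Dropout, and similar techniques counter overfitting.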

0 Summary

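An epoch-making, foundational work.
So many architecture choices, parameters, and tricks: how were they all tuned? Intuition, hands-on ability, and efficiency all matter a great deal.
It showed that neural networks work, that they work very well, and that they need to be deep.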

Other blogs

A good summary of the error-rate gain from each trick.

2 Dataset

ImageNet: 15M images, 22K categories
ILSVRC: roughly 1000 images in each of 1000 categories; 1.2M training images, 50K validation images, 150K test images

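Preprocessing

Rescale each image so that its shorter side is 256 (preserving the aspect ratio), then crop the central 256×256 patch
Subtract the training-set mean from each image, pixel by pixel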
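The two preprocessing steps can be sketched in plain Python as below. This is only an illustration of the geometry: the function names are made up for this note, images are represented as nested lists of numbers, and rounding the longer side to the nearest integer is an assumption (the notes do not say how it is rounded).

```python
def resize_then_center_crop_box(width, height, target=256):
    """Return ((new_w, new_h), (left, top, right, bottom)) for one image.

    The shorter side is scaled to `target`; the crop box is the central
    target x target square of the rescaled image.
    """
    scale = target / min(width, height)
    new_w = target if width <= height else round(width * scale)
    new_h = target if height <= width else round(height * scale)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def subtract_mean(image, mean_image):
    """Subtract a per-pixel mean image (same nested-list shape as `image`)."""
    return [[p - m for p, m in zip(row, mean_row)]
            for row, mean_row in zip(image, mean_image)]


size, box = resize_then_center_crop_box(500, 375)
print(size, box)
print(subtract_mean([[3, 4], [5, 6]], [[1, 1], [2, 2]]))
```

In a real pipeline the mean image would be computed once over the whole training set and reused for validation and test images.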

Evaluation metrics: top-1 and top-5 error rates
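For reference, the top-k error is the fraction of examples whose true label is not among the k highest-scoring classes. A minimal sketch follows; the function name and the toy scores are invented for illustration.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not in the k best-scored classes."""
    wrong = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Toy example with 6 classes and 3 images.
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],  # true class 2 is ranked third
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5 is ranked last
]
labels = [1, 2, 5]
print(top_k_error(scores, labels, 1), top_k_error(scores, labels, 5))
```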

3 Architecture

8 layers: 5 convolutional layers and 3 fully connected layers.
What each trick or architectural choice contributes

3.1 ReLU

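The paper chose ReLU mainly for training speed.

Activation functions with a saturating nonlinearity, such as tanh or the logistic function, train much more slowly under gradient descent than ReLU. Experimental data: on CIFAR-10, for a 4-layer network to reach a 25% training error, the number of training epochs needed is roughly 7 vs 36, so the saturating network is 5-6 times slower.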
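One way to see why saturation slows training is to compare the local derivatives: the gradient of tanh shrinks toward zero as the input grows, while the gradient of ReLU stays at 1 for any positive input. The snippet below is just a numerical illustration of that point, not a reproduction of the CIFAR-10 experiment.

```python
import math


def tanh_grad(x):
    """d/dx tanh(x) = 1 - tanh(x)^2, which vanishes for large |x|."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    """d/dx max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: tanh'={tanh_grad(x):.6f}, relu'={relu_grad(x):.1f}")
```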

Wikipedia's Rectified Linear Units (ReLUs) article summarizes the advantages and disadvantages of the ReLU activation function.

Advantages of ReLU:
- Biological plausibility: one-sided, compared to the antisymmetry of tanh.
- Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output).
- Better gradient propagation: fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
- Efficient computation: only comparison, addition and multiplication.
- Scale-invariant: $\max(0,ax)=a\max(0,x){\text{ for }}a\geq 0$.
Disadvantages of ReLU:
- [Can be worked around] Non-differentiable at zero; however, it is differentiable anywhere else, and the value of the derivative at zero can be arbitrarily chosen to be 0 or 1.
- Not zero-centered.
- Unbounded.
- [See Leaky ReLU] Dying ReLU problem: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state and "dies". This is a form of the vanishing gradient problem. In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. This problem typically arises when the learning rate is set too high. It may be mitigated by using leaky ReLUs instead, which assign a small positive slope for x < 0; however, the performance is reduced.
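Two of the points in the list above, scale invariance and the leaky-ReLU remedy for dying units, are easy to check numerically. In this sketch the negative slope of 0.01 is an assumed, commonly used value; the notes do not specify one.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # The derivative at exactly zero is chosen to be 0 here.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    """Like ReLU, but with a small positive slope for negative inputs."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope


# Scale invariance: relu(a * x) == a * relu(x) for a >= 0.
a = 3.0
for x in (-2.0, 2.0):
    print(relu(a * x), a * relu(x))

# A unit whose input is always negative gets no gradient with ReLU,
# but still gets a small one with leaky ReLU.
print(relu_grad(-2.0), leaky_relu_grad(-2.0))
```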

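3.2 Multi-GPU training

This section is mostly about the implementation.

GPUs of the time were a constraint: a GTX 580 has only 3GB of memory. The 1.2M training examples are enough to train a large network, but that network does not fit on one GPU. Besides splitting the kernels (neurons) evenly across two GPUs, the design also reduces inter-GPU communication by letting the GPUs communicate only in certain layers.

One detail to note: most of the network's parameters sit in the first fully connected layer, 6x6x256 x4096, which is about 37.7M of the roughly 60M total. (See the analysis below.)

Using two GPUs and more neurons, compared with putting half as many neurons on a single GPU, lowers the top-1 and top-5 error rates by 1.7% and 1.2% respectively.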

3.3 Local Response Normalization

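LRN: local response normalization.

Intuitively, each activation is divided by a term that grows with the sum of the squared outputs of a few adjacent kernels (neurons) at the same spatial position. The effect is that large responses are kept and the smaller responses next to them are suppressed.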
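As a sketch of the normalization described above, the AlexNet paper defines the normalized response as

$ b^i = a^i / \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^j)^2 \right)^{\beta} $

where $a^i$ is the activation of kernel $i$ at a given position and $N$ is the number of kernels in the layer. The constants $k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$ are the values reported in the paper, not in these notes. The Python below applies this to a single spatial position; the function name is made up for illustration.

```python
def local_response_norm(activations, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize the per-kernel activations at one spatial position."""
    num = len(activations)
    half = n // 2
    out = []
    for i, a in enumerate(activations):
        lo, hi = max(0, i - half), min(num - 1, i + half)
        # Sum of squares over the neighbouring kernels, including kernel i.
        sq_sum = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * sq_sum) ** beta)
    return out


# Kernel 2 sits next to a very strong response (kernel 1); kernel 6 does not.
acts = [0.0, 100.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
normed = local_response_norm(acts)
print(normed[2], normed[6])
```

Kernels 2 and 6 have the same raw activation, but kernel 2 ends up smaller after normalization because its window contains the strong neighbour.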

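The idea comes from lateral inhibition in neuroscience. The AlexNet paper reports that it lowers the top-1 and top-5 error rates by 1.4% and 1.2% respectively.

Better alternatives such as batch normalization appeared later; see the article on the history of normalization. Note that the VGG paper reports that LRN does not help, and ResNet + BN handles the vanishing-gradient problem well, so few later architectures use LRN.

Open question: does LRN help with the vanishing-gradient problem?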

3.4 Overlapping Pooling

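Compared with ordinary max pooling, the pooling regions overlap. This lowers the top-1 and top-5 error rates by 0.4% and 0.3% respectively.

s=2, z=3, i.e. stride 2 with a window of size 3; the non-overlapping baseline is s=2, z=2. The paper observes that models with overlapping pooling are generally slightly harder to overfit during training.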
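A one-dimensional toy example makes the difference between the two settings concrete. The helper name and the input values are invented for illustration; the real layers pool over 2-D feature maps.

```python
def max_pool_1d(values, z, s):
    """Max pooling with window size z and stride s (no padding)."""
    return [max(values[i:i + z]) for i in range(0, len(values) - z + 1, s)]


def pooled_length(n, z, s):
    """Number of output positions for an input of length n."""
    return (n - z) // s + 1


values = [1, 2, 5, 4, 3, 0, 6]
print(max_pool_1d(values, z=3, s=2))  # overlapping windows share one element
print(max_pool_1d(values, z=2, s=2))  # non-overlapping windows
print(pooled_length(55, 3, 2), pooled_length(55, 2, 2))
```

Both settings produce outputs of the same length, so the comparison in the paper changes only how much each window sees, not the size of the next layer's input.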

3.5 Overall Architecture

Some other blogs cover this as well.

Layer	Type	Output size	Kernel / pooling	Notes
0-1	input	224x224x3	-	=150,528 values; better to use 227x227 directly
1	conv	-	11x11x3 x48x2 = 34848 weights	stride 4, 48 kernels on each of the 2 GPUs
-	conv1 output	55x55 x48x2	-	(227-11)/4+1=55
1	max pooling	-	s=2,z=3	(55-3)/2+1=27
1-2	conv1+pooling output	27x27 x48x2	-	-
2	conv	-	5x5x48 x128x2	padding 2+2, s=1; each kernel sees only the 48 maps on its own GPU
-	conv2 output	27x27 x128x2	-	(27-5+4)/1+1=27
2	max pooling	-	s=2,z=3	(27-3)/2+1=13
2-3	conv2+pooling output	13x13 x128x2	-	-
3	conv	-	3x3x256 x192x2	padding 1+1, s=1
3-4	conv3 output	13x13 x192x2	-	(13-3+2)/1+1=13
4	conv	-	3x3x192 x192x2	padding 1+1, s=1
-	conv4 output	13x13 x192x2	-	(13-3+2)/1+1=13
5	conv	-	3x3x192 x128x2	padding 1+1, s=1
-	conv5 output	13x13 x128x2	-	(13-3+2)/1+1=13
5	max pooling	-	s=2,z=3	(13-3)/2+1=6
-	pooling output	6x6 x128x2	-	-
6	fc	4096	6x6x256 x4096	Dropout 0.5
7	fc	4096	4096 x4096	Dropout 0.5
8	fc	1000	4096 x1000	-
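The arithmetic in the table can be checked with a few lines of Python. The sketch assumes the two-GPU grouping of the original network, where the conv2, conv4 and conv5 kernels see only the feature maps on their own GPU, and counts one bias per kernel or output unit; the dictionary layout is just a convenience for this note.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1


# Spatial sizes along the network: 227 -> 55 -> 27 -> 27 -> 13 -> 13 -> 6.
print(out_size(227, 11, stride=4), out_size(55, 3, stride=2),
      out_size(27, 5, pad=2), out_size(27, 3, stride=2),
      out_size(13, 3, pad=1), out_size(13, 3, stride=2))

# (inputs seen by one kernel or unit, number of kernels or units)
LAYERS = {
    "conv1": (11 * 11 * 3, 96),
    "conv2": (5 * 5 * 48, 256),
    "conv3": (3 * 3 * 256, 384),
    "conv4": (3 * 3 * 192, 384),
    "conv5": (3 * 3 * 192, 256),
    "fc6": (6 * 6 * 256, 4096),
    "fc7": (4096, 4096),
    "fc8": (4096, 1000),
}

weights = {name: fan_in * n for name, (fan_in, n) in LAYERS.items()}
biases = {name: n for name, (_, n) in LAYERS.items()}
total = sum(weights.values()) + sum(biases.values())

for name in LAYERS:
    print(f"{name}: {weights[name]:,} weights")
print(f"total: {total:,} parameters")
print(f"fc6 share: {weights['fc6'] / total:.1%}")
```

Under these assumptions the total comes to roughly 61M parameters, and the first fully connected layer alone accounts for more than 60% of them.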

4 Reduce Overfit

About 60M parameters in total, so the network overfits easily.

4.1 Data Augmentation

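Translation + reflection. Random 224x224 patches and their horizontal reflections are extracted from the 256x256 images, enlarging the training set by a factor of 2048. At test time, five 224x224 patches (the four corners and the center) plus their reflections are extracted, and the softmax outputs on these 10 patches are averaged. [This helps reduce overfitting.]
Altering RGB intensities. PCA is run on the RGB pixel values of the ImageNet training set, and a random combination of the principal components is then added to each training image. [This lowers the top-1 error rate by over 1%.]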
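Both augmentations can be sketched briefly. The factor of 2048 corresponds to 32 x 32 crop offsets times 2 reflections. For the color augmentation, the paper adds to every pixel the vector $[p_1, p_2, p_3][\alpha_1\lambda_1, \alpha_2\lambda_2, \alpha_3\lambda_3]^T$, where $p_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of the 3x3 RGB covariance matrix and $\alpha_i$ is drawn from a Gaussian with mean 0 and standard deviation 0.1 (the value from the paper, not from these notes). The function names and the eigen-decomposition used below are hypothetical placeholders.

```python
import random


def ten_crop_boxes(size=256, crop=224):
    """Four corner crops plus the center crop, each with and without reflection."""
    margin = size - crop
    center = margin // 2
    origins = [(0, 0), (margin, 0), (0, margin), (margin, margin), (center, center)]
    return [(left, top, left + crop, top + crop, flip)
            for flip in (False, True)
            for left, top in origins]


def pca_color_shift(eigvecs, eigvals, rng, sigma=0.1):
    """RGB offset built from the principal components; eigvecs[c][i] is
    component c of the i-th eigenvector."""
    alphas = [rng.gauss(0.0, sigma) for _ in eigvals]
    return [sum(eigvecs[c][i] * alphas[i] * eigvals[i] for i in range(3))
            for c in range(3)]


print((256 - 224) ** 2 * 2)   # crop offsets times reflections
print(len(ten_crop_boxes()))  # patches evaluated per test image

# Hypothetical eigen-decomposition, for illustration only.
EIGVECS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
EIGVALS = [0.2, 0.02, 0.004]
print(pca_color_shift(EIGVECS, EIGVALS, random.Random(0)))
```

The same offset is added to every pixel of an image, and a new set of $\alpha_i$ is drawn each time the image is used for training.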

4.2 Dropout

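One way to lower the error rate is to combine the predictions of many different models, but that is very expensive.
Dropout is another way of combining models, and it only costs about a factor of two in training time. The output of each hidden neuron is set to zero with probability 0.5; the dropped neurons take part in neither the forward pass nor back propagation.

This technique reduces complex co-adaptations between neurons.

At test time, all neurons are used and their outputs are multiplied by 0.5, which approximates averaging over a very large number of dropout networks.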
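The training-time and test-time behavior described above can be sketched on a plain list of activations. The function names are illustrative only, and this follows the scheme in the notes (scale at test time) rather than the "inverted dropout" variant that many current frameworks use, which rescales during training instead.

```python
import random


def dropout_train(activations, rng, p_drop=0.5):
    """Zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Keep every unit, but scale outputs by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10000
train_out = dropout_train(hidden, rng)
test_out = dropout_test(hidden)

# The average training-time output is close to the scaled test-time output.
print(sum(train_out) / len(train_out), sum(test_out) / len(test_out))
```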

In AlexNet, dropout is applied to the first two FC layers. [This helps reduce overfitting.]

5 Details of learning

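Optimization procedure (stochastic gradient descent):
Batch size 128
momentum 0.9
weight decay 0.0005
initial learning rate 0.01, divided by 10 when the validation error plateaus
$ v_{i+1} := 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i} $
$ w_{i+1} := w_i + v_{i+1} $
Here $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and the last term is the gradient of the objective with respect to $w$, evaluated at $w_i$ and averaged over the $i$-th batch $D_i$.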
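The update rule above translates directly into code. In this sketch, weights, momentum and the batch-averaged gradient are plain lists of floats, and the function name is made up for illustration.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update: v <- 0.9 v - 0.0005 * lr * w - lr * grad, then w <- w + v."""
    v_next = [momentum * vi - weight_decay * lr * wi - lr * gi
              for wi, vi, gi in zip(w, v, grad)]
    w_next = [wi + vn for wi, vn in zip(w, v_next)]
    return w_next, v_next


w, v = [1.0, -2.0], [0.0, 0.0]
grad = [2.0, 0.0]  # stands in for the gradient averaged over one batch
w, v = sgd_step(w, v, grad)
print(w, v)
```

Note that the weight-decay term moves the second weight toward zero even though its gradient is zero.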

Initialization:
weights: zero-mean Gaussian with standard deviation 0.01
biases of conv layers 2/4/5 and the fully connected hidden layers -> 1, providing the ReLUs with positive inputs
other biases -> 0
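As a small sketch of the initialization scheme, the helper below draws weights from a zero-mean Gaussian with standard deviation 0.01 and sets the bias per layer. The layer names, the `BIAS_INIT` table and the helper itself are hypothetical conveniences for this note; only the constants come from the list above.

```python
import random
import statistics

# Bias constant per layer: 1 for conv2/4/5 and the hidden FC layers, 0 otherwise.
BIAS_INIT = {
    "conv1": 0.0, "conv2": 1.0, "conv3": 0.0, "conv4": 1.0,
    "conv5": 1.0, "fc6": 1.0, "fc7": 1.0, "fc8": 0.0,
}


def init_layer(name, fan_in, n_out, rng, std=0.01):
    """Return (weights, biases) for one layer as nested lists."""
    weights = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(n_out)]
    biases = [BIAS_INIT[name]] * n_out
    return weights, biases


rng = random.Random(0)
w1, b1 = init_layer("conv1", 11 * 11 * 3, 96, rng)
w2, b2 = init_layer("conv2", 5 * 5 * 48, 8, rng)  # only 8 kernels, to keep this quick
print(statistics.pstdev(x for row in w1 for x in row), b1[0], b2[0])
```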

Trained for roughly 90 cycles through the 1.2M-image ImageNet training set.