CS231n 2017 Notes 7: Training Neural Networks (Part 2)

License: CC 4.0 BY-SA

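This post covers three topics. The first is fancier optimization: SGD + Momentum, Nesterov Momentum, and several other methods, along with hyperparameter tuning. The second is regularization: standard methods such as dropout, plus others such as batch normalization (BN) and DropConnect. The third is transfer learning, which reuses pretrained models.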

Overview

This section covers three topics:

Fancier optimization
Regularization
Transfer learning
The first topic is the main focus, although most of it is also covered piecemeal in Andrew Ng's and Hung-yi Lee's courses.

Fancier optimization

The lecture mentions many optimization methods, briefly listed here. A few of them are not covered in Andrew Ng's course, and the treatment here is fairly high-level.
Problems with SGD:
What if the loss changes quickly in one direction and slowly in another?
What does gradient descent do?
Very slow progress along the shallow dimension, jitter along the steep direction.
What if the loss function has a local minimum or saddle point?
Zero gradient, so gradient descent gets stuck.

SGD + Momentum

Build up “velocity” as a running mean of gradients
Rho gives “friction”; typically rho=0.9 or 0.99
The variable names here differ slightly from those in Andrew Ng's course.
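To make the "velocity" idea concrete, here is a minimal NumPy sketch comparing plain gradient descent with SGD + Momentum. The toy loss (shallow along one axis, steep along the other), the function names, and the step counts are assumptions chosen for illustration, and the gradient is exact rather than stochastic.

```python
import numpy as np

def grad(x):
    # Gradient of a hypothetical toy loss f(x) = 0.5 * (x[0]**2 + 50 * x[1]**2),
    # which is shallow along axis 0 and steep along axis 1.
    return np.array([1.0, 50.0]) * x

def sgd(x0, learning_rate=0.01, steps=200):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

def sgd_momentum(x0, learning_rate=0.01, rho=0.9, steps=200):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)              # velocity starts at zero
    for _ in range(steps):
        v = rho * v + grad(x)         # accumulate gradients; rho acts as friction
        x = x - learning_rate * v     # step along the velocity, not the raw gradient
    return x
```

With the same learning rate and number of steps, the momentum variant should end up much closer to the minimum at the origin, because the velocity keeps building up along the shallow axis.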

Nesterov Momentum

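I did not fully follow this method, so I am just noting it down for now.
The lecture mentions that the algorithm is somewhat awkward to use as stated: the gradient is evaluated at a look-ahead point rather than at the current parameters, so the loss and the gradient cannot conveniently be computed at the same point. A change of variables resolves this.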
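The change of variables can be sketched as follows. The look-ahead form evaluates the gradient at x + rho * v; substituting x_tilde = x + rho * v gives an update that only needs the gradient at the point being stored. The toy gradient and function names below are assumptions for illustration.

```python
import numpy as np

def grad(x):
    # Gradient of a hypothetical toy loss f(x) = 0.5 * (x[0]**2 + 50 * x[1]**2).
    return np.array([1.0, 50.0]) * x

def nesterov_lookahead(x0, learning_rate=0.01, rho=0.9, steps=100):
    # Original form: the gradient is taken at the look-ahead point x + rho * v.
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = rho * v - learning_rate * grad(x + rho * v)
        x = x + v
    return x, v

def nesterov_substituted(x0, learning_rate=0.01, rho=0.9, steps=100):
    # After substituting x_tilde = x + rho * v, the gradient is taken
    # at the stored variable itself.
    x_tilde = np.array(x0, dtype=float)
    v = np.zeros_like(x_tilde)
    for _ in range(steps):
        old_v = v
        v = rho * v - learning_rate * grad(x_tilde)
        x_tilde = x_tilde - rho * old_v + (1 + rho) * v
    return x_tilde, v
```

Both loops trace the same trajectory: at every step the second function's x_tilde equals the first function's x + rho * v.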

AdaGrad

Q: What happens with AdaGrad?
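Suppose there are two axes, one with large gradients and one with small gradients. Along the small-gradient axis, AdaGrad divides the step by a small number (the square root of the accumulated squared gradients), which speeds up progress in that direction. Along the large-gradient axis it divides by a large number, which damps progress there.
Q2: What happens to the step size over long time?
The step size keeps shrinking over time, because the accumulated sum of squared gradients only grows.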
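Both answers can be checked with a small NumPy sketch of the AdaGrad update. The toy gradient, the learning rate, and the returned `scales` history (the per-axis factor that multiplies the gradient) are assumptions added for illustration.

```python
import numpy as np

def grad(x):
    # Gradient of a hypothetical toy loss f(x) = 0.5 * (x[0]**2 + 50 * x[1]**2).
    return np.array([1.0, 50.0]) * x

def adagrad(x0, learning_rate=0.1, steps=100, eps=1e-7):
    x = np.array(x0, dtype=float)
    grad_squared = np.zeros_like(x)   # running SUM of squared gradients, per axis
    scales = []
    for _ in range(steps):
        dx = grad(x)
        grad_squared += dx * dx
        scale = learning_rate / (np.sqrt(grad_squared) + eps)
        x = x - scale * dx            # element-wise scaled step
        scales.append(scale.copy())
    return x, np.array(scales)
```

In this sketch the scale on the steep axis is always smaller than on the shallow axis, and neither scale can ever increase, which is the shrinking step size described in Q2.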

RMSProp

Figure (image not preserved): optimizer trajectories. Black: SGD; blue: SGD+Momentum; red: RMSProp.
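As a reminder of what RMSProp changes relative to AdaGrad, here is a minimal sketch: the running sum of squared gradients is replaced by an exponentially decaying average, so the effective step size does not shrink toward zero indefinitely. The toy gradient and the default values (`decay_rate=0.99`, `eps=1e-7`) are assumptions, not values taken from these notes.

```python
import numpy as np

def grad(x):
    # Gradient of a hypothetical toy loss f(x) = 0.5 * (x[0]**2 + 50 * x[1]**2).
    return np.array([1.0, 50.0]) * x

def rmsprop(x0, learning_rate=0.01, decay_rate=0.99, steps=300, eps=1e-7):
    x = np.array(x0, dtype=float)
    grad_squared = np.zeros_like(x)
    for _ in range(steps):
        dx = grad(x)
        # Decaying average of squared gradients (AdaGrad would use a plain sum).
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x = x - learning_rate * dx / (np.sqrt(grad_squared) + eps)
    return x
```

Because each axis is normalized by its own gradient magnitude, the steep and shallow axes take steps of similar size in this sketch.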

Adam

Bias correction for the fact that first and second moment estimates start at zero
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
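A minimal sketch of the full Adam update, combining a momentum-like first moment, an RMSProp-like second moment, and the bias correction mentioned above. The toy gradient, the function name, and `eps=1e-7` are assumptions; the defaults for `beta1`, `beta2`, and `learning_rate` are the ones quoted above.

```python
import numpy as np

def grad(x):
    # Gradient of a hypothetical toy loss f(x) = 0.5 * (x[0]**2 + 50 * x[1]**2).
    return np.array([1.0, 50.0]) * x

def adam(x0, learning_rate=1e-3, beta1=0.9, beta2=0.999, steps=200, eps=1e-7):
    x = np.array(x0, dtype=float)
    first_moment = np.zeros_like(x)
    second_moment = np.zeros_like(x)
    for t in range(1, steps + 1):
        dx = grad(x)
        first_moment = beta1 * first_moment + (1 - beta1) * dx          # momentum part
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx   # RMSProp part
        # Bias correction: both moments start at zero and are biased toward it early on.
        first_unbias = first_moment / (1 - beta1 ** t)
        second_unbias = second_moment / (1 - beta2 ** t)
        x = x - learning_rate * first_unbias / (np.sqrt(second_unbias) + eps)
    return x
```

With bias correction, the very first step has magnitude close to `learning_rate` on every axis. Without it, the first step would be roughly (1 - beta1) / sqrt(1 - beta2), about 3.2 times larger, because the second moment estimate starts near zero.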

Hyperparameter tuning

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
The learning rate is the single most important hyperparameter!
The lecture then covers quasi-Newton methods, but only very briefly. For details, see Li Hang's "Statistical Learning Methods" (《统计学习方法》).
It then mentions the BFGS algorithm.

Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank 1 updates over time (O(n^2) each).
L-BFGS (Limited memory BFGS): Does not form/store the full inverse Hessian.
L-BFGS usually works very well in full batch, deterministic mode,
i.e. if you have a single, deterministic f(x) then L-BFGS will probably work very nicely.
L-BFGS does not transfer very well to mini-batch setting.
Gives bad results. Adapting L-BFGS to large-scale, stochastic setting is an active area of research.
Recommendations:
Adam is a good default choice in most cases.
If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise).

Model Ensembles

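Hung-yi Lee's course explains this very clearly, but the treatment here is more cutting-edge, for example combining snapshots saved during a single training run into an ensemble.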
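One simple way to combine models or snapshots is to average their predicted class probabilities at test time. The sketch below assumes each model has already produced an (N, C) array of probabilities; the function name and the input layout are hypothetical.

```python
import numpy as np

def ensemble_predict(prob_list):
    # prob_list: list of (N, C) arrays of class probabilities,
    # one array per model or per training snapshot.
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    # Predicted label is the class with the highest averaged probability.
    return mean_probs.argmax(axis=1), mean_probs
```

Averaging probabilities rather than taking a majority vote keeps each model's confidence in play, so a model that is confidently right can outweigh models that are uncertain.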

Regularization

Standard regularization methods

Dropout

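The most common regularization method for neural networks, bar none.
In each forward pass, randomly set some neurons to zero. Probability of dropping is a hyperparameter; 0.5 is common.
In fully connected (FC) layers, dropout randomly sets some hidden-layer neurons to zero. Dropout can also be used in the convolution layers of a CNN, where it sets entire feature maps to zero.
Why does dropout work?
The first explanation matches the "training with your hands tied" analogy in Hung-yi Lee's course. The other explanation is:
Dropout is training a large ensemble of models (that share parameters).
Each binary mask is one model.
It can be viewed as training different sub-networks and then combining them as an ensemble.
On why activations are divided by the keep probability (equivalently, multiplied by its reciprocal) after dropout: with keep probability p, the expected activation during training is p times the activation with every unit present, so one of the two phases has to be rescaled to match the other.
The scaling can be done at training time, where it benefits from GPU parallelism, so that the test-time forward pass stays unchanged.
Dropout makes training take longer, but the resulting model is more robust.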
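A minimal sketch of this train-time scaling, often called inverted dropout. The function name, the `p_keep` parameter (probability of keeping a unit, i.e. one minus the drop probability), and the `rng` argument are assumptions for illustration.

```python
import numpy as np

def dropout_forward(x, p_keep=0.5, train=True, rng=None):
    if not train:
        # Test time: no mask and no scaling, the input passes through unchanged.
        return x
    rng = np.random.default_rng() if rng is None else rng
    # Keep each unit with probability p_keep, then divide by p_keep so the
    # expected activation matches the test-time activation.
    mask = (rng.random(x.shape) < p_keep) / p_keep
    return x * mask
```

With `p_keep=0.5`, each surviving activation is doubled during training, so the mean activation stays roughly the same as at test time.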

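The idea behind regularization

The lecture states the general pattern behind regularization: add some randomness during training so that the model does not overfit the training data, then remove that randomness at test time, which improves generalization.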
Training: Add random noise
Testing: Marginalize over the noise

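Other regularization methods

Methods that follow this pattern include batch normalization (BN) and data augmentation.
There is also a dropout-like algorithm, DropConnect, which zeroes out weights rather than activations.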
Fractional Max Pooling

Stochastic Depth
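As a rough illustration of the DropConnect idea mentioned above, the sketch below masks individual weights of a fully connected layer instead of its activations. It is a simplification under stated assumptions: one mask is shared across the whole batch, and test time simply uses the full weights with train-time rescaling, which is not necessarily the inference procedure of the original DropConnect method. All names are hypothetical.

```python
import numpy as np

def dropconnect_forward(x, W, b, p_keep=0.5, train=True, rng=None):
    # x: (N, D) inputs, W: (D, M) weights, b: (M,) biases.
    if not train:
        # Simplified test-time path: use all weights, no noise.
        return x @ W + b
    rng = np.random.default_rng() if rng is None else rng
    # Zero out individual weights (not whole activations), keeping each
    # weight with probability p_keep, then rescale to preserve the expectation.
    mask = rng.random(W.shape) < p_keep
    return x @ (W * mask) / p_keep + b
```

Compared with dropout, the random mask has the shape of the weight matrix, so each output unit sees a different random subset of its inputs.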

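Are these regularization methods used together?
Typically you start with BN and add other regularization methods only if overfitting shows up. You would not apply several methods at once from the start.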

Transfer learning

Figure (image not preserved): the familiar table again...

Summary:
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?

Find a very large dataset that has similar data, train a big ConvNet there
Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own
Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision
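One common way to "transfer learn" with a small dataset is to keep the pretrained network frozen and train only a new linear classifier on the features it produces. The sketch below assumes those features have already been extracted into an (N, D) array; it does not load any real model from the zoos listed above, and the function name and hyperparameters are hypothetical.

```python
import numpy as np

def train_new_head(features, labels, num_classes, learning_rate=0.5, steps=200):
    # features: (N, D) outputs of a frozen, pretrained network (never updated here).
    # labels:   (N,) integer class labels for the new, smaller dataset.
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        dlogits = (probs - onehot) / n                # softmax cross-entropy gradient
        W -= learning_rate * features.T @ dlogits
        b -= learning_rate * dlogits.sum(axis=0)
    return W, b
```

Only `W` and `b` are learned here. With more data, the usual next step is to also fine-tune some of the pretrained layers, typically with a smaller learning rate.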