Training and investigating Residual Nets: my own analysis

http://torch.ch/blog/2016/02/04/resnets.html

The first thing to be clear about:

The convolutional-layer branch is the part that computes the residual.


  • Is it better to put batch normalization after the addition or before the addition at the end of each residual block?

  • If batch normalization is placed after the addition, it has the effect of normalizing the output of the entire block, which could be beneficial. However, this also forces every skip connection to perturb the output. This can be problematic: there are paths that allow data to pass through several successive batch normalization layers without any other processing, and each batch normalization layer applies its own separate distortion which compounds the original input. This has a harmful effect: we found that putting batch normalization after the addition significantly hurts test error on CIFAR, which is in line with the original paper's recommendations.

  • The above result seems to suggest that it is important to avoid changing data that passes through identity connections only. We can take this philosophy one step further: should we remove the ReLU layers at the end of each residual block? ReLU layers also perturb data that flows through identity connections, but unlike batch normalization, ReLU's idempotence means that it does not matter whether data passes through one ReLU or thirty ReLUs. When we remove the ReLU layers at the end of each building block, we observe a small improvement in test performance compared to the paper's suggested ReLU placement after the addition. However, the effect is fairly minor; more exploration is needed.

  • Conclusion: the third variant works best (see the block-variant sketch after this list).

  • Alternate optimizers. When running a hyperparameter search, it can often pay off to try fancier optimization strategies than vanilla SGD with momentum. Fancier optimizers that make nuanced assumptions may improve training times, but they may instead have more difficulty training these very deep models. In our experiments, we compared SGD+momentum (as used in the original paper) with RMSprop, Adadelta, and Adagrad. Many of them appear to converge faster initially (see the training curves in the linked blog post), but ultimately SGD+momentum reaches 0.7% lower test error than the second-best strategy. (I did not fully understand this part; roughly, it is about how different optimization strategies affect the final result. See the optimizer sketch after this list.)

  • We used the scale and aspect ratio augmentation described in "Going Deeper with Convolutions" instead of the scale augmentation described in the ResNet paper. With ResNet-34, this improved top-1 validation error by about 1.2 percentage points. We also used the color augmentation described in "Some Improvements on Deep Convolutional Neural Network Based Image Classification," but found that it had a very small effect on ResNet-34. (My note: is "scale" here a parameter like 0.0078125? And what exactly is this aspect-ratio parameter, have I even heard of it? See the augmentation sketch after this list.)

  • Speed of ResNets vs GoogLeNet and VGG-A/D
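
To make the three block variants discussed above concrete, here is a minimal PyTorch sketch (the blog's experiments used Lua Torch, so this is only an illustration, not the authors' code; the channel count and argument names are my own):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block with configurable BN/ReLU placement around the addition."""

    def __init__(self, channels, bn_after_add=False, relu_after_add=True):
        super().__init__()
        self.relu_after_add = relu_after_add
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            # In the "BN after addition" variant the second BN moves past the
            # addition, so it also normalizes (and perturbs) the identity path.
            nn.Identity() if bn_after_add else nn.BatchNorm2d(channels),
        )
        self.post_bn = nn.BatchNorm2d(channels) if bn_after_add else nn.Identity()

    def forward(self, x):
        out = x + self.branch(x)      # identity shortcut + residual branch
        out = self.post_bn(out)       # active only in the "BN after addition" variant
        if self.relu_after_add:
            out = torch.relu(out)     # original placement: ReLU after the addition
        return out

# (a) original paper:         BasicBlock(64, bn_after_add=False, relu_after_add=True)
# (b) BN after the addition:  BasicBlock(64, bn_after_add=True,  relu_after_add=True)
# (c) no ReLU after addition: BasicBlock(64, bn_after_add=False, relu_after_add=False)
```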
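
For the optimizer comparison, here is a minimal sketch of how the four strategies could be set up in PyTorch; the learning rates and other hyperparameters are placeholders, not the values used in the blog:

```python
import torch

def make_optimizer(name, params, lr=0.1):
    """Return one of the optimizers compared above (hyperparameters are illustrative)."""
    if name == "sgd_momentum":
        return torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=lr)
    if name == "adadelta":
        return torch.optim.Adadelta(params)
    if name == "adagrad":
        return torch.optim.Adagrad(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")

model = torch.nn.Linear(10, 10)  # stand-in for the actual ResNet
for name in ("sgd_momentum", "rmsprop", "adadelta", "adagrad"):
    optimizer = make_optimizer(name, model.parameters())
    # ...train each with an identical schedule and compare the test-error curves...
```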
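
On the augmentation question in my note above: in this context "scale" refers to the fraction of the image area covered by the random crop (not a pixel scaling factor such as 0.0078125), and "aspect ratio" means the crop's width/height ratio is also sampled randomly before resizing. A hedged torchvision sketch of this GoogLeNet-style augmentation (the blog used Lua Torch; the color-jitter strengths here are illustrative, not the blog's settings):

```python
from torchvision import transforms

# Scale + aspect-ratio augmentation in the GoogLeNet style: sample a crop
# covering 8%-100% of the image area with an aspect ratio in [3/4, 4/3],
# then resize it to 224x224.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.RandomHorizontalFlip(),
    # Simple color augmentation; the jitter strengths are illustrative.
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```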

