Training and investigating Residual Nets: my own analysis

http://torch.ch/blog/2016/02/04/resnets.html

First, we need to be clear about the following:

The convolutional branch is the part that plays the role of the residual.
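A minimal PyTorch sketch of what this means (my own illustration; the blog itself uses Torch): the convolutional branch computes the residual F(x), and the skip connection adds the input back, so the block outputs x + F(x).

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = relu(x + F(x)): the convolutional branch F is the residual."""

    def __init__(self, channels):
        super().__init__()
        # F(x): the convolutional branch that learns the residual
        self.residual_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # the identity (skip) connection carries x unchanged; the branch adds the residual
        return self.relu(x + self.residual_branch(x))

# usage: y = BasicResidualBlock(64)(torch.randn(1, 64, 32, 32))
```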


1. 

  • Is it better to put batch normalization after the addition or before the addition at the end of each residual block? 

  • That is: should BN go before the addition (inside the residual branch) or after it?

  • If batch normalization is placed after the addition, it has the effect of normalizing the output of the entire block. This could be beneficial. However, this also forces every skip connection to perturb the output. This can be problematic: there are paths that allow data to pass through several successive batch normalization layers without any other processing. Each batch normalization layer applies its own separate distortion which compounds the original input. This has a harmful effect: we found that putting batch normalization after the addition significantly hurts test error on CIFAR, which is in line with the original paper's recommendations.

  • The above result seems to suggest that it's important to avoid changing data that passes through identity connections only. We can take this philosophy one step further: should we remove the ReLU layers at the end of each residual block? ReLU layers also perturb data that flows through identity connections, but unlike batch normalization, ReLU's idempotence means that it doesn't matter if data passes through one ReLU or thirty ReLUs. When we remove the ReLU layers at the end of each building block, we observe a small improvement in test performance compared to the paper's suggested ReLU placement after the addition. However, the effect is fairly minor; more exploration is needed.

  • Conclusion: the third arrangement works best (the placement variants are sketched in code right after this list).

  • Alternate optimizers. When running a hyperparameter search, it can often pay off to try fancier optimization strategies than vanilla SGD with momentum. Fancier optimizers that make nuanced assumptions may improve training times, but they may instead have more difficulty training these very deep models. In our experiments, we compared SGD+momentum (as used in the original paper) with RMSprop, Adadelta, and Adagrad. Many of them appear to converge faster initially (see the training curve below), but ultimately, SGD+momentum has 0.7% lower test error than the second-best strategy. (I did not fully understand this part; roughly, it is about how different descent strategies affect the final result. A setup sketch follows this list.)

  • We used the scale and aspect ratio augmentation described in "Going Deeper with Convolutions" instead of the scale augmentation described in the ResNet paper. With ResNet-34, this improved top-1 validation error by about 1.2 percentage points. We also used the color augmentation described in "Some Improvements on Deep Convolutional Neural Network Based Image Classification," but found that it had a very small effect on ResNet-34. Is "scale" a parameter like 0.0078125? And what exactly is this "aspect ratio" parameter; have I come across it before? (See the augmentation sketch after this list.)

  • Speed of ResNets vs GoogleNet and VGG-A/D
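The placement variants discussed in the bullets above, sketched in PyTorch (my own reconstruction, not the blog's Torch code; the variant numbering follows the order the bullets discuss them and may not match the blog's figure):

```python
import torch.nn as nn

def residual_branch(c):
    # the convolutional branch F(x); its internal Conv-BN-ReLU layout is the paper's default
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c),
    )

class BNAfterAddition(nn.Module):
    """Variant 1: BN after the addition. Normalizes the whole block's output,
    but also distorts the identity path; the blog found this hurts CIFAR test error."""
    def __init__(self, c):
        super().__init__()
        self.f, self.bn, self.relu = residual_branch(c), nn.BatchNorm2d(c), nn.ReLU(inplace=True)
    def forward(self, x):
        return self.relu(self.bn(x + self.f(x)))

class BNBeforeAddition(nn.Module):
    """Variant 2 (the original paper): BN stays inside the branch, ReLU after the addition."""
    def __init__(self, c):
        super().__init__()
        self.f, self.relu = residual_branch(c), nn.ReLU(inplace=True)
    def forward(self, x):
        return self.relu(x + self.f(x))

class NoReLUAfterAddition(nn.Module):
    """Variant 3: nothing after the addition, so the identity path is left untouched."""
    def __init__(self, c):
        super().__init__()
        self.f = residual_branch(c)
    def forward(self, x):
        return x + self.f(x)
```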

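A minimal sketch of how the optimizer comparison could be wired up (PyTorch, my own illustration; the learning rates and weight decay below are placeholders, not the blog's settings):

```python
import torch.nn as nn
import torch.optim as optim

def make_optimizer(name, params, lr=0.1):
    # the candidates compared in the blog post; lr and weight decay here are placeholders
    if name == "sgd":
        return optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)
    if name == "rmsprop":
        return optim.RMSprop(params, lr=lr)
    if name == "adadelta":
        return optim.Adadelta(params)
    if name == "adagrad":
        return optim.Adagrad(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")

# usage: train the same model once per optimizer and compare final test error
model = nn.Linear(10, 2)  # stand-in for a ResNet
optimizer = make_optimizer("sgd", model.parameters())
```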

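Regarding the question above: in this augmentation, "scale" is the fraction of the image area covered by the sampled crop (8% to 100% in "Going Deeper with Convolutions") and "aspect ratio" is the crop's width-to-height ratio (3/4 to 4/3); it is not a pixel-value multiplier like 0.0078125 (1/128). A hedged torchvision sketch of roughly equivalent transforms (my own choice of library; ColorJitter is only a rough stand-in for the color augmentation in the paper cited above):

```python
import torchvision.transforms as T

# Scale and aspect-ratio augmentation in the style of "Going Deeper with Convolutions":
# sample a crop covering 8%-100% of the image area, with aspect ratio in [3/4, 4/3],
# then resize it to 224x224. "scale" is an area fraction, not a pixel multiplier.
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # rough color augmentation
    T.ToTensor(),
])
```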
Research on bilevel optimization in learning and vision aims to improve the performance of learning algorithms and vision systems. In learning and vision tasks we typically face optimization problems on two levels.

The first level concerns the optimization of the learning algorithm itself: how to obtain good model parameters through a suitable learning algorithm. This usually involves defining a loss function and choosing an appropriate optimization method. Conventional optimizers, however, can struggle in high-dimensional problems and get stuck in local optima during training. Researchers have therefore explored bilevel optimization to improve learning algorithms: an inner optimization loop is introduced to further update the algorithm's hyperparameters, which allows a broader exploration of the parameter space, better model parameters, and thus better performance.

The second level concerns the optimization of the vision task itself: how to solve a concrete vision problem, such as object detection, image segmentation, or pose estimation, with image-processing and computer-vision algorithms. Traditional vision algorithms usually define a task-specific objective function and optimize it iteratively, but this can be limited by parameter choices and initial conditions. Here too, researchers have begun applying bilevel optimization: an inner optimization loop gradually adjusts algorithm hyperparameters and model parameters so that the method better fits the specific vision task.

In short, studying bilevel optimization in learning and vision aims to improve both sides: by optimizing the learning algorithm's hyperparameters and model parameters, as well as the vision task's objective function and algorithm parameters, it should lead to better results and applications in both fields.
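A toy sketch of the inner-loop/outer-loop idea described above, assuming a simple search over one hyperparameter (weight decay); all names and numbers here are hypothetical, and true bilevel methods would differentiate through the inner loop rather than just search over candidates:

```python
import torch
import torch.nn as nn

def inner_train(weight_decay, train_data, steps=100):
    """Inner level: fit model parameters with one hyperparameter (weight decay) held fixed."""
    x, y = train_data
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return model

def val_loss(model, val_data):
    x_val, y_val = val_data
    with torch.no_grad():
        return nn.functional.mse_loss(model(x_val), y_val).item()

def outer_search(train_data, val_data, candidates=(0.0, 1e-4, 1e-3, 1e-2)):
    """Outer level: pick the hyperparameter whose inner solution does best on validation data."""
    return min(candidates, key=lambda wd: val_loss(inner_train(wd, train_data), val_data))
```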