Degradation problem
Can accuracy be improved simply by stacking more layers? To answer this, the vanishing/exploding gradients problem, which severely hampers convergence, must first be addressed. It has largely been solved by normalized initialization ([1,2,3,4]) and intermediate normalization layers [5].
Once deep networks are able to converge, accuracy first saturates and then degrades rapidly as depth increases. Importantly, this degradation is not caused by overfitting: adding more layers also leads to higher training error, not just higher test error.
Residual learning
To address this problem, Microsoft proposed the deep residual learning framework. Suppose the desired underlying mapping is H(x). Instead of fitting H(x) directly, the stacked layers fit another mapping F(x) := H(x) − x, where F(x) is the residual; the original mapping is then recovered as F(x) + x.
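In the paper's notation, a building block with two weight layers can be written as follows (σ denotes the ReLU nonlinearity; biases are omitted for simplicity):

$$y = F(x, \{W_i\}) + x, \qquad F(x, \{W_1, W_2\}) = W_2\,\sigma(W_1 x)$$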
Shortcut Connections
F(x) + x can be realized as a feedforward network with shortcut connections. A shortcut connection skips one or more layers; it can take several forms, but here it is simply an identity mapping, which adds neither extra parameters nor extra computational complexity, so the whole network can still be trained end to end with SGD and backpropagation.
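A minimal sketch of such a block, written in PyTorch (an assumption for illustration; the original work did not use this framework, and the channel count and input size below are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers computing F(x), added to the input x via an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F(x): conv -> BN -> ReLU -> conv -> BN
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # Identity shortcut: F(x) + x, followed by the final ReLU.
        # The addition introduces no parameters and negligible computation.
        return F.relu(residual + x)


# Usage: the output shape matches the input shape because the shortcut is identity.
block = BasicResidualBlock(channels=64)
y = block(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```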
[1]Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
[2]X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[3]A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
[4]K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[5]S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.