Compelling advantages
- alleviate the vanishing-gradient problem
- strengthen feature propagation
- encourage feature reuse
- substantially reduce the number of parameters (no need to relearn redundant feature-maps)
- dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes
- easy to train
Contributions
- to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other
- we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them.
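The contrast is easy to see in a few lines of PyTorch (assumed here purely for illustration; the tensor names and shapes are made up):

```python
import torch

# Feature-maps from three earlier layers, each with 12 channels
# (batch, channels, height, width); shapes are illustrative.
x0 = torch.randn(1, 12, 32, 32)
x1 = torch.randn(1, 12, 32, 32)
x2 = torch.randn(1, 12, 32, 32)

# ResNet-style combination: element-wise summation keeps the
# channel count fixed at 12.
summed = x0 + x1 + x2  # shape: (1, 12, 32, 32)

# DenseNet-style combination: concatenation along the channel
# dimension preserves every feature-map, so the next layer sees
# all of them (12 * 3 = 36 channels here).
concatenated = torch.cat([x0, x1, x2], dim=1)  # shape: (1, 36, 32, 32)

print(summed.shape, concatenated.shape)
```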
Architecture
Methods
- Identity function: in ResNets, combining the identity function with the layer output by summation may impede the information flow in the network
- Dense connectivity: each layer receives the feature-maps of all preceding layers as input (see the sketches after this list)
- Composite function: the lth composite layer applies BN -> ReLU -> 3×3 Conv
- Pooling layers: transition layers between dense blocks, a 1×1 conv followed by 2×2 average pooling
- Growth rate: if each layer produces k feature-maps, the lth layer has k0 + k × (l − 1) input feature-maps
- Bottleneck layers: a 1×1 convolution can be introduced as a bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency
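A minimal PyTorch sketch of a dense block built from these pieces; the class names (`BottleneckLayer`, `DenseBlock`), the 4×k bottleneck width, and the hyperparameters in the usage lines are assumptions for illustration, not values taken from these notes:

```python
import torch
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """Composite function with a bottleneck:
    BN -> ReLU -> 1x1 Conv (bottleneck) -> BN -> ReLU -> 3x3 Conv."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        inter_channels = 4 * growth_rate  # assumed bottleneck width
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(inter_channels)
        self.conv2 = nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        # Dense connectivity: concatenate the k new feature-maps onto the input.
        return torch.cat([x, out], dim=1)

class DenseBlock(nn.Module):
    """Stack of composite layers; the lth layer receives
    k0 + k * (l - 1) input channels."""
    def __init__(self, num_layers: int, k0: int, growth_rate: int):
        super().__init__()
        self.layers = nn.Sequential(*[
            BottleneckLayer(k0 + growth_rate * l, growth_rate)
            for l in range(num_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

# Growth-rate arithmetic: with k0 = 24 input channels and k = 12,
# a 6-layer block ends with 24 + 12 * 6 = 96 channels.
block = DenseBlock(num_layers=6, k0=24, growth_rate=12)
x = torch.randn(1, 24, 32, 32)
print(block(x).shape)  # torch.Size([1, 96, 32, 32])
```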
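And a matching sketch of the transition (pooling) layer between dense blocks; halving the channel count in the example is an assumption of this sketch, not something these notes specify:

```python
import torch
import torch.nn as nn

class TransitionLayer(nn.Module):
    """Transition between dense blocks: 1x1 conv -> 2x2 average pooling.
    Halves the spatial resolution and (here) the channel count."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(x))

# Usage: shrink the 96-channel output of the block above to 48 channels
# and halve the 32x32 feature-maps to 16x16.
trans = TransitionLayer(96, 48)
print(trans(torch.randn(1, 96, 32, 32)).shape)  # torch.Size([1, 48, 16, 16])
```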
Experiments
Others
- ResNet: each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. ResNets [11] make this information preservation explicit through additive identity transformations.
- stochastic depth was proposed as a way to successfully train a 1202-layer ResNet [13]. Stochastic depth improves the training of deep residual networks by dropping layers randomly during training.
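A simplified sketch of the stochastic-depth idea (PyTorch assumed; `StochasticDepthBlock` and the fixed survival probability are illustrative, and the linearly decaying survival schedule from [13] is omitted):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly skipped during training
    (kept with survival probability p); at test time it always runs,
    with its output scaled by p."""
    def __init__(self, block: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.block = block
        self.p = survival_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            if torch.rand(1).item() < self.p:
                return x + self.block(x)   # keep the residual branch
            return x                       # drop it: identity only
        return x + self.p * self.block(x)  # expected-value scaling at test time

# Usage with a hypothetical 3x3 conv residual branch:
layer = StochasticDepthBlock(nn.Conv2d(16, 16, 3, padding=1), survival_prob=0.8)
print(layer(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```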