Intuition
- Current trend of CNN architecture: create short paths from early layers to later layers.
- ResNet
- Highway network: The first network with more than 100 layers, bypassing paths
- Stochastic depth: Improves the training of deep residual networks by dropping layers randomly during training, which manages to train a 1202-layer ResNet
- FractalNets
- Wide filter is helpful.
- Connect all layers with each other.
- Combine features by concatenating them (ResNet combines by summation).
- DenseNet layers are very narrow (12 feature-maps per layer), resulting in less parameters
Model
- Dense connectivity: concatenate all the preceding layers:
xl=Hl([x0,x1,…,xl−1]) - Composite function:
H_l
is defined asBN + ReLU + 3x3 Conv
- Pooling and dense block
- See figure above
- Transition layer between dense blocks, consist of
BN + 1x1 Conv + 2x2 AveragePooing
- Growth rate
k
:
- The number of output feature-maps.
- The
l
-th layer will havek x (l-1) + k_0
input feature-maps (k_0
:input image channels)
- Bottleneck layers
- Introduce 1x1 Conv before 3x3 Conv to reduce number of feature-maps will improve computation efficiency.
H_l
is changed toBN + ReLU + 1x1 Conv + BN + ReLU + 3x3 Conv
1x1 Conv
reduce the input to4k
feature-maps in the experiment.
- Compression
- Reduce feature-maps number in transition layer by factor θ
Experiment
- Datasets
- CIFAR-10/100, 32x32
- Zero-padded with 4 pixels on each side
- Randomly cropped to again produce 32×32 images
- Half of the images are then horizontally mirrored
- SVHN (Street View House Numbers), 32x32
- ImageNet, 224x224, 1.2m for training, 50000 for validation, 1000 classes
- CIFAR-10/100, 32x32
- Settings: weight decay 10e-4, Nesterov momentum of 0.9 w\o dampening, learning rate 0.1 with decay scheme, dropout when no data augmentation
- Accuracy result:
- 3.46% on CIFAR-10, L=190, k=40
- 17.18% on CIFAR-100, L=190, k=40
- 1.59% on SVHN, L=100, k=24
- Capacity: the performance continues improving when L, k increase, showing the DenseNet is less prone to overfitting (???)
- Parameter efficiency.