# VGG NETWORK
Here I'd like to share my thoughts on the VGG network paper and detail some problems I've encountered. I will mainly focus on the second part of the paper, which is ConvNet Configuration.
## CONVNET CONFIGURATION
Here we have the architecture of the VGG net from the original paper:
According to the image above, we have a few different architectures, called VGG11, VGG13, VGG16, etc. The number in each name is the count of learnable layers (convolution layers and fully connected layers). From my point of view, the paper proposes two approaches in part 2 to enhance the network's performance:
First, using a stack of small convolution filters instead of a single larger one.
Second, using 1 × 1 convolution layers.
## A STACK OF SMALLER CONVOLUTION FILTERS
We can find only two different filter sizes in VGG: 3 × 3 and 1 × 1. The reason VGG uses a stack of 3 × 3 filters is fewer parameters without reducing the receptive field. A single 7 × 7 filter has a 7 × 7 receptive field, and for a stack of three 3 × 3 filters with stride 1 and padding 1, we can conclude that it still has a 7 × 7 receptive field:
| filter | receptive field size before | receptive field size after |
|---|---|---|
| 1st 3 × 3 filter | 1 | 3 |
| 2nd 3 × 3 filter | 3 | 5 |
| 3rd 3 × 3 filter | 5 | 7 |
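As a quick sanity check, the numbers in the table can be reproduced with a few lines of Python (a minimal sketch of my own; it just applies the stride-1 update rule given right after):

```python
# receptive field of a stack of three 3 x 3, stride-1 convolutions,
# computed layer by layer
r = 1  # receptive field of a single input pixel
for i in range(3):
    r += 3 - 1  # R_out = R_in + (kernel size - 1) for stride 1
    print(f"receptive field after filter {i + 1}: {r} x {r}")
# prints 3 x 3, 5 x 5, 7 x 7 -- matching the table above
```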
The formula for computing the receptive field of a stride-1 convolution is:

R_out = R_in + (kernel size − 1)

where R_in is the input receptive field size and R_out the output receptive field size. Meanwhile, the number of parameters is reduced. According to the paper:
> Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3² × C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7² × C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
The C stands for the number of channels. In VGG, a layer's output channel count is usually the same as the next layer's input channel count, so each 3 × 3 layer in the stack holds 3 × 3 × C × C weights, which is where the C² in the paper comes from. So now we know the advantage of using a stack of smaller filters: fewer parameters without sacrificing network performance.
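These counts are easy to verify. Below is a minimal sketch in PyTorch (my own example, not code from the paper); C = 64 is an arbitrary channel count, and biases are disabled so the numbers match the paper's weight-only count:

```python
import torch.nn as nn

C = 64  # arbitrary channel count, for illustration

# a stack of three 3 x 3 conv layers with ReLUs in between
stack = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)
# a single 7 x 7 conv layer with the same receptive field
single = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

stack_params = sum(p.numel() for p in stack.parameters())
single_params = sum(p.numel() for p in single.parameters())
print(stack_params, 27 * C * C)   # 110592 110592
print(single_params, 49 * C * C)  # 200704 200704
```

For C = 64 the single 7 × 7 layer needs 200704 weights versus 110592 for the stack, i.e. about 81% more, exactly as the paper states.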
## 1 × 1 CONVOLUTION FILTERS
1 × 1 convolution filters were first proposed in the paper "Network in Network" [1]. Someone might wonder why we would use a 1 × 1 filter at all, since it is just multiplication. In two dimensions they would be right: a 1 × 1 filter merely multiplies the feature map by a number. But we are not dealing with the two-dimensional case. According to the VGG architecture, we are dealing with dozens or even hundreds of channels, and then it is not just multiplication:
Imagine we have a feature map of size 10 * 10 * 100 and a filter of size 1 * 1 * 100, and let's assume the filter slides through the feature map from top-left to bottom-right:
```python
import numpy as np

featuremap = np.random.rand(10, 10, 100)  # a 10 * 10 * 100 feature map
filt = np.random.rand(1, 1, 100)          # a 1 * 1 * 100 filter

featuremap[0][0]  # an array of 100 elements: the first location the filter slides over
filt[0][0]        # the filter likewise holds an array of 100 elements

# now, let's do the dot product:
out = featuremap[0][0].dot(filt[0][0])

# the dot product above can be seen as:
# featuremap[0][0][0] * filt[0][0][0] + featuremap[0][0][1] * filt[0][0][1] + ... + featuremap[0][0][99] * filt[0][0][99]
# which is the same as:
# featuremap[0][0][0] * w0 + featuremap[0][0][1] * w1 + ... + featuremap[0][0][99] * w99
```
Do you see the pattern here? By sliding the 1 × 1 filter through the whole feature map, we get a fully connected layer at every spatial position! You can treat the filter's parameters as the weights of an fc layer, followed by an activation function (in this case, a ReLU), and we have a simple fc network inside our CNN!
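To make this concrete, here is a small PyTorch sketch (my own illustration, not code from the paper): it applies the same weights once as a 1 × 1 convolution and once as a fully connected (Linear) layer over each spatial position's channel vector, and the outputs match:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 10, 10)  # batch of 1, 100 channels, 10 x 10 spatial size

conv1x1 = nn.Conv2d(100, 100, kernel_size=1, bias=False)

# an fc layer given the same weights, applied to each position's 100 channel values
fc = nn.Linear(100, 100, bias=False)
with torch.no_grad():
    fc.weight.copy_(conv1x1.weight.view(100, 100))

out_conv = conv1x1(x)                                   # shape (1, 100, 10, 10)
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # same result via the fc layer
print(torch.allclose(out_conv, out_fc, atol=1e-6))      # True
```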
The advantage of using 1 × 1 filters, according to the paper, is:
> The incorporation of 1 × 1 conv. layers is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.
Which is to say: we can increase the non-linearity of the network, and thus enhance its performance, while adding very few parameters.
Reference:
[1]: Lin, M., Chen, Q., and Yan, S. Network in Network. In Proc. ICLR, 2014.