# VGG NETWORK
Here I'd like to share my thoughts on the VGG network paper and detail some problems I've encountered. I will mainly focus on the second part of the paper, which is ConvNet Configuration.
## CONVNET CONFIGURATION
Here we have the architecture of the VGG net from the original paper:
According to the image above, we have a few different architectures, called VGG11, VGG13, VGG16, etc. The number in each name is the count of learnable layers (convolution layers and fully connected layers). From my point of view, the paper proposes two approaches in part 2 to enhance the network's performance:
First, using a stack of small convolution filters instead of a single larger one.
Second, using 1 × 1 convolution layers.
## A STACK OF SMALLER CONVOLUTION FILTERS
We can find only two different filter sizes in VGG: 3 × 3 and 1 × 1. The reason VGG uses a stack of 3 × 3 filters is fewer parameters without reducing the receptive field. A single 7 × 7 filter has a 7 × 7 receptive field, and for a stack of three 3 × 3 filters with stride 1 and padding 1, we can conclude that it still has a 7 × 7 receptive field:
| filter | receptive field size before | receptive field size after |
|---|---|---|
| 1st 3 × 3 filter | 1 | 3 |
| 2nd 3 × 3 filter | 3 | 5 |
| 3rd 3 × 3 filter | 5 | 7 |
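As a quick sanity check, the numbers in the table can be reproduced with a few lines of Python (a minimal sketch of my own; it just applies the stride-1 update rule given right after):

```python
# receptive field of a stack of three 3 x 3, stride-1 convolutions,
# computed layer by layer
r = 1  # receptive field of a single input pixel
for i in range(3):
    r += 3 - 1  # R_out = R_in + (kernel size - 1) for stride 1
    print(f"receptive field after filter {i + 1}: {r} x {r}")
# prints 3 x 3, 5 x 5, 7 x 7 -- matching the table above
```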
The formula for computing the receptive field of a stride-1 convolution is:

R_out = R_in + (kernel size − 1)

where R_in is the input receptive field size and R_out the output receptive field size. Meanwhile, the number of parameters is reduced. According to the paper:
> Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3² × C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7² × C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
The C stands for the number of channels. In VGG, a layer's output channel count is usually the same as the next layer's input channel count, so each 3 × 3 layer in the stack holds 3 × 3 × C × C weights, which is where the C² in the paper comes from. So now we know the advantage of using a stack of smaller filters: fewer parameters without sacrificing network performance.
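These counts are easy to verify. Below is a minimal sketch in PyTorch (my own example, not code from the paper); C = 64 is an arbitrary channel count, and biases are disabled so the numbers match the paper's weight-only count:

```python
import torch.nn as nn

C = 64  # arbitrary channel count, for illustration

# a stack of three 3 x 3 conv layers with ReLUs in between
stack = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)
# a single 7 x 7 conv layer with the same receptive field
single = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

stack_params = sum(p.numel() for p in stack.parameters())
single_params = sum(p.numel() for p in single.parameters())
print(stack_params, 27 * C * C)   # 110592 110592
print(single_params, 49 * C * C)  # 200704 200704
```

For C = 64 the single 7 × 7 layer needs 200704 weights versus 110592 for the stack, i.e. about 81% more, exactly as the paper states.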
## 1 × 1 CONVOLUTION FILTERS
1 × 1 convolution filters were first proposed in the paper "Network in Network" [1]. Someone might wonder why we would use a 1 × 1 filter at all, since it is just multiplication. In two dimensions they would be right: a 1 × 1 filter merely multiplies the feature map by a number. But we are not dealing with the two-dimensional case. According to the VGG architecture, we are dealing with dozens or even hundreds of channels, and then it is not just multiplication:
Imagine we have a feature map of size 10 * 10 * 100 and a filter of size 1 * 1 * 100, and let's assume the filter slides through the feature map from top-left to bottom-right:
```python
import numpy as np

featuremap = np.random.rand(10, 10, 100)  # a 10 * 10 * 100 feature map
filt = np.random.rand(1, 1, 100)          # a 1 * 1 * 100 filter

featuremap[0][0]  # an array of 100 elements: the first location the filter slides over
filt[0][0]        # the filter likewise holds an array of 100 elements

# now, let's do the dot product:
out = featuremap[0][0].dot(filt[0][0])

# the dot product above can be seen as:
# featuremap[0][0][0] * filt[0][0][0] + featuremap[0][0][1] * filt[0][0][1] + ... + featuremap[0][0][99] * filt[0][0][99]
# which is the same as:
# featuremap[0][0][0] * w0 + featuremap[0][0][1] * w1 + ... + featuremap[0][0][99] * w99
```
Do you see the pattern here? By sliding the 1 × 1 filter through the whole feature map, we get a fully connected layer at every spatial position! You can treat the filter's parameters as the weights of an fc layer, followed by an activation function (in this case, a ReLU), and we have a simple fc network inside our CNN!
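To make this concrete, here is a small PyTorch sketch (my own illustration, not code from the paper): it applies the same weights once as a 1 × 1 convolution and once as a fully connected (Linear) layer over each spatial position's channel vector, and the outputs match:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 10, 10)  # batch of 1, 100 channels, 10 x 10 spatial size

conv1x1 = nn.Conv2d(100, 100, kernel_size=1, bias=False)

# an fc layer given the same weights, applied to each position's 100 channel values
fc = nn.Linear(100, 100, bias=False)
with torch.no_grad():
    fc.weight.copy_(conv1x1.weight.view(100, 100))

out_conv = conv1x1(x)                                   # shape (1, 100, 10, 10)
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # same result via the fc layer
print(torch.allclose(out_conv, out_fc, atol=1e-6))      # True
```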
The advantage of using 1 × 1 filters, according to the paper, is:
> The incorporation of 1 × 1 conv. layers is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers.
Which is to say: we can increase the non-linearity of the network, and thus enhance its performance, while adding very few parameters.
Reference:
[1]: Lin, M., Chen, Q., and Yan, S. Network in Network. In Proc. ICLR, 2014.