Paper Notes 2: VGG--Very Deep Convolutional Networks for Large-Scale Image Recognition
Architecture
input-224×224 RGB image
preprocess-subtract the mean RGB value, computed on the training set, from each pixel
stride-fixed to 1 pixel
padding-the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv.
Max-pooling-performed over a 2×2 pixel window, with stride 2
All hidden layers are equipped with the rectification non-linearity (ReLU).
No Local Response Normalization (LRN) is used→it does not improve performance
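A minimal sketch of the repeated conv-ReLU-pool pattern described above (3×3 conv with stride 1 and padding 1, then 2×2 max-pooling with stride 2), assuming PyTorch; only the first two VGG stages are shown, not the full network.

```python
import torch
import torch.nn as nn

# Two VGG-style stages: 3x3 conv (stride 1, padding 1) + ReLU, then 2x2 max-pool with stride 2.
# Padding 1 keeps the spatial resolution; only the pooling halves it.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 224, 224)  # 224x224 RGB input
print(features(x).shape)         # torch.Size([1, 128, 56, 56]): halved twice by the pooling
```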
- 3×3 conv.
a stack of two 3×3 conv. layers has an effective receptive field of 5×5
three such layers have a 7×7 effective receptive field
why a stack of three 3×3 conv. layers instead of a single 7×7 layer?
① incorporate three non-linear rectification layers instead of a single one→make the decision function more discriminative
② decrease the number of parameters
three-layer 3×3 conv. stack with C channels, parameters→ $3(3^2C^2)=27C^2$
a single 7×7 conv. layer, parameters→ $7^2C^2=49C^2$
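A quick numerical check of both points, assuming PyTorch with bias terms disabled so the counts match $27C^2$ vs $49C^2$; the receptive-field growth follows the stride-1 rule (each extra 3×3 layer adds 2 pixels).

```python
import torch.nn as nn

C = 256  # example channel width

stack_3x3 = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1, bias=False) for _ in range(3)])
single_7x7 = nn.Conv2d(C, C, 7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stack_3x3), 27 * C ** 2)    # 1769472 1769472
print(count(single_7x7), 49 * C ** 2)   # 3211264 3211264

# Effective receptive field of stacked stride-1 3x3 convs grows by 2 per layer: 3 -> 5 -> 7.
r = 1
for _ in range(3):
    r += 2
print(r)  # 7
```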
- 1×1 conv.
increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers→the additional non-linearity is introduced by the rectification function
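A tiny illustration of that point, assuming PyTorch: a 1×1 conv followed by ReLU adds a per-pixel non-linearity while leaving the spatial resolution, and hence the receptive field, unchanged.

```python
import torch
import torch.nn as nn

# A 1x1 conv mixes channels at each spatial position; with ReLU it adds a
# non-linearity without enlarging the receptive field.
block = nn.Sequential(nn.Conv2d(512, 512, kernel_size=1), nn.ReLU(inplace=True))

x = torch.randn(1, 512, 28, 28)
print(block(x).shape)  # torch.Size([1, 512, 28, 28]): spatial size unchanged
```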
Training
- details
batch size-256
momentum-0.9
weight decay-0.0005
dropout ratio-0.5, for the first two fully-connected layers
learning rate-0.01, decreased by a factor of 10 when the validation set accuracy stopped improving; in total the rate was decreased 3 times, and learning was stopped after 370K iterations (74 epochs)
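These hyper-parameters map directly onto a standard SGD setup; a minimal sketch assuming PyTorch, with the classifier below standing in for the fully-connected part of a VGG-style net (the batch size of 256 would be set on the data loader).

```python
import torch.nn as nn
import torch.optim as optim

# Dropout (p=0.5) after the first two fully-connected layers, as listed above.
classifier = nn.Sequential(
    nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),
)

# Mini-batch SGD with momentum 0.9, weight decay 5e-4, initial learning rate 0.01.
optimizer = optim.SGD(classifier.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)

# Divide the learning rate by 10 whenever the validation accuracy stops improving.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)

# Training-loop sketch: loss.backward(); optimizer.step(); optimizer.zero_grad()
# After each validation pass: scheduler.step(val_accuracy)
```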
- initialization
began with training configuration A, which is shallow enough to be trained with random initialisation
when training deeper configurations, the first four conv. layers and the last three fully-connected layers were initialised with the layers of net A (the intermediate layers were initialised randomly)
random initialization→sample the weights from a normal distribution with zero mean and 0.01 variance; biases initialised to 0
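A minimal sketch of that random initialisation, assuming PyTorch; note a variance of 0.01 corresponds to a standard deviation of 0.1 (`net_A` is a hypothetical name for configuration A).

```python
import math
import torch.nn as nn

def init_weights(m):
    # Weights ~ N(0, variance 1e-2), i.e. std = sqrt(1e-2) = 0.1; biases set to 0.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=math.sqrt(1e-2))
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# net_A.apply(init_weights)  # only net A is fully randomly initialised;
#                            # deeper nets copy their first/last layers from A
```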
- augment
① randomly crop from rescaled training images (one crop per image per SGD iteration)
② random horizontal flipping
③ random RGB colour shift (same as in AlexNet)
- rescale
S: smallest side of an isotropically-rescaled training image
two approaches for setting S
① fix S (256 or 384)→single-scale training
② randomly sample S from a certain range $[S_{min}, S_{max}]$ (the paper uses [256, 512])→multi-scale training
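Putting the augmentation and rescale steps together, a sketch of a per-image training transform assuming torchvision; mean-RGB subtraction and the AlexNet-style colour shift are omitted for brevity.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as F

S_MIN, S_MAX = 256, 512  # multi-scale training range

def train_transform(img):
    # Isotropically rescale so the smallest side equals a randomly sampled S (multi-scale training).
    S = random.randint(S_MIN, S_MAX)
    img = F.resize(img, S)
    # Random 224x224 crop and random horizontal flip; the AlexNet-style RGB shift is omitted.
    img = transforms.RandomCrop(224)(img)
    img = transforms.RandomHorizontalFlip()(img)
    return F.to_tensor(img)
```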
Testing
- rescale
Q: smallest side of an isotropically-rescaled image, not necessarily equal to the training scale S
input image size may differ→FC layers converted to conv. layers
first FC layer→7×7 conv. layer
the last two FC layers→1×1 conv. layers
why 7×7?
padding makes sure that the spatial resolution is preserved after each conv. layer
so the resolution is reduced only by max-pooling, specifically halved each time
e.g. configuration A has 5 max-pool layers→$224/2^5=7$, so the output of the conv. stack is 7×7×512
for the first FC layer, each input channel of the 7×7×512 feature map contributes 7×7=49 weights per output unit→to convert it to a conv. layer that reuses the same parameters, the kernel size needs to be 7×7
no FC layer→the input image size does not need to be 224×224→apply the net to the whole (uncropped) image
result→a class score map whose number of channels equals the number of classes, with a variable spatial resolution dependent on the input image size
spatially averaged (sum-pooled)→obtain a fixed-size vector of class scores for the image
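A sketch of this conversion and dense evaluation, assuming PyTorch; `fc1`, `fc2`, `fc3` are hypothetical stand-ins for the three trained fully-connected layers.

```python
import torch
import torch.nn as nn

# Hypothetical trained FC layers of a VGG-style classifier.
fc1 = nn.Linear(7 * 7 * 512, 4096)
fc2 = nn.Linear(4096, 4096)
fc3 = nn.Linear(4096, 1000)

# Re-interpret the FC weights as conv kernels: 7x7 for the first layer, 1x1 for the last two.
conv1 = nn.Conv2d(512, 4096, kernel_size=7)
conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
conv1.bias.data = fc1.bias.data
conv2 = nn.Conv2d(4096, 4096, kernel_size=1)
conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
conv2.bias.data = fc2.bias.data
conv3 = nn.Conv2d(4096, 1000, kernel_size=1)
conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
conv3.bias.data = fc3.bias.data

head = nn.Sequential(conv1, nn.ReLU(inplace=True), conv2, nn.ReLU(inplace=True), conv3)

# On a larger-than-224 input the conv. stack output is bigger than 7x7,
# so the head produces a spatial class map instead of a single score vector.
feat = torch.randn(1, 512, 9, 12)    # e.g. from a 288x384 input (288/32 = 9, 384/32 = 12)
class_map = head(feat)               # shape: (1, 1000, 3, 6)
scores = class_map.mean(dim=(2, 3))  # spatially average -> fixed-size (1, 1000) score vector
print(class_map.shape, scores.shape)
```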
- augment
rescaling + horizontal flipping; the resulting class posteriors are averaged→the final scores
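A brief sketch of the flip part of this averaging, assuming PyTorch; `model` is a hypothetical classifier returning a vector of class scores per image.

```python
import torch

def predict(model, img):
    # Average the soft-max class posteriors of the original and horizontally flipped image.
    with torch.no_grad():
        p = torch.softmax(model(img), dim=1)
        p_flip = torch.softmax(model(torch.flip(img, dims=[3])), dim=1)  # flip the width axis
    return (p + p_flip) / 2
```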
Conclusion
confirm the importance of depth in visual representations
References
K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.