VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION——VGG
Authors: Karen Simonyan & Andrew Zisserman
Original paper: arXiv:1409.1556
I. Features: what is new in this network
1. Very deep networks for large-scale image recognition: deeper models trained on a larger dataset.
2. Small receptive fields: smaller (3×3) convolution kernels.
A stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field.
This decreases the number of parameters: a single 7×7 layer has 7²C² = 49C² weights, while a stack of three 3×3 layers has 3·(3²C²) = 27C² (see the parameter-count sketch after this list).
3. The incorporation of 1×1 conv. layers is a way to increase the nonlinearity of the decision function without affecting the receptive fields of the conv. layers; 1×1 conv layers were found to improve results.
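To sanity-check the parameter arithmetic above, here is a minimal PyTorch sketch (the channel count C = 64 is an arbitrary choice for illustration):

```python
import torch.nn as nn

C = 64  # arbitrary channel count, for illustration only

# One 7x7 conv vs. a stack of three 3x3 convs: same 7x7 effective
# receptive field, but fewer parameters for the stack.
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
stack_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(single_7x7))  # 49*C*C = 200704
print(n_params(stack_3x3))   # 27*C*C = 110592
```

The stack also interleaves two extra ReLUs, which is exactly the added nonlinearity the paper credits for part of the improvement.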
II. Architecture
● input: 224×224 RGB images; the only preprocessing is subtracting the mean RGB value (computed over the training set) from each pixel
● small receptive fields: 3×3;
● 1×1 convolution filters are also used (in configuration C);
● conv stride: 1, with padding 1 so spatial resolution is preserved;
● five max-pooling layers: 2×2 pixel window with stride 2 (non-overlapping pooling).
● the configuration of the fully connected layers is the same in all networks: each network ends with three FC layers (4096, 4096, 1000);
● all hidden layers are equipped with ReLU; none (except the A-LRN configuration) contain Local Response Normalization (LRN). Experiments showed that adding LRN gives no noticeable improvement.
Comparison with AlexNet: VGG keeps AlexNet's overall conv-layers + 3-FC-layers layout, but uses smaller kernels, more layers, and more channels. (figure: the AlexNet architecture)
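For concreteness, a minimal PyTorch sketch of configuration D (VGG-16) that follows the bullet points above; this is an illustrative reimplementation, not the authors' code:

```python
import torch
import torch.nn as nn

# Configuration D (VGG-16): numbers are output channels of 3x3 convs,
# 'M' is a 2x2 max-pool with stride 2.
CFG_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg):
    layers, in_ch = [], 3  # RGB input
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = make_features(CFG_D)
        # The same three FC layers across all configurations.
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):      # x: (N, 3, 224, 224)
        x = self.features(x)   # -> (N, 512, 7, 7) after five 2x2 poolings
        return self.classifier(torch.flatten(x, 1))
```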
III. Training
1. Training is carried out by optimizing the multinomial logistic regression objective (i.e. the standard softmax cross-entropy loss) using mini-batch gradient descent with momentum, based on back-propagation (LeCun et al., 1989).
batch size: 256; momentum: 0.9; weight decay: L2 penalty with multiplier 5·10⁻⁴
dropout regularization (ratio 0.5) for the first two fully-connected layers
learning rate: initially 10⁻², then decreased by a factor of 10 whenever the validation-set accuracy stopped improving (these settings map directly onto the optimizer sketch below)
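A sketch of these hyper-parameters in PyTorch (torchvision's vgg16 is used here only as a stand-in model):

```python
import torch
from torchvision.models import vgg16

model = vgg16()  # stand-in network; any nn.Module works the same way

# SGD with momentum and L2 weight decay, as listed above.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

# Divide the learning rate by 10 when validation accuracy plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)

# In the training loop, after each validation pass:
#   scheduler.step(val_accuracy)
```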
2. Weight initialization
Training began with configuration A, which is shallow enough to be trained from random initialization. When training the deeper architectures, the first four convolutional layers and the last three fully-connected layers were initialized with the corresponding layers of net A (the intermediate layers were initialized randomly).
The paper also notes that it is possible to initialize the weights without pre-training by using the random initialization procedure of Glorot & Bengio (2010), though it does not elaborate further.
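The Glorot & Bengio (2010) procedure is what PyTorch exposes as Xavier initialization; a minimal sketch (again with torchvision's vgg16 as a stand-in):

```python
import torch.nn as nn
from torchvision.models import vgg16

def glorot_init(m):
    # Glorot/Xavier initialization scales each layer's weights by its
    # fan-in and fan-out, keeping activation magnitudes stable through
    # deep stacks -- which is why it removes the need for pre-training.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = vgg16()
model.apply(glorot_init)  # applies glorot_init to every submodule
```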
3. Training image size (S: the smallest side of an isotropically-rescaled training image)
Input images are first rescaled during preprocessing; there are two schemes, single-scale and multi-scale:
1) single-scale training: S = 256 or 384
2) multi-scale training: S ∈ [Smin, Smax]
Each time an image is fed in, it is rescaled so that its shortest side S is drawn uniformly at random from [256, 512] (scale jittering); a random 224×224 crop is then taken, as in the sketch below.
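A sketch of this preprocessing with torchvision transforms (the paper's full augmentation also includes an RGB colour shift, omitted here):

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

class RandomShortestSide:
    """Isotropically rescale so the shortest side equals a random S
    drawn from [s_min, s_max] -- the paper's scale jittering."""
    def __init__(self, s_min=256, s_max=512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        S = random.randint(self.s_min, self.s_max)  # fresh S per image
        return TF.resize(img, S)  # int size: shortest side -> S

train_tf = transforms.Compose([
    RandomShortestSide(256, 512),       # multi-scale: S ~ U[256, 512]
    transforms.RandomCrop(224),         # random 224x224 crop
    transforms.RandomHorizontalFlip(),  # also used in the paper
    transforms.ToTensor(),
])
```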
IV. Testing
Pipeline: the input image is rescaled (shortest side set to a test scale Q), the network is applied densely over the whole image, and the resulting class score map is spatially averaged (see the sketch below).
Multi-crop evaluation is also possible: multiple 224×224 crops are taken from each image and their scores averaged.
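Dense evaluation works because at test time the three FC layers are converted to convolutions (the first to a 7×7 conv, the last two to 1×1 convs), making the network fully convolutional; a sketch of the conversion and the spatial averaging (torchvision's vgg16 as a stand-in, whose classifier has the same 4096-4096-1000 layout):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

def to_fully_conv(model):
    """Turn the three FC layers into 7x7 / 1x1 / 1x1 convolutions so the
    net can slide over whole, uncropped images of any size >= 224."""
    fc1, fc2, fc3 = model.classifier[0], model.classifier[3], model.classifier[6]
    conv1 = nn.Conv2d(512, 4096, kernel_size=7)
    conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
    conv1.bias.data = fc1.bias.data
    conv2 = nn.Conv2d(4096, 4096, kernel_size=1)
    conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
    conv2.bias.data = fc2.bias.data
    conv3 = nn.Conv2d(4096, 1000, kernel_size=1)
    conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
    conv3.bias.data = fc3.bias.data
    return nn.Sequential(model.features,
                         conv1, nn.ReLU(inplace=True),
                         conv2, nn.ReLU(inplace=True),
                         conv3)

@torch.no_grad()
def dense_predict(fcn, image):
    # image: (1, 3, H, W), shortest side rescaled to the test scale Q
    score_map = fcn(image)               # (1, 1000, h, w) class score map
    return score_map.mean(dim=(-2, -1))  # spatial average -> (1, 1000)

fcn = to_fully_conv(vgg16()).eval()
```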
V. Evaluation (results)
A vs. A-LRN: little difference, so adding LRN has no obvious benefit.
Deeper networks perform better; on this dataset accuracy saturates at 19 layers (VGG-19).
C outperforms B: adding 1×1 conv layers improves results.
D outperforms C: 3×3 convs work better than 1×1 convs, i.e. capturing spatial context matters.
Multi-scale (scale-jittered) training and testing beats single-scale.
Combining multi-crop and dense evaluation gives the best results.