Dongyang's study notes. Persistence is victory!
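Research significance
- Ushered in the era of small convolution kernels: 3×3 kernels became the mainstream choice in later models
- Served as the backbone for a wide range of vision tasks: classification, localisation, detection and segmentation have all been attempted with VGG as the backbone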
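Network architecture
The VGG network configurations are as follows:
The configuration table (Table 1 of the paper) lists six configurations: A, A-LRN, B, C, D and E (D is VGG16 and E is VGG19).
Similarities and differences among the configurations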
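Similarities:
- 5 max-pooling layers (each one halves the resolution)
- After each max-pool the number of feature-map channels doubles, until it reaches the maximum of 512
- 3 FC layers produce the classification output
- Several conv layers are stacked between max-pools to extract and abstract features
Why start from 11 layers? (I don't know the exact reason; the paper does cite an earlier 11-layer network.)
Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition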
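Differences:
A: 11 weight layers (8 conv + 3 FC)
A-LRN: A plus one LRN layer
B: adds one 3×3 conv layer to each of blocks 1 and 2
C: adds one 1×1 conv layer to each of blocks 3, 4 and 5; the gain over B suggests that extra non-linearity helps (every added conv layer comes with a ReLU)
D: replaces the 1×1 conv layers in blocks 3, 4 and 5 with 3×3 conv layers
E: adds one more 3×3 conv layer to each of blocks 3, 4 and 5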
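Increasing the depth does not greatly increase the number of parameters
The parameter counts (Table 2 of the paper) range from 133M for A to 144M for E; most of them sit in the three FC layers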
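The counts can be reproduced by hand. The following is a minimal sketch, assuming 3×3 convs with biases, a 7×7×512 feature map entering the first FC layer and 1000 output classes; `CFGS` and `count_params` are names made up for this illustration.

```python
# Channel plans for configurations A (VGG11), D (VGG16) and E (VGG19).
# An integer is a 3x3 conv with that many output channels; "M" is a 2x2 max-pool.
CFGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}


def count_params(cfg, num_classes=1000, in_channels=3, kernel=3):
    """Count weights and biases of the conv stack plus the three FC layers."""
    total = 0
    for v in cfg:
        if v == "M":
            continue  # pooling layers have no parameters
        total += kernel * kernel * in_channels * v + v  # weights + biases
        in_channels = v
    # Classifier: (512 * 7 * 7) -> 4096 -> 4096 -> num_classes
    fc_dims = [in_channels * 7 * 7, 4096, 4096, num_classes]
    for fan_in, fan_out in zip(fc_dims, fc_dims[1:]):
        total += fan_in * fan_out + fan_out
    return total


for name, cfg in CFGS.items():
    print(name, f"{count_params(cfg) / 1e6:.1f}M")
```

Going from 11 to 19 weight layers adds only about 11M parameters, because the FC layers (about 124M) are shared by every configuration.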
Schematic of the VGG16 architecture
Official PyTorch (torchvision) implementation of VGG
import torch
import torch.nn as nn
from torch.hub import load_state_dict_from_url

# Note: model_urls (architecture name -> checkpoint URL) is defined in the
# torchvision source file and is not reproduced in these notes; it is only
# needed when pretrained=True.


class VGG(nn.Module):
    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features  # conv/pool stack built by make_layers
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # keep the batch dimension
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


def make_layers(cfg, batch_norm=False):
    # An integer adds a 3x3 conv (+ optional BN) + ReLU; 'M' adds a 2x2 max-pool.
    layers = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)


# Configurations A, B, D, E of the paper (A-LRN and C are not included here).
cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}


def _vgg(arch, cfg, batch_norm, pretrained, progress, **kwargs):
    if pretrained:
        kwargs['init_weights'] = False  # weights are overwritten by the checkpoint
    model = VGG(make_layers(cfgs[cfg], batch_norm=batch_norm), **kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls[arch],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model


def vgg11(pretrained=False, progress=True, **kwargs):
    r"""VGG 11-layer model (configuration "A") from
    `"Very Deep Convolutional Networks For Large-Scale Image Recognition" <https://arxiv.org/pdf/1409.1556.pdf>`_

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _vgg('vgg11', 'A', False, pretrained, progress, **kwargs)
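Characteristics of VGG
- Stacks of 3×3 kernels replace large kernels (7×7, 11×11)
- More non-linearity, which makes the decision function more discriminative (every 3×3 conv is followed by a ReLU)
- Fewer parameters to train (small kernels)
- A stack of 3×3 kernels can be seen as a regularisation of a 7×7 kernel, forcing the 7×7 filter to decompose into 3×3 filters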
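Assume that both the input and the output have C channels.
Parameters of one 7×7 conv layer: $7 \times 7 \times C \times C = 49C^2$
Parameters of three stacked 3×3 conv layers: $3 \times (3 \times 3 \times C \times C) = 27C^2$
Reduction: $(49 - 27) / 49 \approx 45\%$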
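The arithmetic above, together with the receptive-field argument behind it, can be checked with a short sketch. It assumes stride-1 convolutions and ignores biases; the function names are invented for this illustration.

```python
def stacked_conv_params(kernel, layers, channels):
    """Weights (no biases) in `layers` stacked kernel x kernel convs, C channels in and out."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 kernel x kernel convs."""
    return 1 + layers * (kernel - 1)


C = 512
single_7x7 = stacked_conv_params(7, 1, C)
three_3x3 = stacked_conv_params(3, 3, C)

saving = 1 - three_3x3 / single_7x7   # fraction saved by the 3x3 stack
extra = single_7x7 / three_3x3 - 1    # how much more the 7x7 layer needs

print(receptive_field(3, 2), receptive_field(3, 3))  # same field as 5x5 and 7x7
print(f"saving: {saving:.1%}, 7x7 needs {extra:.1%} more")
```

The two percentages describe the same ratio from opposite sides: the stack saves about 45% relative to the 7×7 layer, while the 7×7 layer needs about 81% more than the stack.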
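Model C adds 1×1 conv layers to model B. (C outperforms B, so why not add 1×1 conv layers to D and E as well? The paper's single-scale results give an answer: D, which uses 3×3 convs where C uses 1×1, outperforms C, so capturing spatial context matters in addition to the extra non-linearity.)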
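Training details
Data augmentation
The paper uses two kinds of data augmentation:
- Spatial:
  - Isotropically rescale the image so that its shorter side equals S (aspect ratio preserved)
  - Crop a 224×224 region at a random position
  - Random horizontal flip
- Colour:
  - Randomly shift the RGB channel values to perturb the colour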
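Scale jittering
S can be set in two ways: a fixed value (256 or 384), or a random value sampled for each image from a range [S_min, S_max] (scale jittering; the paper uses [256, 512])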
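The geometry of the spatial augmentation with scale jittering can be sketched without any image library. This is an illustration only: `sample_training_crop` is a hypothetical helper that returns the resized dimensions, the crop offset and the flip decision, leaving the actual pixel operations and the colour shift to whatever image library is in use.

```python
import random


def sample_training_crop(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Return (new_w, new_h, left, top, flip) for one augmented training sample.

    Setting s_min == s_max gives the fixed-S variant.
    """
    s = rng.randint(s_min, s_max)        # training scale S (inclusive range)
    scale = s / min(width, height)       # shorter side becomes S
    new_w = max(crop, round(width * scale))
    new_h = max(crop, round(height * scale))
    left = rng.randint(0, new_w - crop)  # random crop position
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5            # random horizontal flip
    return new_w, new_h, left, top, flip


rng = random.Random(0)  # seeded only to make the example repeatable
print(sample_training_crop(500, 375, rng=rng))                        # jittered S
print(sample_training_crop(500, 375, s_min=256, s_max=256, rng=rng))  # fixed S = 256
```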
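Weight initialisation
The paper notes that deep networks are sensitive to initialisation:
(3.1) The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets.
- When going deeper, initialise from a shallower network
  - B, C, D and E are initialised from model A
- For multi-scale training, initialise from a model trained at a smaller scale
  - The S=384 model is initialised from the S=256 model
  - The S=[256, 512] model is initialised from the S=384 model
(Open question: what is Xavier initialisation?)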
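On the Xavier question: the paper later mentions that the random initialisation of Glorot & Bengio (2010), often called Xavier initialisation, makes pre-training unnecessary. A minimal sketch of its uniform variant follows, written in plain Python for a fully connected layer; `xavier_uniform` is an illustrative name here, not a library call.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_out x fan_in weight matrix from U(-a, a),
    with a = sqrt(6 / (fan_in + fan_out)).

    The resulting weight variance is 2 / (fan_in + fan_out), chosen so that
    activations and gradients keep a similar variance from layer to layer.
    """
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


weights = xavier_uniform(300, 100, rng=random.Random(0))
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(f"empirical variance {variance:.5f}, target {2 / 400:.5f}")
```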
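Testing details
Multi-scale testing
The image is isotropically rescaled so that its shorter side equals Q; predictions are made at three values of Q and averaged
Case 1, fixed S:
Q = [S-32, S, S+32]
Case 2, jittered S in [S_min, S_max]:
Q = [S_min, 0.5*(S_min + S_max), S_max]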
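The two rules for choosing Q can be written as one small helper. This is a sketch; the name `evaluation_scales` and the convention of passing either an int or a `(s_min, s_max)` tuple are assumptions made for this illustration.

```python
def evaluation_scales(s_train):
    """Return the three test scales Q for a training scale S.

    s_train is an int for fixed S, or a (s_min, s_max) tuple for jittered S.
    """
    if isinstance(s_train, int):
        return [s_train - 32, s_train, s_train + 32]
    s_min, s_max = s_train
    return [s_min, (s_min + s_max) // 2, s_max]


print(evaluation_scales(256))         # [224, 256, 288]
print(evaluation_scales((256, 512)))  # [256, 384, 512]
```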
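Dense testing (dense evaluation)
Convert the FC layers to convolutions, making the network fully convolutional so that it accepts input images of any size
- The fully convolutional network outputs an N×N×1000 class score map
- The score map is averaged over its spatial dimensions (sum-pooling) to give a 1×1000 output vector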
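To see where N comes from, the sketch below computes the score-map size for a square input and performs the spatial averaging on a nested list. It assumes five 2×2 max-pools, convs that preserve resolution, and the paper's conversion of the first FC layer into a 7×7 conv (the other two become 1×1 convs); both function names are hypothetical.

```python
def score_map_size(image_side, num_pools=5, fc_kernel=7):
    """Side length N of the class score map for a square input image."""
    feat = image_side // (2 ** num_pools)  # each max-pool halves the resolution
    return feat - fc_kernel + 1            # 7x7 conv without padding


def spatial_average(score_map):
    """Average an N x N x K score map over its spatial positions.

    score_map[i][j] is the list of K class scores at position (i, j).
    """
    cells = [cell for row in score_map for cell in row]
    num_classes = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells)
            for k in range(num_classes)]


print(score_map_size(224))  # 1: the ordinary single-crop case
print(score_map_size(384))  # 6: a 6 x 6 x 1000 score map
print(spatial_average([[[1.0, 0.0], [3.0, 2.0]]]))  # [2.0, 1.0]
```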
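Multi-crop testing
Following AlexNet and GoogLeNet, the image is rescaled to 3 sizes and 50 crops of 224×224 are taken at each size: a 5×5 regular grid with 2 flips (original and horizontally flipped), 50 = 5 × 5 × 2, for 150 crops in total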
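The crop layout can be enumerated as follows. This is a sketch under an assumption: the notes only say "5×5 regular grid", so spacing the crop origins evenly from edge to edge is my choice, and `grid_crops` is an illustrative name.

```python
def grid_crops(width, height, crop=224, grid=5):
    """Return (left, top, flip) for a grid x grid layout of crops, each with both flips."""
    xs = [round(i * (width - crop) / (grid - 1)) for i in range(grid)]
    ys = [round(j * (height - crop) / (grid - 1)) for j in range(grid)]
    return [(x, y, flip) for y in ys for x in xs for flip in (False, True)]


crops = grid_crops(256, 256)
print(len(crops))  # 50 crops at one scale
print(sum(len(grid_crops(q, q)) for q in (256, 384, 512)))  # 150 over three scales
```

Each crop needs its own forward pass, which is why multi-crop evaluation costs far more than dense evaluation.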
Multi-crop & dense
Average the soft-max outputs of the two methods.
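Result analysis
Single-scale evaluation
- Error decreases as depth increases and saturates when the model reaches 19 layers
- Adding 1×1 convs improves performance (compare models C and B)
- Scale jittering at training time improves performance (in the paper's single-scale results, models trained with jittered S beat those trained with fixed S)
- Replacing each pair of 3×3 convs in model B with a single 5×5 conv raises the top-1 error by 7%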
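Multi-scale evaluation
Comparing the multi-scale results (Table 4 of the paper) with the single-scale ones shows that scale jittering at test time improves performance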
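Model ensemble (what is a model ensemble? The outputs of several models are combined by averaging their soft-max class posteriors)
The ILSVRC submission was an ensemble of 7 models
After the submission, an ensemble of only the two best models was evaluated, given here as model / training S / test Q:
D/[256, 512]/256, 384, 512
E/[256, 512]/256, 384, 512
Combining multi-crop and dense evaluation gives the best result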
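Averaging soft-max posteriors is simple enough to sketch in plain Python. The example below uses made-up logits for two hypothetical models and three classes; the same averaging can combine predictions across test scales or across the multi-crop and dense outputs.

```python
import math


def softmax(logits):
    """Numerically stable soft-max over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_posteriors(per_model_logits):
    """Average the soft-max class posteriors of several models."""
    probs = [softmax(logits) for logits in per_model_logits]
    num_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(num_classes)]


# Made-up logits: the two models disagree on the top class.
out = ensemble_posteriors([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print([round(p, 3) for p in out])
```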
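Key points / contributions of the paper
- Stack small kernels to deepen the network (two 3×3 convs replace one 5×5, three 3×3 convs replace one 7×7)
- Scale jittering at training time
- Multi-scale evaluation and dense + multi-crop at test time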
Thoughts and outlook
- Small kernels act somewhat like a regularisation of large kernels and greatly reduce the parameter count (three 3×3 conv layers need about 45% fewer parameters than one 7×7 layer; put the other way round, the 7×7 layer needs 81% more)
  This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected (ReLU) in between) (2.3 Discussion)
- Multi-scale and dense prediction give high accuracy
  Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales. (1 Introduction p2)
- A 1×1 convolution can be viewed as a linear transformation of the channels that also adds a non-linear layer (ReLU)
  In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). (2.1 Architecture p1)
- Rule for choosing the padding: keep the feature-map resolution unchanged after convolution (this makes feature-map sizes easy to compute)
  the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution (2.1 Architecture p1)
- LRN does not improve accuracy (later papers largely dropped LRN)
  such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. (2.1 Architecture p3)
- Xavier initialisation already gives good results (Xavier, or Glorot, initialisation draws random weights whose variance is scaled by each layer's fan-in and fan-out, so that activation and gradient variances stay roughly constant across layers)
  It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010). (3.1 Training p2)
- When S is much larger than 224, a crop may contain only part of an object (multi-scale training can improve the training result)
  S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part (3.1 Training p4)
- Initialising the large-scale model from the small-scale model speeds up convergence
  To speed-up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 0.001. (3.1 Training p5)
- Objects come in different sizes, so multi-scale training can improve accuracy (an open question for later: I recall there is a pyramid model that implements this idea)
  Since objects in images can be of different size, multi scale training is beneficial to take this into account during training. (3.1 Training p6)
- With dense evaluation there is no need for multi-crop at test time; multi-crop recomputes the network for every crop and is therefore less efficient
  there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. (3.2 Testing p2)
- Multi-crop can be seen as a complement to dense evaluation, because the two handle boundaries differently
  Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions (3.2 Testing p2)
- A deep network with small filters outperforms a shallow network with large filters
  which confirms that a deep net with small filters outperforms a shallow net with larger filters. (4.1 Single scale evaluation p3)
- Scale jittering helps at both training and test time
  The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (4.2 Multi scale evaluation p2)