VGG Network Explained

Article link: https://blog.youkuaiyun.com/qq_22473333/article/details/108035345

Dongyang's study notes. Persistence is victory!

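Research significance

Ushered in the era of small convolution kernels: 3×3 kernels became the mainstream choice in later models
Served as a backbone for many vision tasks: classification, localisation, detection and segmentation have all been attempted with VGG as the backbone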

Network architecture

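The VGG network configurations are shown below:
[Figure: VGG network configurations]
As the figure shows, the paper gives six configurations: A, A-LRN, B, C, D and E (D is VGG16 and E is VGG19).

Similarities and differences among the configurations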

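Similarities:

Five max-pooling layers (each halves the spatial resolution)
After each max-pool the number of feature-map channels doubles, until it reaches the maximum of 512
Three FC layers produce the classification output
Between max-pools, several conv layers are stacked to extract and abstract features
Why start at 11 layers? (I do not know the exact reason, but the paper cites earlier work at this depth:)
Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition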

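Differences:

A: 11 weight layers (8 conv + 3 FC)
A-LRN: A plus one LRN layer
B: adds one 3×3 conv layer to each of blocks 1 and 2
C: adds one 1×1 conv layer to each of blocks 3, 4 and 5, which shows that extra non-linearity improves the metrics (every added conv layer comes with a ReLU)
D: replaces the 1×1 convs in blocks 3, 4 and 5 with 3×3 convs
E: adds one more 3×3 conv layer to each of blocks 3, 4 and 5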
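The differences listed above can be checked by counting weight layers. The following is a minimal sketch: the per-block kernel lists are transcribed by hand from the descriptions above, and the names CONFIGS and count_weight_layers are made up for illustration. A-LRN is left out because an LRN layer has no weights, so its count equals that of A.

```python
# Kernel sizes of the conv layers in each of the five blocks.
# A max-pool follows every block; three FC layers follow the last block.
# These lists are hand-transcribed from the text (an assumption, not library code).
CONFIGS = {
    "A": [[3], [3], [3, 3], [3, 3], [3, 3]],
    "B": [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3]],
    "C": [[3, 3], [3, 3], [3, 3, 1], [3, 3, 1], [3, 3, 1]],
    "D": [[3, 3], [3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]],
    "E": [[3, 3], [3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]],
}
NUM_FC_LAYERS = 3


def count_weight_layers(blocks):
    """Number of layers with learnable weights: all conv layers plus the FC layers."""
    return sum(len(block) for block in blocks) + NUM_FC_LAYERS


for name, blocks in CONFIGS.items():
    print(name, count_weight_layers(blocks))
```

Under these assumptions the counts are 11, 13, 16, 16 and 19, which is where the names VGG16 (D) and VGG19 (E) come from.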

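Greater depth does not greatly increase the number of parameters

The figure below gives the parameter counts and how they are computed
[Figure: parameter counts and their calculation]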
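As a rough check on the claim that depth adds few parameters, the counts can be recomputed by hand. This is a sketch under stated assumptions: 3×3 convs with biases, a 224×224 RGB input, three FC layers of 4096, 4096 and 1000 units, and the helper name count_params is hypothetical. The channel lists use numbers for conv output channels and "M" for a max-pool.

```python
CFGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}


def count_params(cfg, in_channels=3, input_size=224, num_classes=1000):
    """Return (conv parameters, FC parameters), counting weights and biases."""
    conv_total, size = 0, input_size
    for v in cfg:
        if v == "M":
            size //= 2  # a 2x2 max-pool with stride 2 halves the resolution
        else:
            conv_total += 3 * 3 * in_channels * v + v
            in_channels = v
    fc_total, fc_in = 0, in_channels * size * size
    for fc_out in (4096, 4096, num_classes):
        fc_total += fc_in * fc_out + fc_out
        fc_in = fc_out
    return conv_total, fc_total


for name, cfg in CFGS.items():
    conv_total, fc_total = count_params(cfg)
    print(name, conv_total, fc_total, conv_total + fc_total)
```

Under these assumptions A has about 133M parameters and E about 144M, roughly 8% more for 8 extra weight layers, because the three FC layers account for about 124M parameters in every configuration.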

VGG16 architecture diagram

[Figure: VGG16 architecture diagram]

Official PyTorch (torchvision) implementation of VGG

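import torch
import torch.nn as nn
from torch.hub import load_state_dict_from_url

# Note: model_urls, a dict mapping each arch name to its checkpoint URL,
# is not shown in this excerpt.


class VGG(nn.Module):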

    def __init__(self, features, num_classes=1000, init_weights=True):
        super(VGG, self).__init__()
        self.features = features
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


def make_layers(cfg, batch_norm=False):
    layers = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)


cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}


def _vgg(arch, cfg, batch_norm, pretrained, progress, **kwargs):
    if pretrained:
        kwargs['init_weights'] = False
    model = VGG(make_layers(cfgs[cfg], batch_norm=batch_norm), **kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls[arch],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model


def vgg11(pretrained=False, progress=True, **kwargs):
    r"""VGG 11-layer model (configuration "A") from
    `"Very Deep Convolutional Networks For Large-Scale Image Recognition" <https://arxiv.org/pdf/1409.1556.pdf>`_

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _vgg('vgg11', 'A', False, pretrained, progress, **kwargs)
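To see why the classifier's first layer takes 512 * 7 * 7 inputs, the feature-map shape can be traced through a cfg list without running PyTorch. This is only a sketch: trace_feature_shapes is a hypothetical helper, and it assumes a 224×224 input, 3×3 convs with padding 1 (resolution preserved) and 2×2 max-pools with stride 2.

```python
def trace_feature_shapes(cfg, in_channels=3, size=224):
    """Return (channels, height/width) after every max-pool in cfg."""
    shapes = []
    channels = in_channels
    for v in cfg:
        if v == "M":
            size //= 2  # the max-pool halves height and width
            shapes.append((channels, size))
        else:
            channels = v  # the conv changes channels only
    return shapes


VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

shapes = trace_feature_shapes(VGG16_CFG)
print(shapes)  # [(64, 112), (128, 56), (256, 28), (512, 14), (512, 7)]
channels, size = shapes[-1]
print(channels * size * size)  # 25088, i.e. 512 * 7 * 7
```

For a 224×224 input the features are already 7×7, so the AdaptiveAvgPool2d((7, 7)) layer changes nothing; it only matters for other input sizes.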

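Characteristics of VGG

Stacks of 3×3 kernels replace large kernels (7×7, 11×11)
This adds non-linearity, which strengthens the model's discriminative power (every 3×3 conv is followed by a ReLU)
It reduces the number of trainable parameters (small kernels)
A stack of 3×3 kernels can be seen as a regularisation of a 7×7 kernel, forcing the 7×7 filter to decompose into 3×3 filters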

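Assume both the input and the output have C channels
Parameters of one 7×7 conv layer: 7 × 7 × C × C = 49C^2
Parameters of three 3×3 conv layers: 3 × (3 × 3 × C × C) = 27C^2
Parameter reduction: (49 - 27) / 49 ≈ 45%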
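The arithmetic above, together with the reason a stack of 3×3 convs can stand in for a larger kernel, can be sketched as follows. The function names are made up for illustration, and the sketch assumes stride-1 convs, no biases, and C input and output channels in every layer.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convs with the same kernel size."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the field by kernel - 1 pixels
    return rf


def conv_weights(kernel, channels):
    """Weights of one conv layer with `channels` inputs and outputs (biases ignored)."""
    return kernel * kernel * channels * channels


C = 512
single_7x7 = conv_weights(7, C)
three_3x3 = 3 * conv_weights(3, C)
saving = 1 - three_3x3 / single_7x7   # fraction saved by the 3x3 stack
overhead = single_7x7 / three_3x3 - 1  # extra cost of the 7x7 layer

print(stacked_receptive_field(2), stacked_receptive_field(3))  # 5 7
print(f"saving: {saving:.1%}, 7x7 overhead: {overhead:.1%}")
```

The two percentages describe the same ratio from opposite sides: the stack saves about 45% relative to the 7×7 layer, while the 7×7 layer costs about 81% more than the stack.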

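Model C adds 1×1 convs on top of model B. C performs better than B, so why not add 1×1 convs to D and E as well? The paper's answer is that D, which uses 3×3 convs in those positions, performs better than C: extra non-linearity helps, but capturing spatial context with a larger receptive field helps more.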

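Training details

Data augmentation

The VGG paper uses two main kinds of data augmentation:

Position:
- Rescale the image isotropically (keeping the aspect ratio) so that its shortest side is S
- Crop a 224×224 region at a random position
- Flip horizontally at random
Colour:
- Perturb the RGB channel values (random colour shift)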

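Scale jittering

S can be set in two ways: a fixed value (256 or 384), or a value sampled at random for each image from a range such as [256, 512] (scale jittering)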
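The position augmentation and the choice of S described above can be sketched with the standard library alone. This works on image sizes rather than pixels, all function names are hypothetical, and the rounding of the longer side is an assumption; a real pipeline would apply the resulting box and flip flag to the image with an image library.

```python
import random

CROP = 224


def sample_scale(rng, fixed_s=None, s_min=256, s_max=512):
    """Training scale S: a fixed value, or jittered uniformly in [s_min, s_max]."""
    if fixed_s is not None:
        return fixed_s
    return rng.randint(s_min, s_max)


def rescaled_size(width, height, s):
    """Isotropic rescale so that the shortest side equals s."""
    ratio = s / min(width, height)
    return max(s, round(width * ratio)), max(s, round(height * ratio))


def random_crop_and_flip(rng, width, height, crop=CROP):
    """Pick a random crop box (left, top, right, bottom) and a flip flag."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    flip = rng.random() < 0.5
    return (left, top, left + crop, top + crop), flip


rng = random.Random(0)          # seeded only to make the sketch repeatable
s = sample_scale(rng)           # jittered S
w, h = rescaled_size(500, 375, s)
box, flip = random_crop_and_flip(rng, w, h)
print(s, (w, h), box, flip)
```

With a larger S the same 224×224 crop covers a smaller part of the image, which is how scale jittering exposes the network to objects at different sizes.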

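Model initialisation

The paper points out that deep networks are sensitive to initialisation:
(3.1) The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets.

Deeper networks are initialised from a shallower one
B, C, D and E are initialised from model A (the first four conv layers and the three FC layers are copied from A; the remaining layers are initialised randomly)
For multi-scale training, the network is initialised from a smaller-scale model
- For S = 384, initialise from the S = 256 model
- For S in [256, 512], initialise from the S = 384 model
(The paper also notes that Xavier initialisation, i.e. the random initialisation of Glorot & Bengio (2010), makes this pre-training unnecessary.)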
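For readers wondering what Xavier initialisation is, here is a minimal sketch of the uniform variant of Glorot & Bengio (2010). It assumes the commonly used bound sqrt(6 / (fan_in + fan_out)); the notes above do not say which variant the authors used, and the function names are hypothetical.

```python
import math
import random


def conv_fans(in_channels, out_channels, kernel):
    """fan_in and fan_out of a conv layer with a square kernel."""
    receptive = kernel * kernel
    return in_channels * receptive, out_channels * receptive


def xavier_uniform_limit(fan_in, fan_out):
    """Bound a such that weights are drawn from U(-a, a)."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(rng, in_channels, out_channels, kernel, n):
    """Draw n example weights for the given conv layer."""
    limit = xavier_uniform_limit(*conv_fans(in_channels, out_channels, kernel))
    return [rng.uniform(-limit, limit) for _ in range(n)]


fan_in, fan_out = conv_fans(64, 64, 3)
limit = xavier_uniform_limit(fan_in, fan_out)
weights = xavier_uniform(random.Random(0), 64, 64, 3, n=1000)
print(fan_in, fan_out, round(limit, 4))  # 576 576 0.0722
```

The bound shrinks as the layer gets wider, which keeps the variance of activations and gradients roughly constant from layer to layer.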

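Testing details

Multi-scale testing

The image is rescaled isotropically so that its shortest side is Q; three values of Q are used, and the predictions are averaged

Method 1, when S is fixed:
Q = {S - 32, S, S + 32}

Method 2, when S is jittered in [S_min, S_max]:
Q = {S_min, 0.5 * (S_min + S_max), S_max}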
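The two rules for choosing Q, and the averaging of the per-scale predictions, can be written down directly. This is a sketch with hypothetical function names; the example probabilities are invented for illustration.

```python
def evaluation_scales(train_scale):
    """Test scales Q for a fixed S (int) or a jittered S given as (s_min, s_max)."""
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        return [s_min, (s_min + s_max) // 2, s_max]
    return [train_scale - 32, train_scale, train_scale + 32]


def average_predictions(per_scale_probs):
    """Average class probabilities element-wise over the test scales."""
    n = len(per_scale_probs)
    return [sum(column) / n for column in zip(*per_scale_probs)]


print(evaluation_scales(384))         # [352, 384, 416]
print(evaluation_scales((256, 512)))  # [256, 384, 512]

# Invented two-class probabilities at three scales.
averaged = average_predictions([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])
```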

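Dense testing

Dense test: the FC layers are converted to convolutions, which turns the model into a fully convolutional network that accepts images of any size
[Figure: converting the FC layers to convolutions]

The fully convolutional network outputs an N×N×1000 class score map
The score map is averaged over its spatial dimensions (sum-pooling) to give a 1×1000 output vector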
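The size N of the score map can be worked out from the input size. In the paper the first FC layer becomes a 7×7 conv and the other two become 1×1 convs; the sketch below additionally assumes a square Q×Q input, five 2×2 max-pools and no padding in the converted 7×7 conv. The function names are hypothetical, and the score map is a plain nested list rather than a tensor.

```python
def dense_score_map_size(q, num_pools=5, fc6_kernel=7):
    """Side length N of the class score map for a square q x q input."""
    size = q
    for _ in range(num_pools):
        size //= 2  # each max-pool halves the resolution
    return size - fc6_kernel + 1  # a 7x7 conv without padding


def spatial_average(score_map):
    """score_map[i][j] is a list of class scores; average over all positions."""
    cells = [cell for row in score_map for cell in row]
    return [sum(values) / len(cells) for values in zip(*cells)]


for q in (224, 256, 384, 512):
    print(q, dense_score_map_size(q))  # N = 1, 2, 6, 10

# A 2x2 score map with two classes (invented scores).
scores = [[[1.0, 0.0], [0.0, 1.0]],
          [[1.0, 0.0], [1.0, 0.0]]]
print(spatial_average(scores))  # [0.75, 0.25]
```

The averaging runs over the N×N positions and leaves the 1000 class channels intact, which is why the result is a single 1000-dimensional vector.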

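Multi-crop testing

Following AlexNet and GoogLeNet, the image is evaluated on multiple 224×224 crops together with their horizontal flips. The image is rescaled to 3 scales and 50 crops are taken at each scale (50 = 5 × 5 × 2, a 5×5 regular grid with 2 flips), for 150 crops in total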
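The crop count above can be reproduced by enumerating a regular grid. This is a sketch under assumptions: the exact grid placement is not given in these notes, so evenly spaced offsets from edge to edge are assumed, images are taken to be square with side Q, and the function names are hypothetical.

```python
def grid_offsets(side, crop=224, grid=5):
    """Evenly spaced top-left offsets along one side of the image."""
    span = side - crop
    return [round(i * span / (grid - 1)) for i in range(grid)]


def multi_crop_boxes(width, height, crop=224, grid=5):
    """All (left, top, right, bottom, flipped) crops on a grid x grid layout."""
    boxes = []
    for top in grid_offsets(height, crop, grid):
        for left in grid_offsets(width, crop, grid):
            for flipped in (False, True):
                boxes.append((left, top, left + crop, top + crop, flipped))
    return boxes


per_scale = {q: multi_crop_boxes(q, q) for q in (256, 384, 512)}
total = sum(len(boxes) for boxes in per_scale.values())
print({q: len(boxes) for q, boxes in per_scale.items()}, total)
```

Every crop needs its own forward pass, so 150 crops cost 150 network evaluations per image; this is the inefficiency that dense evaluation avoids.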

Multi-crop & Dense

The soft-max outputs of the two methods are averaged.

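Results analysis

[Figure: evaluation results table]

Single-scale evaluation

Error decreases as depth increases and saturates once the model reaches 19 layers
Adding 1×1 convs improves performance (compare models C and B)
Scale jittering at training time improves performance (in the figure above, models trained with scale jittering beat those trained at a fixed scale)
In model B, replacing each pair of 3×3 convs with a single 5×5 conv raises the top-1 error by 7%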

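Multi-scale evaluation

[Figure: multi-scale evaluation results, with the compared entries circled in red]
Comparing the entries circled in red in the figure above shows that scale jittering at test time improves performance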

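Model fusion

[Figure: model fusion results]
Model fusion (ensembling) here means averaging the soft-max class posteriors of several models
The ILSVRC submission was an ensemble of 7 models

Using only the two best models
D/[256, 512]/256, 384, 512
E/[256, 512]/256, 384, 512
(configuration / training scale S / test scales Q), combined with multi-crop and dense evaluation, gives the best result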
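Model fusion by averaging soft-max posteriors can be sketched in a few lines. The logits below are invented for illustration and the function names are hypothetical; in practice each list would come from one trained network evaluated on the same image.

```python
import math


def softmax(logits):
    """Numerically stable soft-max over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(models_logits):
    """Average the soft-max class posteriors of several models."""
    posteriors = [softmax(logits) for logits in models_logits]
    n = len(posteriors)
    return [sum(column) / n for column in zip(*posteriors)]


logits_d = [2.0, 1.0, 0.1]  # hypothetical output of model D
logits_e = [1.5, 1.8, 0.2]  # hypothetical output of model E
fused = fuse([logits_d, logits_e])
prediction = max(range(len(fused)), key=fused.__getitem__)
print([round(p, 3) for p in fused], prediction)
```

In this example model E alone would pick class 1, but the fused posterior favours class 0 because model D is more confident; averaging lets the models' errors partly cancel out.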

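Key points and innovations of the paper

Stacking small kernels to deepen the network (two 3×3 convs replace one 5×5, three 3×3 convs replace one 7×7)
Scale jittering at training time
Multi-scale evaluation and dense + multi-crop evaluation at test time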

Reflections and outlook

Small kernels act somewhat like a regularisation of large kernels and cut the parameter count substantially (one 7×7 conv layer needs 81% more parameters than three 3×3 conv layers, so the stack saves about 45%)
This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected [ReLU] in between) (2.3 Discussion)
Multi-scale and dense prediction give high accuracy
Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales. (1 Introduction p2)
A 1×1 conv can be seen as a linear transformation of the input channels, followed by a non-linearity (ReLU)
In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). (2.1 Architecture p1)
Padding rule: keep the feature-map resolution unchanged after convolution (which makes feature-map sizes easy to compute)
the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution (2.1 Architecture p1)
LRN does not improve accuracy (later papers largely dropped LRN)
such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. (2.1 Architecture p3)
Xavier initialisation gives good results (Xavier initialisation is the random initialisation procedure of Glorot & Bengio (2010))
It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010). (3.1 Training p2)
When S is much larger than 224, a crop may contain only part of an object (multi-scale training improves the training result)
S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part (3.1 Training p4)
Initialising a large-scale model from a small-scale model speeds up convergence
To speed-up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 0.001. (3.1 Training p5)
Objects vary in size, so multi-scale training improves accuracy (an open question for later: there seem to be pyramid models that address this as well)
Since objects in images can be of different size, multi scale training is beneficial to take this into account during training. (3.1 Training p6)
With dense evaluation, multi-crop is not required at test time; it repeats computation for every crop and is therefore less efficient
there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. (3.2 Testing p2)
Multi-crop evaluation complements dense evaluation because the two handle convolution boundaries differently
Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions (3.2 Testing p2)
A deep network with small filters outperforms a shallow network with larger filters
which confirms that a deep net with small filters outperforms a shallow net with larger filters. (4.1 Single scale evaluation p3)
Scale jittering helps at both training and test time
The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (4.2 Multi scale evaluation p2)