CV Basic Algorithms 03: GoogLeNet v1

Article link: https://blog.youkuaiyun.com/qq_22473333/article/details/108064817

Dongyang's study notes. Persistence is victory!

Paper outline

Introduction
Related Work
Motivation and High Level Considerations
Architectural Details
GoogLeNet
Training Methodology
ILSVRC 2014 Classification Challenge Setup and Results
ILSVRC 2014 Detection Challenge Setup and Results
Conclusions
Acknowledgements

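Research background and significance

Research background

NIN (Network in Network): an early convolutional neural network built around 1*1 convolutions; it drops the fully connected layers, which greatly reduces the number of parameters
Robust Object Recognition with Cortex-Like Mechanisms [15]: extracts features with multi-scale Gabor filters (multiple kernel sizes)
Hebbian principle:
"Cells that fire together, wire together"
(neurons that are activated together become connected)

Significance / key points

Opened the era of multi-scale convolution
Kicked off the widespread use of 1*1 convolutions
Paved the way for the GoogLeNet series

Network structure

GoogLeNet is built from BasicConv2d, Inception_module and InceptionAux_module (my own rough translations, I am not sure of the standard terms): a basic convolution layer, the Inception block used in the middle of the network, and an optional auxiliary-loss block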

    def __init__(self, num_classes=1000, aux_logits=True, transform_input=False, init_weights=None,
                 blocks=None):
        super(GoogLeNet, self).__init__()
        if blocks is None:
            blocks = [BasicConv2d, Inception, InceptionAux]
        if init_weights is None:
            warnings.warn('The default weight initialization of GoogleNet will be changed in future releases of '
                          'torchvision. If you wish to keep the old behavior (which leads to long initialization times'
                          ' due to scipy/scipy#11299), please set init_weights=True.', FutureWarning)
            init_weights = True
        assert len(blocks) == 3
        conv_block = blocks[0]
        inception_block = blocks[1]
        inception_aux_block = blocks[2]

        self.aux_logits = aux_logits
        self.transform_input = transform_input

        self.conv1 = conv_block(3, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool1 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.conv2 = conv_block(64, 64, kernel_size=1)
        self.conv3 = conv_block(64, 192, kernel_size=3, padding=1)
        self.maxpool2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception3a = inception_block(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = inception_block(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception4a = inception_block(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = inception_block(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = inception_block(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = inception_block(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = inception_block(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.inception5a = inception_block(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = inception_block(832, 384, 192, 384, 48, 128, 128)

        if aux_logits:
            self.aux1 = inception_aux_block(512, num_classes)
            self.aux2 = inception_aux_block(528, num_classes)
        else:
            self.aux1 = None
            self.aux2 = None

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(1024, num_classes)

        if init_weights:
            self._initialize_weights()
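
As a sanity check on the constructor above, the sketch below verifies that the concatenated output of each Inception block matches the input channel count of the next one. It assumes the positional arguments of `inception_block` are `(in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj)`; the post does not show the `Inception` class, so that ordering and the helper names `output_channels` and `check_chain` are assumptions made for illustration.

```python
# Each tuple is copied from the constructor above. Assumed argument order:
# (in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj)
INCEPTION_CONFIGS = {
    "3a": (192, 64, 96, 128, 16, 32, 32),
    "3b": (256, 128, 128, 192, 32, 96, 64),
    "4a": (480, 192, 96, 208, 16, 48, 64),
    "4b": (512, 160, 112, 224, 24, 64, 64),
    "4c": (512, 128, 128, 256, 24, 64, 64),
    "4d": (512, 112, 144, 288, 32, 64, 64),
    "4e": (528, 256, 160, 320, 32, 128, 128),
    "5a": (832, 256, 160, 320, 32, 128, 128),
    "5b": (832, 384, 192, 384, 48, 128, 128),
}


def output_channels(cfg):
    """Channels after concatenating the four branches of one block."""
    _, ch1x1, _, ch3x3, _, ch5x5, pool_proj = cfg
    return ch1x1 + ch3x3 + ch5x5 + pool_proj


def check_chain(configs):
    """Return the (previous, next) block pairs whose channel counts disagree."""
    names = list(configs)
    return [
        (prev, nxt)
        for prev, nxt in zip(names, names[1:])
        if output_channels(configs[prev]) != configs[nxt][0]
    ]


if __name__ == "__main__":
    print(check_chain(INCEPTION_CONFIGS))
    print(output_channels(INCEPTION_CONFIGS["5b"]))
```

Under that assumption the chain is consistent, the outputs of 4a and 4d have 512 and 528 channels (the inputs of `aux1` and `aux2`), and 5b ends at the 1024 features consumed by `self.fc`.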

Inception module

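[Figure: Inception module]
The figure above shows the following characteristics of the Inception module:

Multi-scale
1*1 convolutions for dimension reduction and information fusion (the 1*1 convolution lowers the channel count; "information fusion" most likely means that each output channel is a weighted mix of all input channels at the same spatial position)
3*3 max pooling keeps the number of feature maps unchanged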

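A naive-version example:

[Figure: naive Inception module example]
The example above shows that:
The amount of data grows sharply (which motivates the 1*1 dimension reduction introduced next)
The computational cost is high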

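A more mature example

[Figure: Inception module with 1*1 dimension reduction]
This more mature example introduces 1*1 convolutions to reduce the 256 channels. A 1*1 convolution does not lower the spatial resolution of the feature map.

Does not change the spatial resolution (note: localization, detection and human pose estimation rely heavily on spatial resolution)
Reduces the channel dimension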
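
To put a number on the saving, here is a rough weight count for one module with and without the 1*1 reduction layers. It uses the `inception3a` configuration from the constructor above (192 input channels) rather than the 256-channel figure, takes the 3*3 and 5*5 kernel sizes from the branch names, and ignores biases; the function names are made up for this sketch.

```python
def conv_weights(in_ch, out_ch, kernel):
    """Weights of a kernel x kernel convolution, biases ignored."""
    return in_ch * out_ch * kernel * kernel


def naive_weights(in_ch, ch1x1, ch3x3, ch5x5):
    """Naive module: every branch convolves the full input directly."""
    return (conv_weights(in_ch, ch1x1, 1)
            + conv_weights(in_ch, ch3x3, 3)
            + conv_weights(in_ch, ch5x5, 5))


def reduced_weights(in_ch, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
    """Module with 1*1 reductions before 3*3 and 5*5, and after pooling."""
    return (conv_weights(in_ch, ch1x1, 1)
            + conv_weights(in_ch, ch3x3red, 1) + conv_weights(ch3x3red, ch3x3, 3)
            + conv_weights(in_ch, ch5x5red, 1) + conv_weights(ch5x5red, ch5x5, 5)
            + conv_weights(in_ch, pool_proj, 1))


if __name__ == "__main__":
    naive = naive_weights(192, 64, 128, 32)
    reduced = reduced_weights(192, 64, 96, 128, 16, 32, 32)
    # In the naive module the pooling branch passes all 192 channels through.
    naive_out = 64 + 128 + 32 + 192
    reduced_out = 64 + 128 + 32 + 32
    print(naive, reduced, naive_out, reduced_out)
```

With these numbers the reduced module needs 163,328 weights instead of 387,072, and its output has 256 channels instead of 416, so the next module also starts from a smaller input.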

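Architecture overview

5 blocks
5 resolution reductions
The number of filters changes in a seemingly arbitrary way (I do not know how the values were chosen)
The output layer is a single FC layer (GoogLeNet has only one FC layer, apparently because FC layers carry too many parameters)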
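
The five resolution reductions can be traced with the usual output-size formula, using the kernel, stride, padding and `ceil_mode` values from the constructor above. This is a minimal sketch that assumes a 224*224 input and assumes the remaining layers (`conv2`, `conv3` and the Inception blocks) keep the spatial size; `out_size` and `trace` are illustrative helpers, not part of torchvision.

```python
import math

# (name, kernel, stride, padding, ceil_mode), read off the constructor above
STAGES = [
    ("conv1", 7, 2, 3, False),
    ("maxpool1", 3, 2, 0, True),
    ("maxpool2", 3, 2, 0, True),
    ("maxpool3", 3, 2, 0, True),
    ("maxpool4", 2, 2, 0, True),
]


def out_size(size, kernel, stride, padding=0, ceil_mode=False):
    """Spatial output size of a convolution or pooling layer."""
    span = size + 2 * padding - kernel
    if not ceil_mode:
        return span // stride + 1
    out = math.ceil(span / stride) + 1
    # With ceil_mode, a window that would start past the input is dropped.
    if (out - 1) * stride >= size + padding:
        out -= 1
    return out


def trace(size=224):
    """Sizes before conv1 and after each downsampling stage."""
    sizes = [size]
    for _, kernel, stride, padding, ceil_mode in STAGES:
        sizes.append(out_size(sizes[-1], kernel, stride, padding, ceil_mode))
    return sizes


if __name__ == "__main__":
    print(trace(224))
```

For a 224*224 input the trace is 224, 112, 56, 28, 14, 7, i.e. five halvings before the final average pooling.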

In the figure above, #3*3 reduce denotes the 1*1 convolution that reduces the channel count ahead of the 3*3 convolution layer

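Training details and tricks

Auxiliary loss layers

Two auxiliary classifiers are attached to the outputs of Inception4a and Inception4d (that is, the inputs of Inception4b and Inception4e), matching aux1 (512 channels) and aux2 (528 channels) in the constructor above. They compute auxiliary losses in order to:

Inject additional gradient signal during backpropagation (mitigating vanishing gradients)
Act as a regularizer, forcing the intermediate features to be discriminative as well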
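
A minimal sketch of how the auxiliary losses enter the training objective. The weight 0.3 is the value reported in the GoogLeNet paper and is not stated in this post; the function name and the plain-float losses are simplifications for illustration.

```python
AUX_WEIGHT = 0.3  # from the GoogLeNet paper, not from this post


def total_loss(main_loss, aux_losses, training=True, aux_weight=AUX_WEIGHT):
    """Combine the main loss with the discounted auxiliary losses.

    At inference time the auxiliary classifiers are discarded,
    so only the main loss is returned.
    """
    if not training:
        return main_loss
    return main_loss + aux_weight * sum(aux_losses)


if __name__ == "__main__":
    print(total_loss(1.0, [2.0, 3.0]))
    print(total_loss(1.0, [2.0, 3.0], training=False))
```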

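Learning rate decay schedule

GoogLeNet uses a very gentle learning rate decay (how much does the learning rate affect the result, why was the schedule designed this way, and is there a deeper reason?), as follows:
A 4% drop every 8 epochs: fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs)

0.96^100 ≈ 0.017, so after 800 epochs the learning rate has dropped by less than a factor of 100 (about 59x)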
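
The schedule can be written as a step decay. The sketch below assumes a hypothetical base learning rate of 0.01, since the quoted passage only gives the decay rule.

```python
def learning_rate(epoch, base_lr=0.01, drop=0.04, every=8):
    """Step decay: multiply by (1 - drop) once every `every` epochs.

    base_lr is a placeholder value chosen for this example.
    """
    return base_lr * (1.0 - drop) ** (epoch // every)


if __name__ == "__main__":
    for epoch in (0, 8, 80, 800):
        print(epoch, learning_rate(epoch))
    # Overall shrink factor after 800 epochs (100 decay steps)
    print(learning_rate(800) / learning_rate(0))
```

The ratio after 800 epochs equals 0.96^100, which is where the "less than 100 times" estimate comes from.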

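Data augmentation (this recipe applies only to ImageNet)

Crop area uniformly distributed between 8% and 100% of the image area
Aspect ratio in [3/4, 4/3]
Photometric distortions are effective, e.g. brightness, saturation and contrast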
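
The first two rules can be sketched as a crop sampler. This is only an illustration: the aspect ratio is drawn uniformly here, which is an assumption (the paper just says it is chosen randomly in that range), and the retry limit and whole-image fallback are choices made for this example.

```python
import math
import random


def sample_crop(width, height, rng,
                area_range=(0.08, 1.0), ratio_range=(3 / 4, 4 / 3),
                max_tries=10):
    """Return (x, y, w, h) of a random crop inside a width x height image."""
    for _ in range(max_tries):
        area = rng.uniform(*area_range) * width * height
        ratio = rng.uniform(*ratio_range)  # crop width / crop height
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Fallback when no sampled shape fits: use the whole image.
    return 0, 0, width, height


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_crop(640, 480, rng))
```

The sampled patch would then be resized to the network input size before being fed to the model.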

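Test-time tricks

Multi crop

One image becomes 144 crops:

Step1: Resize the image, keeping the aspect ratio, so that the shorter side is 256, 288, 320 or 352 (four scales).
Step2: Take 3 squares along the longer side: left/center/right or top/center/bottom (three positions).
Step3: From each square take the top-left, top-right, bottom-left, bottom-right and center crops plus the whole square resized (six crops).
Step4: Horizontal mirroring.

Total: 4*3*6*2 = 144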
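
The count can be checked by enumerating the combinations of the four steps. The labels below are placeholders for the actual image operations, which are not implemented in this sketch.

```python
from itertools import product

SCALES = (256, 288, 320, 352)              # Step1: shorter-side lengths
SQUARES = ("first", "middle", "last")      # Step2: squares along the longer side
CROPS = ("top_left", "top_right", "bottom_left",
         "bottom_right", "center", "full_resize")  # Step3
MIRRORS = (False, True)                    # Step4: original and mirrored


def enumerate_crops():
    """List every (scale, square, crop, mirrored) combination."""
    return list(product(SCALES, SQUARES, CROPS, MIRRORS))


if __name__ == "__main__":
    crops = enumerate_crops()
    print(len(crops))
    print(crops[0])
```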

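In practice you do not necessarily need all 144 crops; it depends on the situation (if you have compute to burn, go ahead).

Model Fusion

Model fusion: the seven models differ only in how images are sampled and in the order in which they are fed during training. (This differs from the usual ensembling practice; why was it done this way?)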
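
For the fusion step itself, the paper averages the softmax probabilities over all crops and all individual classifiers to obtain the final prediction. A minimal sketch of that averaging, with made-up probabilities and illustrative function names:

```python
def average_probabilities(prob_lists):
    """Element-wise mean of several per-class probability vectors."""
    count = len(prob_lists)
    num_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / count for c in range(num_classes)]


def predict(prob_lists):
    """Index of the class with the highest averaged probability."""
    avg = average_probabilities(prob_lists)
    return max(range(len(avg)), key=avg.__getitem__)


if __name__ == "__main__":
    # One probability vector per (model, crop) pair; the values are invented.
    outputs = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
    print(average_probabilities(outputs))
    print(predict(outputs))
```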

Sparse structure (an early application of the Hebbian principle to convolution)

Hebbian principle: (strongly correlated features cluster together)

https://zh.wikipedia.org/wiki/%E8%B5%AB%E5%B8%83%E7%90%86%E8%AE%BA

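The idea of using multiple convolution sizes is inspired by the Hebbian principle (a theory from neuroscience): break the uniform distribution so that strongly correlated features (convolutions at the same scale) are grouped together. (I am not sure my understanding is correct.)

Results analysis

Classification results:

Model fusion: multiple models are more accurate than a single model
Multi crops: the more crops, the higher the accuracy

Detection results

Model fusion: multiple models are more accurate than a single model; CNNs substantially improve object detection accuracy

Reflections and outlook

Pooling loses spatial resolution, yet it is still used in localization, detection and human pose estimation. Further note: these tasks depend heavily on spatial resolution information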
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19]. (2 Related Work p2)
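Increasing model depth and width effectively improves performance, but has two drawbacks: a tendency to overfit and excessive computational cost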
The most straightforward way of improving the performance of deep neural networks is by increasing their size. Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting.
The other drawback of uniformly increased network size is the dramatically increased use of computational resources. （3 Motivation p1 p2 p3）
To save memory, the resolution is reduced first and the Inception modules are stacked afterwards
For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion.（4 Architectural Details p5）
The final fully connected layer is there to make fine-tuning and transfer learning easier (there is only one FC layer)
we use an extra linear layer. This enables adapting and fine-tuning our networks for other label sets easily.（5 GoogLeNet p3）
Features from the middle layers of the network are also discriminative for classification (auxiliary loss)
One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. （5 GoogLeNet p4）
The learning rate schedule is a 4% drop every 8 epochs (the loss curve is very smooth)
fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). (6 Training Methodology p1)
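Data augmentation guidelines: 1. crop area between 8% and 100% of the image; 2. aspect ratio in [3/4, 4/3]; 3. photometric distortions help (augmentation should be chosen according to the characteristics of the dataset; the principle is to minimize the gap between the training set and the test set, or more generally any data the model has not seen)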
Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3. Also, we found that the photometric distortions by Andrew Howard [8] were useful to combat overfitting to some extent. (6 Training Methodology p2)
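Random interpolation methods for resizing may improve performance (they were introduced late, together with other hyperparameter changes, so the effect is hard to isolate)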
we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes. (6 Training Methodology p2)
In real applications, 144 crops may not be necessary
We note that such aggressive cropping may not be necessary in real applications. (7 Classification p5）