VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
ABSTRACT
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
1 INTRODUCTION
Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014) which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012). In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).
With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.
As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models to facilitate further research.
The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.
2 CONVNET CONFIGURATIONS
To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.
2.1 ARCHITECTURE
During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only preprocessing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.
A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).
2.2 CONFIGURATIONS
The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from left (A) to right (E) as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv<receptive field size>-<number of channels>". The ReLU activation function is not shown for brevity.
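To make the layer layout of Table 1 concrete, the following is a minimal PyTorch-style sketch (not the authors' original Caffe implementation) that builds a network from a configuration list. The lists for configurations A, D and E are transcribed from Table 1, with "M" denoting a 2 × 2 max-pooling layer; the builder and its names (make_vgg, make_features, CONFIGS) are illustrative only.

```python
import torch.nn as nn

# Layer lists transcribed from Table 1; "M" marks a 2x2 max-pool with stride 2.
CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def make_features(cfg):
    """Stack of 3x3 convolutions (stride 1, padding 1) and ReLUs, as in Sect. 2.1."""
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

def make_vgg(name, num_classes=1000):
    features = make_features(CONFIGS[name])
    # Three fully-connected layers: 4096, 4096, 1000 (Sect. 2.1); dropout as in Sect. 3.1.
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),
    )
    return nn.Sequential(features, classifier)
```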
In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

Table 2: Number of parameters (in millions).
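As a sanity check, parameter counts of the same order as Table 2 can be reproduced from the builder sketched above (roughly 133M, 138M and 144M for A, D and E; the exact figures depend on counting conventions):

```python
for name in ("A", "D", "E"):
    net = make_vgg(name)
    n_params = sum(p.numel() for p in net.parameters())
    print(f"config {name}: {n_params / 1e6:.0f}M parameters")
```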
2.3 DISCUSSION
Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. 11 × 11 with stride 4 in (Krizhevsky et al., 2012), or 7 × 7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
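The parameter comparison above is easy to verify numerically; a small sketch (the symbols C and k, for channel count and kernel size, are introduced here for illustration):

```python
def stack_params(num_layers, k, C):
    """Weights in a stack of num_layers k x k conv. layers with C input/output channels (biases ignored)."""
    return num_layers * (k * k * C * C)

C = 512
three_3x3 = stack_params(3, 3, C)   # 27 * C^2
single_7x7 = stack_params(1, 7, C)  # 49 * C^2
print(single_7x7 / three_3x3 - 1)   # ~0.81, i.e. the single 7x7 layer needs ~81% more weights
```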
The incorporation of 1 × 1 conv. layers (configuration C, Table 1) is a way to increase the nonlinearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1 × 1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1 × 1 conv. layers have recently been utilised in the "Network in Network" architecture of Lin et al. (2014).
Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3 × 3, they also use 1 × 1 and 5 × 5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model outperforms that of Szegedy et al. (2014) in terms of the single-network classification accuracy.
3 CLASSIFICATION FRAMEWORK
In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.
3.1 TRAINING
The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to 5·10^-4) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10^-2, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
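In a modern framework these optimisation settings map onto a few lines. The following is a hedged PyTorch sketch (the original training used a modified Caffe; it assumes the make_vgg builder sketched in Sect. 2.2, and the plateau-based learning-rate drop is approximated with ReduceLROnPlateau):

```python
import torch

net = make_vgg("D")
# SGD with momentum 0.9, weight decay (L2) 5e-4, initial learning rate 1e-2.
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when the validation accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)
# Cross-entropy over soft-max outputs = the multinomial logistic regression objective.
criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# After each validation pass:
# scheduler.step(val_top1_accuracy)
```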
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with zero mean and 10^-2 variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
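A sketch of the random initialisation described here. Note the text states a variance of 10^-2; the widely circulated reference implementations use a standard deviation of 0.01, so the value to use is left as a parameter rather than asserted:

```python
import torch.nn as nn

def init_weights(module, std=0.01):
    # Zero-mean normal for weights, zeros for biases. The paper says "10^-2 variance"
    # (std = 0.1); the common Caffe reading is std = 0.01. Pick via `std`.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        nn.init.zeros_(module.bias)

net_A = make_vgg("A")
net_A.apply(init_weights)

# Alternative mentioned in the text: the Glorot & Bengio (2010) procedure,
# available as Xavier initialisation in most frameworks.
def xavier_init(module):
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)
```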
To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.
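A torchvision-style sketch of the crop-and-flip augmentation (S, the training scale, is defined in the next paragraph; the random RGB colour shift of Krizhevsky et al. (2012) is a PCA-based lighting perturbation and is only indicated by a placeholder comment rather than re-implemented; the normalisation constants are the usual ImageNet channel means, standing in for the training-set mean RGB value):

```python
from torchvision import transforms

S = 256  # training scale: smallest image side after isotropic rescaling (see below)

train_transform = transforms.Compose([
    transforms.Resize(S),               # isotropic rescaling: smallest side -> S
    transforms.RandomCrop(224),         # one random 224x224 crop per image per iteration
    transforms.RandomHorizontalFlip(),  # random horizontal flipping
    # random RGB colour shift (PCA lighting noise of Krizhevsky et al., 2012) would go here
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[1.0, 1.0, 1.0]),  # mean subtraction only
])
```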
Training image size. Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to 224 × 224, in principle S can take on any value not less than 224: for S = 224 the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.
We consider two approaches for setting the training scale S. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384. Given a ConvNet configuration, we first trained the network using S = 256. To speed-up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 10^-3.
The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [S_min, S_max] (we used S_min = 256 and S_max = 512). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.
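Multi-scale training only changes the rescaling step: S is drawn per image from [S_min, S_max] before cropping. A minimal sketch of that sampling step:

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as F

S_MIN, S_MAX = 256, 512

def scale_jitter_crop(img):
    """Randomly sample the training scale S, isotropically rescale, then take a 224x224 crop."""
    s = random.randint(S_MIN, S_MAX)     # S ~ U[256, 512]
    img = F.resize(img, s)               # smallest side -> S
    return transforms.RandomCrop(224)(img)
```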
3.2 TESTING
At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.
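The FC-to-convolution conversion is mechanical: the first FC layer becomes a 7 × 7 convolution and the remaining two become 1 × 1 convolutions, with the weights simply reshaped. A hedged PyTorch sketch, assuming the make_vgg layout above (which keeps the three FC layers at indices 1, 4 and 7 of net[1]):

```python
import torch.nn as nn

def to_fully_convolutional(net):
    features, classifier = net[0], net[1]
    fc1, fc2, fc3 = classifier[1], classifier[4], classifier[7]

    conv1 = nn.Conv2d(512, 4096, kernel_size=7)      # first FC layer -> 7x7 conv.
    conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
    conv1.bias.data = fc1.bias.data

    conv2 = nn.Conv2d(4096, 4096, kernel_size=1)     # second FC layer -> 1x1 conv.
    conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
    conv2.bias.data = fc2.bias.data

    conv3 = nn.Conv2d(4096, 1000, kernel_size=1)     # last FC layer -> 1x1 conv.
    conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
    conv3.bias.data = fc3.bias.data

    return nn.Sequential(features,
                         conv1, nn.ReLU(inplace=True),
                         conv2, nn.ReLU(inplace=True),
                         conv3)

def dense_predict(fcn, image):
    """Apply the fully-convolutional net to the whole image and spatially average the class score map."""
    score_map = fcn(image)             # shape: (1, 1000, H', W')
    return score_map.mean(dim=(2, 3))  # fixed-size vector of class scores
```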
Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014).
3.3 IMPLEMENTATION DETAILS
Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.
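The synchronous data-parallel scheme described here (split each batch across GPUs, average the gradients) is what standard data-parallel wrappers implement today. A minimal sketch, not the authors' custom Caffe modification:

```python
import torch
import torch.nn as nn

net = make_vgg("D")
if torch.cuda.device_count() > 1:
    # Each GPU processes a slice of the 256-image batch; the gradients are combined,
    # so the result matches single-GPU training on the full batch.
    net = nn.DataParallel(net)
net = net.cuda()
```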
While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.
4 CLASSIFICATION EXPERIMENTS
Dataset. In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories.
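For reference, both error measures can be computed from the predicted class scores as follows (a short sketch; scores is an N × 1000 tensor of class scores and labels holds the ground-truth indices):

```python
import torch

def top_k_error(scores, labels, k):
    """Fraction of images whose ground-truth class is not among the k highest-scoring classes."""
    topk = scores.topk(k, dim=1).indices                 # (N, k)
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return 1.0 - correct.float().mean().item()

# top-1 error: top_k_error(scores, labels, 1)
# top-5 error: top_k_error(scores, labels, 5)
```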
For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).
4.1 SINGLE SCALE EVALUATION
We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2. The test image size was set as follows: Q = S for fixed S, and Q = 0.5(S_min + S_max) for jittered S ∈ [S_min, S_max]. The results are shown in Table 3. First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E).

Table 3: ConvNet performance at a single test scale.
Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E. Notably, in spite of the same depth, the configuration C (which contains three 1 × 1 conv. layers) performs worse than the configuration D, which uses 3 × 3 conv. layers throughout the network. This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C). The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.
Finally, scale jittering at training time (S ∈ [256; 512]) leads to significantly better results than training on images with fixed smallest side (S = 256 or S = 384), even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
4.2 MULTI-SCALE EVALUATION
Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time. It consists of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed S were evaluated over three test image sizes, close to the training one: Q = {S − 32, S, S + 32}. At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained with variable S ∈ [S_min; S_max] was evaluated over a larger range of sizes Q = {S_min, 0.5(S_min + S_max), S_max}.
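Test-time scale jittering then amounts to averaging the soft-max posteriors obtained at the different values of Q; a sketch, assuming the to_fully_convolutional and dense_predict helpers from Sect. 3.2 and a hypothetical rescale_smallest_side helper:

```python
import torch
import torch.nn.functional as F

def multi_scale_posteriors(fcn, image, test_scales):
    posteriors = []
    for q in test_scales:                            # e.g. {S - 32, S, S + 32}
        rescaled = rescale_smallest_side(image, q)   # hypothetical helper: smallest side -> Q
        scores = dense_predict(fcn, rescaled)
        posteriors.append(F.softmax(scores, dim=1))
    return torch.stack(posteriors).mean(dim=0)       # average class posteriors over scales
```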
The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale, shown in Table 3). As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side S. Our best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (highlighted in bold in Table 4). On the test set, the configuration E achieves 7.3% top-5 error.

Table 4: ConvNet performance at multiple test scales.
4.3 MULTI-CROP EVALUATION
In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details). We also assess the complementarity of the two evaluation techniques by averaging their soft-max outputs. As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them. As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.

Table 5: ConvNet evaluation techniques comparison. In all experiments the training scale S was sampled from [256; 512], and three test scales Q were considered: {256, 384, 512}.
4.4 CONVNET FUSION
Up until now, we evaluated the performance of individual ConvNet models. In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors. This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014).
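Model fusion is the same averaging applied across networks rather than scales; a brief sketch on top of the multi_scale_posteriors helper above:

```python
def ensemble_posteriors(models, image, test_scales):
    # Average the soft-max class posteriors of several ConvNets, each evaluated
    # with the multi-scale procedure sketched in Sect. 4.2.
    per_model = [multi_scale_posteriors(m, image, test_scales) for m in models]
    return sum(per_model) / len(per_model)
```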
The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5).

Table 6: Multiple ConvNet fusion results.
4.5 COMPARISON WITH THE STATE OF THE ART
Finally, we compare our results with the state of the art in Table 7. In the classification task of the ILSVRC-2014 challenge (Russakovsky et al., 2014), our "VGG" team secured the 2nd place with 7.3% test error using an ensemble of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models.

Table 7: Comparison with the state of the art in ILSVRC classification. Our method is denoted as "VGG". Only the results obtained without outside training data are reported.
As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it. This is remarkable, considering that our best result is achieved by combining just two models, significantly less than used in most ILSVRC submissions. In terms of the single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%. Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.
5 CONCLUSION
In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.
REFERENCES
Bell, S., Upchurch, P., Snavely, N., and Bala, K. Material recognition in the wild with the materials in context database. CoRR, abs/1412.0623, 2014.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC., 2014.
Cimpoi, M., Maji, S., and Vedaldi, A. Deep convolutional filter banks for texture recognition and segmentation. CoRR, abs/1411.6836, 2014.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. In IJCAI, pp. 1237–1242, 2011.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. Large scale distributed deep networks. In NIPS, pp. 1232–1240, 2012.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C., Winn, J., and Zisserman, A. The Pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE CVPR Workshop of Generative Model Based Vision, 2004.
Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524v5, 2014. Published in Proc. CVPR, 2014.
Gkioxari, G., Girshick, R., and Malik, J. Actions and attributes from wholes and parts. CoRR, abs/1412.2604, 2014.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, volume 9, pp. 249–256, 2010.
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In Proc. ICLR, 2014.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
He, K., Zhang, X., Ren, S., and Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729v2, 2014.
Hoai, M. Regularized max pooling for image categorization. In Proc. BMVC., 2014.
Howard, A. G. Some improvements on deep convolutional neural network based image classification. In Proc. ICLR, 2014.
Jia, Y. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. CoRR, abs/1412.2306, 2014.
Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
Lin, M., Chen, Q., and Yan, S. Network in network. In Proc. ICLR, 2014.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. In Proc. CVPR, 2014.
Perronnin, F., Sánchez, J., and Mensink, T. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.
Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CoRR, abs/1403.6382, 2014.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In Proc. ICLR, 2014.
Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199, 2014. Published in Proc. NIPS, 2014.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
Wei, Y., Xia, W., Huang, J., Ni, B., Dong, J., Zhao, Y., and Yan, S. CNN: Single-label to multi-label. CoRR, abs/1406.5726, 2014.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV, 2014.
A LOCALISATION
In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth. In this section, we turn to the localisation task of the challenge, which we won in 2014 with 25.3% error. It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class. For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications. Our method is described in Sect. A.1 and evaluated in Sect. A.2.
A.1 LOCALISATION CONVNET
To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores. A bounding box is represented by a 4-D vector storing its center coordinates, width, and height. There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR). In the former case, the last layer is 4-D, while in the latter it is 4000-D (since there are 1000 classes in the dataset). Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task (Sect. 4).
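In other words, only the last layer of configuration D changes: it regresses bounding box coordinates instead of class scores, with an output of 4 values for SCR or 4 per class (4000 in total) for PCR. A hedged sketch on top of the make_vgg builder above (the helper name is illustrative):

```python
import torch.nn as nn

def make_localisation_net(per_class=True, num_classes=1000):
    net = make_vgg("D")
    out_dim = 4 * num_classes if per_class else 4   # PCR: 4000-D output, SCR: 4-D output
    # Replace the last fully-connected layer: it now predicts (cx, cy, w, h) per box.
    net[1][7] = nn.Linear(4096, out_dim)
    return net

# Training (Sect. A.1) swaps the logistic regression objective for a Euclidean loss
# on the box parameters, e.g. nn.MSELoss() in this sketch.
```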
Training. Training of localisation ConvNets is similar to that of the classification ConvNets (Sect. 3.1). The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth. We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission). Training was initialised with the corresponding classification models (trained on the same scales), and the initial learning rate was set to 10^-3. We explored both fine-tuning all layers and fine-tuning only the first two fully-connected layers, as done in (Sermanet et al., 2014). The last fully-connected layer was initialised randomly and trained from scratch.
Testing. We consider two testing protocols. The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class (to factor out the classification errors). The bounding box is obtained by applying the network only to the central crop of the image.
The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2). The difference is that instead of the class score map, the output of the last fully-connected layer is a set of bounding box predictions. To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet. When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union. We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results.
A.2 LOCALISATION EXPERIMENTS
In this section we first determine the best-performing localisation setting (using the first test protocol), and then evaluate it in a fully-fledged scenario (the second protocol). The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5.
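The correctness criterion is the usual intersection-over-union test; a short sketch, with boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.5):
    return iou(pred_box, gt_box) > threshold
```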
Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR. We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers (as done in (Sermanet et al., 2014)). In these experiments, the smallest image side was set to S = 384; the results with S = 256 exhibit the same behaviour and are not shown for brevity.

Table 8: Localisation error for different modifications with the simplified testing protocol: the bounding box is predicted from a single central image crop, and the ground-truth class is used. All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR).
Fully-fledged evaluation. Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014). As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8), despite using the top-5 predicted class labels instead of the ground truth. Similarly to the classification task (Sect. 4), testing at several scales and combining the predictions of multiple networks further improves the performance.

Table 9: Localisation error.
Comparison with the state of the art. We compare our best localisation result with the state of the art in Table 10. With 25.3% test error, our "VGG" team won the localisation challenge of ILSVRC-2014 (Russakovsky et al., 2014). Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used fewer scales and did not employ their resolution enhancement technique. We envisage that better localisation performance can be achieved if this technique is incorporated into our method. This indicates the performance advancement brought by our very deep ConvNets: we got better results with a simpler localisation method, but a more powerful representation.

Table 10: Comparison with the state of the art in ILSVRC localisation. Our method is denoted as "VGG".
B GENERALISATION OF VERY DEEP FEATURES
In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset. In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting. Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin. Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods. In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4) – configurations “Net-D” and “Net-E” (which we made publicly available).
To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales. The resulting image descriptor is L2-normalised and combined with a linear SVM classifier, trained on the target dataset. For simplicity, pre-trained ConvNet weights are kept fixed (no fine-tuning is performed).
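A minimal sketch of this transfer pipeline (PyTorch for the feature extractor, scikit-learn for the linear SVM; it assumes the make_vgg layout above, is simplified to a single fixed-size input, and the dense multi-scale aggregation of the next paragraph would replace the fixed-size forward pass):

```python
import torch
import torch.nn.functional as F
from sklearn.svm import LinearSVC

def extract_descriptor(net, image):
    """4096-D penultimate-layer activations, L2-normalised; `image` is a preprocessed 1x3x224x224 tensor."""
    net.eval()                          # fixed weights, dropout disabled
    features, classifier = net[0], net[1]
    with torch.no_grad():
        x = features(image)
        x = classifier[:7](x)           # stop before the final 1000-way layer
    return F.normalize(x, p=2, dim=1)   # L2 normalisation

# descriptors: array of shape (num_images, 4096); labels: target-dataset labels
# svm = LinearSVC(C=1.0).fit(descriptors, labels)
```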
Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2). Namely, an image is first rescaled so that its smallest side equals Q, and then the network is densely applied over the image plane (which is possible when all weight layers are treated as convolutional). We then perform global average pooling on the resulting feature map, which produces a 4096-D image descriptor. The descriptor is then averaged with the descriptor of a horizontally flipped image. As was shown in Sect. 4.2, evaluation over multiple scales is beneficial, so we extract features over several scales Q. The resulting multi-scale features can be either stacked or pooled across scales. Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality. We return to the discussion of this design choice in the experiments below. We also assess late fusion of features, computed using two networks, which is performed by stacking their respective image descriptors.
Image Classification on VOC-2007 and VOC-2012. We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al., 2015). These datasets contain 10K and 22.5K images respectively, and each image is annotated with one or several labels, corresponding to 20 object categories. The VOC organisers provide a pre-defined split into training, validation, and test data (the test data for VOC-2012 is not publicly available; instead, an official evaluation server is provided). Recognition performance is measured using mean average precision (mAP) across classes.
Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking. We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit. Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales: Q ∈ {256, 384, 512, 640, 768}. It is worth noting though that the improvement over a smaller range of {256, 384, 512} was rather marginal (0.3%).
The test set performance is reported and compared with other approaches in Table 11. Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results. Our methods set the new state of the art across image representations, pretrained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%6\%6%. It should be noted that the method of Wei et al. (2014), which achieves 1%1\%1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes additional 1000 categories, semantically close to those in VOC datasets. It also benefits from the fusion with an object detection-assisted classification pipeline.
测试集性能已在表11中报告并与其他方法进行对比。我们的网络"Net-D"和"Net-E"在VOC数据集上表现出相同性能,二者的组合略微提升了结果。我们的方法在基于ILSVRC数据集预训练的图像表征中创造了新纪录,比Chatfield等人(2014)之前的最佳结果提高了6%以上。值得注意的是,Wei等人(2014)的方法在VOC-2012上实现了高出1%的mAP,该方法基于扩展的2000类ILSVRC数据集(额外包含1000个与VOC数据集语义相近的类别)进行预训练,并受益于与目标检测辅助分类流程的融合。

Table 11: Comparison with the state of the art in image classification on VOC-2007, VOC-2012, Caltech-101, and Caltech-256. Our models are denoted as "VGG". Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
表11:在VOC-2007、VOC-2012、Caltech-101和Caltech-256数据集上与现有图像分类最佳方法的对比。我们的模型标记为"VGG"。带有*号的结果是通过在扩展的ILSVRC数据集(2000个类别)上预训练的卷积神经网络实现的。
Image Classification on Caltech-101 and Caltech-256. In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks. Caltech-101 contains 9K images labelled into 102 classes (101 object categories and a background class), while Caltech-256 is larger with 31K images and 257 classes. A standard evaluation protocol on these datasets is to generate several random splits into training and test data and report the average recognition performance across the splits, which is measured by the mean class recall (which compensates for a different number of test images per class). Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class. On Caltech-256 we also generated 3 splits, each of which contains 60 training images per class (and the rest is used for testing). In each split, 20% of training images were used as a validation set for hyper-parameter selection.
Caltech-101和Caltech-256图像分类评估。本节我们在Caltech-101(Fei-Fei等人,2004)和Caltech-256(Griffin等人,2007)基准数据集上评估深度特征性能。Caltech-101包含9千张图像,标记为102个类别(101个物体类别和1个背景类别);Caltech-256规模更大,包含3.1万张图像和257个类别。标准评估流程是随机生成若干训练集/测试集划分,并报告各划分的平均识别性能(通过类别平均召回率衡量,该指标可补偿不同类别测试图像数量的差异)。遵循Chatfield等人(2014)、Zeiler & Fergus(2013)及He等人(2014)的方法:在Caltech-101上生成3次随机划分,每次划分每类含30张训练图像和最多50张测试图像;在Caltech-256上同样生成3次划分,每次每类含60张训练图像(剩余用于测试)。每次划分中,20%的训练图像作为验证集用于超参数选择。
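The split generation and the mean class recall metric described above can be sketched as follows; this is a minimal NumPy illustration under the stated protocol, not the exact evaluation scripts of the cited works.

```python
import numpy as np

def random_split(labels, n_train, n_test=None, seed=0):
    # One random split: n_train images per class for training and up to
    # n_test (or all remaining, if None) images per class for testing.
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train])
        rest = idx[n_train:]
        test_idx.extend(rest if n_test is None else rest[:n_test])
    return np.array(train_idx), np.array(test_idx)

def mean_class_recall(y_true, y_pred):
    # Recall computed per class and then averaged, which compensates for the
    # varying number of test images per class.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

Under this protocol, a Caltech-101 split would correspond to `random_split(labels, n_train=30, n_test=50)` and a Caltech-256 split to `random_split(labels, n_train=60)`, with 20% of the training indices held out for hyper-parameter selection.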
We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling. This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations. We used three scales Q ∈ {256, 384, 512}.
我们发现,与VOC不同,在Caltech数据集上,通过多尺度计算得到的描述符堆叠(stacking)表现优于平均池化或最大池化。这一现象可以解释为:在Caltech图像中,目标通常占据整个画面,因此多尺度图像特征具有不同的语义含义(完整目标 vs 局部细节)。通过堆叠操作,分类器能够有效利用这种尺度特异性表征。实验中我们采用了三个尺度参数Q ∈ {256, 384, 512}。
Our models are compared to each other and the state of the art in Table 11. As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance. On Caltech-101, our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007. On Caltech-256, our features outperform the state of the art (Chatfield et al., 2014) by a large margin (8.6%).
我们的模型相互之间以及与现有技术水平在表11中进行了比较。可以看出,更深的19层Net-E比16层Net-D表现更好,而它们的组合进一步提高了性能。在Caltech-101数据集上,我们的表征方法与He等人(2014)的方法相当,但后者在VOC-2007数据集上的表现明显不如我们的网络。在Caltech-256数据集上,我们的特征以较大优势(8.6%)超越了Chatfield等人(2014)的现有最佳结果。
Action Classification on VOC-2012. We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action. The dataset contains 4.6K training images, labelled into 11 classes. Similarly to the VOC-2012 object classification task, the performance is measured using the mAP. We considered two training settings: (i) computing the ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation. The results are compared to other approaches in Table 12.

Table 12: Comparison with the state of the art in single-image action classification on VOC-2012. Our models are denoted as "VGG". Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (1512 classes).
表12:VOC-2012数据集单图像动作分类与现有技术的对比。我们的模型标注为"VGG"。标注*的结果是通过在扩展版ILSVRC数据集(1512个类别)上预训练的卷积神经网络实现的。
VOC-2012行为分类任务评估。我们还在PASCAL VOC-2012行为分类任务(Everingham等人,2015)上评估了性能最佳的图像表征方法(Net-D与Net-E特征的堆叠组合)。该任务要求根据执行动作的人物边界框,从单张图像中预测动作类别。数据集包含4.6千张训练图像,被标记为11个类别。与VOC-2012目标分类任务类似,其性能通过平均精度均值(mAP)进行衡量。我们采用两种训练设置:(i) 在整个图像上计算卷积网络特征,忽略提供的边界框;(ii) 同时在整个图像和提供边界框区域计算特征,并通过堆叠(stacking)获得最终表征。表12展示了与其他方法的对比结果。
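The two training settings can be summarised by the following sketch, which reuses the hypothetical `multi_scale_descriptor` from the aggregation example above (passed in here as `descriptor_fn`); `person_box` is assumed to be pixel coordinates (x0, y0, x1, y1).

```python
import numpy as np

def action_descriptor(image, person_box, descriptor_fn, use_box=True):
    # Setting (i): features of the whole image only, ignoring the box.
    # Setting (ii): stack the whole-image features with the features computed
    # on the person bounding-box crop.
    full = descriptor_fn(image)
    if not use_box:
        return full
    x0, y0, x1, y1 = person_box
    crop = image[y0:y1, x0:x1]
    return np.concatenate([full, descriptor_fn(crop)])
```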
Our representation achieves the state of the art on the VOC action classification task even without using the provided bounding boxes, and the results are further improved when using both images and bounding boxes. Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.
我们的表征方法即使不使用提供的边界框,也能在VOC动作分类任务上达到最先进水平,而当同时使用图像和边界框时,结果会进一步提升。与其他方法不同,我们没有采用任何特定任务的启发式方法,而是依赖于非常深的卷积特征的表征能力。
Other Recognition Tasks. Since the public release of our models, they have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming more shallow representations. For instance, Girshick et al. (2014) achieve state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model. Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), and texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
其他识别任务。自我们的模型公开发布以来,它们已被研究界积极用于各种图像识别任务,始终优于较浅层的表征方法。例如,Girshick等人(2014年)通过用我们的16层模型替换Krizhevsky等人(2012年)的卷积网络,取得了最先进的物体检测结果。在语义分割(Long等人,2014年)、图像描述生成(Kiros等人,2014年;Karpathy & Fei-Fei,2014年)以及纹理和材质识别(Cimpoi等人,2014年;Bell等人,2014年)中,也观察到了相对于Krizhevsky等人(2012年)较浅架构的类似提升。
C PAPER REVISIONS
Here we present the list of major paper revisions, outlining the substantial changes for the convenience of the reader.
以下为主要论文修订清单,为方便读者查阅,特此列出重大修改内容。
v1 Initial version. Presents the experiments carried out before the ILSVRC submission.
v1 初始版本。呈现了在提交ILSVRC之前进行的实验。
v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.
v2 在提交后增加了ILSVRC实验,采用尺度抖动进行训练集增强,从而提升了性能。
v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets. The models used for these experiments are publicly available.
v3 在PASCAL VOC和Caltech图像分类数据集上增加了泛化实验(附录B)。这些实验使用的模型均已公开。
v4 The paper is converted to ICLR-2015 submission format. Also adds experiments with multiple crops for classification.
v4 论文转换为ICLR-2015提交格式。同时增加了多裁剪分类实验。
v6 Camera-ready ICLR-2015 conference paper. Adds a comparison of the net B with a shallow net and the results on PASCAL VOC action classification benchmark.
v6 最终版ICLR-2015会议论文。增加了网络B与浅层网络的对比以及在PASCAL VOC动作分类基准上的结果。
