ImageNet Classification with Deep Convolutional Neural Networks
Abstract
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
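For readers who want to see the shape of the network the abstract describes, here is a minimal sketch in PyTorch (an assumed modern framework, not the authors' original GPU code). The filter counts, kernel sizes, and the 224x224 input assumption come from the architecture details given later in the paper, not from the abstract itself.

```python
import torch
import torch.nn as nn

# Minimal sketch of the architecture described in the abstract:
# five conv layers (some followed by max-pooling), three FC layers,
# and a final 1000-way softmax. ReLU is the "non-saturating" neuron;
# dropout regularizes the fully-connected layers. Total parameter
# count comes out to roughly 60 million, matching the abstract.
class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Assumes 3x224x224 inputs; padding chosen so the final
        # feature map is 256x6x6.
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),  # softmax is applied by the loss
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```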
1 Introduction
Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.
To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don't have. Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
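To make the "fewer connections and parameters" claim concrete, here is a back-of-the-envelope comparison in Python; the layer sizes are illustrative assumptions, not figures from the paper.

```python
# Illustrative comparison (sizes are assumptions, not from the paper):
# a conv layer's parameter count depends only on kernel size and channel
# counts, because the same weights are shared across all spatial positions,
# while a fully-connected layer over the same tensors scales with the full
# input and output dimensions.
in_ch, out_ch, k = 96, 256, 5   # channel counts and kernel size
h = w = 27                      # spatial size of the feature map

conv_params = out_ch * (in_ch * k * k + 1)                        # weights + biases
fc_params = (in_ch * h * w) * (out_ch * h * w) + out_ch * h * w   # dense weights + biases

print(f"convolutional:   {conv_params:,} parameters")   # ~615 thousand
print(f"fully-connected: {fc_params:,} parameters")     # ~13 billion
```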
Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.

The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly. Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3.
The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model's parameters) resulted in inferior performance.

In the end, the network's size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.
2 The Dataset
ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon's Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.
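As a concrete reading of these two metrics, the sketch below computes top-1 and top-5 error rates from a matrix of class scores; the array names and random data are hypothetical placeholders for a model's actual outputs.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k
    highest-scoring classes (the paper's top-1/top-5 error)."""
    # Indices of the k highest scores for each example.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Hypothetical usage: 'scores' is a (num_images, 1000) array of model
# outputs, 'labels' is a (num_images,) array of true class indices.
scores = np.random.randn(50000, 1000)
labels = np.random.randint(0, 1000, size=50000)
print(top_k_error(scores, labels, 1))  # top-1 error
print(top_k_error(scores, labels, 5))  # top-5 error
```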