Authors
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Abstract
Existing CNNs require a fixed-size input image. This requirement may reduce the recognition accuracy for images or sub-images of arbitrary size/scale. In this work, SPP-net generates a fixed-length representation regardless of image size/scale. For object detection, this method avoids repeatedly computing the convolutional features and is 24-102x faster than R-CNN.
1 Introduction
A CNN mainly consists of two parts: convolutional layers and fully-connected layers. (Note that the output feature maps represent the spatial arrangement of the activations.) The fixed size/length constraint comes from the fc layers, not the convolutional layers. In this paper, an SPP layer is added on top of the last convolutional layer.
SPP, as an extension of the BoW model, has some remarkable properties for CNNs:
1. generates a fixed-length output regardless of the input size
2. uses multi-level spatial bins, which makes it robust to object deformations
3. can pool features extracted at variable scales
# for SPP and BoW
K. Grauman and T. Darrell, "The pyramid match kernel: Discriminative classification with sets of image features," in ICCV, 2005.
S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006.
SPP-net allows us to feed images of varying sizes, and experiments show that multi-size training converges just like traditional single-size training and leads to better testing accuracy.
SPP improves four different CNN architectures, and it might improve more sophisticated convolutional architectures as well.
For object detection, we can run the convolutional layers only once on the whole image and then extract per-proposal features from the feature maps with SPP-net. With the recent fast proposal method of EdgeBoxes, our system takes 0.5 seconds to process an image.
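A hedged sketch of this pipeline (all names here are placeholders, not the paper's code: `conv_layers` and `spp_pool` are callables, e.g. `spp_pool` could be the `spp_layer` sketch under 2.2.1 below; `stride=16` is an illustrative total subsampling factor, and the window-to-feature-map projection is simplified relative to the paper's appendix):

```python
def detect_features(image, conv_layers, spp_pool, proposals, stride=16):
    """Compute conv features once, then SPP-pool each proposal window.

    conv_layers: callable returning a (C, H, W) feature map for the image
    spp_pool:    callable mapping a (C, h, w) window to a fixed-length vector
    proposals:   (x0, y0, x1, y1) boxes in image coordinates (e.g. EdgeBoxes)
    stride:      total subsampling factor of the conv layers (illustrative)

    Contrast with R-CNN, which re-runs the full CNN on every warped proposal.
    """
    fmap = conv_layers(image)  # convolutional layers run only once
    feats = []
    for (x0, y0, x1, y1) in proposals:
        # project the image-space window onto the conv feature map
        fx0, fy0 = x0 // stride, y0 // stride
        fx1 = max(fx0 + 1, -(-x1 // stride))  # ceil division, at least 1 cell
        fy1 = max(fy0 + 1, -(-y1 // stride))
        feats.append(spp_pool(fmap[:, fy0:fy1, fx0:fx1]))
    return feats
```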
2 Deep networks with SPP
2.1 feature map
The convolutional layers use sliding filters, and their outputs have roughly the same aspect ratio as the inputs, so the feature maps encode not only the strength of the responses but also their spatial positions.
2.2 training the network
2.2.1 single-size training
SPP layer: for a conv feature map of size a×a and a pyramid level with n×n bins, window size = ⌈a/n⌉ and stride = ⌊a/n⌋.
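A minimal numpy sketch of this pooling (the {4, 2, 1} pyramid and max pooling are illustrative choices, and the window/stride formula is applied per dimension so non-square maps also work):

```python
import math
import numpy as np

def spp_layer(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling over a (C, h, w) conv feature map.

    For a pyramid level with n x n bins, each dimension uses
    window = ceil(size / n) and stride = floor(size / n), so the bins
    tile the map. The output length, C * sum(n*n), is fixed and
    independent of h and w.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        wh, sh = math.ceil(h / n), math.floor(h / n)  # window/stride (rows)
        ww, sw = math.ceil(w / n), math.floor(w / n)  # window/stride (cols)
        for i in range(n):
            for j in range(n):
                bin_ = feature_map[:, i * sh: i * sh + wh,
                                      j * sw: j * sw + ww]
                pooled.append(bin_.max(axis=(1, 2)))  # max-pool each bin
    return np.concatenate(pooled)  # length: C * sum(n*n for n in levels)
```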
2.2.2 multi-size training
We consider a set of predefined sizes (180×180 and 224×224): rather than cropping a smaller region, we resize the 224×224 training region to 180×180, so the two networks share all parameters and differ only in their SPP bin sizes.
Note that the above single-/multi-size solutions are for training only; at the testing stage, it is straightforward to apply SPP-net to images of any size.
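Using the `spp_layer` sketch above: the output length depends only on the pyramid, not on the feature-map size. (The 13×13 and 10×10 map sizes are roughly what 224×224 and 180×180 inputs produce at conv5 in the paper's setting; 256 channels is an assumption.)

```python
import numpy as np

fm_224 = np.random.rand(256, 13, 13)  # conv5-style map from a 224x224 input
fm_180 = np.random.rand(256, 10, 10)  # conv5-style map from a 180x180 input

v224 = spp_layer(fm_224)
v180 = spp_layer(fm_180)
assert v224.shape == v180.shape == (256 * (16 + 4 + 1),)  # fixed length 5376
```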
3 SPP-net for image classification
3.1 Experiments on ImageNet 2012 classification
3.1.1 baseline architectures
The advantages of SPP are independent of the CNN architectures used; Table 1 shows that SPP improves the accuracy of all four architectures.
3.1.2 multi-level pooling improves accuracy
In Table 2 we show the results of single-size training, with the SPP layer replacing the regular pooling layer.
It is worth noticing that the gain of multi-level pooling comes from its robustness to the variance in object deformations and spatial layout.
3.1.3 multi-size training improves accuracy
Table 2 (c) shows the results of multi-size training.
3.1.4 full-image representations improve accuracy
full-image view: resize the image so that min(w, h) = 256
single view: center 224×224 crop
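A minimal Pillow sketch of these two test-time views (256 and 224 come from the definitions above; taking the crop from the 256-short-side resize is an assumption about the exact protocol):

```python
from PIL import Image

def full_image_view(img, short_side=256):
    """Resize so that min(w, h) == short_side, keeping the aspect ratio."""
    w, h = img.size
    scale = short_side / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))

def center_crop_view(img, short_side=256, size=224):
    """Center size x size crop of the short-side-resized image."""
    img = full_image_view(img, short_side)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

# usage: img = Image.open("photo.jpg"); center_crop_view(img)
```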
The comparisons are in Table 3, which shows that even though our network is trained using square images only, it generalizes well to other aspect ratios.
Comparing Tables 2 and 3, we find that the combination of multiple views is substantially better than the single full-image view.
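Combining views is typically plain score averaging; a hedged sketch (`model` is assumed to map one view to a vector of class scores):

```python
import numpy as np

def multi_view_score(model, views):
    """Average the model's class scores over a list of views
    (crops, horizontal flips, full-image views)."""
    return np.mean([model(view) for view in views], axis=0)
```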
Merits of full-image representations:
1. even when combining dozens of views, two additional full-image views (with flipping) can still boost the accuracy by about 0.2%
2. the full-image view is methodologically consistent with traditional methods
3. in other applications such as image retrieval, an image representation, rather than a classification score, is required for similarity ranking, so a full-image representation is preferable (a sketch follows this list)
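For merit 3, a sketch of similarity ranking with full-image features (the query/gallery feature vectors would come from the network up to the SPP or fc output; cosine similarity is an illustrative choice, not the paper's protocol):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by cosine similarity to the query."""
    def unit(v):
        return v / (np.linalg.norm(v) + 1e-12)
    q = unit(query_feat)
    sims = np.array([unit(g) @ q for g in gallery_feats])
    return np.argsort(-sims)
```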
# image retrieval
H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, "Aggregating local image descriptors into compact codes," TPAMI, 2012.