Authors
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
Abstract
Existing CNNs require a fixed-size input image. This requirement may reduce the recognition accuracy for images or sub-images of arbitrary size/scale. In this work, SPP-net generates a fixed-length representation regardless of image size/scale. For object detection, this method avoids repeatedly computing the convolutional features and is 24-102x faster than R-CNN.
1 Introduction
A CNN mainly consists of two parts: convolutional layers and fully-connected layers. (Note that the output feature maps represent the spatial arrangement of the activations.) The fixed size/length constraint comes from the fc layers, not the convolutional layers. In this paper, an SPP layer is added on top of the last convolutional layer.
SPP, as an extension of the BoW model, has some remarkable properties for CNNs:
1. generates a fixed-length output regardless of the input size
2. uses multi-level spatial bins, which makes it robust to object deformations
3. can pool features extracted at variable scales
# for SPP and BoW
K. Grauman and T. Darrell, "The pyramid match kernel: Discriminative classification with sets of image features," in ICCV, 2005.
S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006.
SPP-net allows us to feed images of varying sizes, and experiments show that multi-size training converges just like traditional single-size training and leads to better testing accuracy.
SPP improves four different CNN architectures, and it might improve more sophisticated convolutional architectures as well.
For object detection, we can run the convolutional layers only once on the whole image and then extract per-proposal features from the feature maps with SPP-net. With the recent fast proposal method of EdgeBoxes, our system takes 0.5 seconds to process an image.
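A hedged sketch of this pipeline (all names here are placeholders, not the paper's code: `conv_layers` and `spp_pool` are callables, e.g. `spp_pool` could be the `spp_layer` sketch under 2.2.1 below; `stride=16` is an illustrative total subsampling factor, and the window-to-feature-map projection is simplified relative to the paper's appendix):

```python
def detect_features(image, conv_layers, spp_pool, proposals, stride=16):
    """Compute conv features once, then SPP-pool each proposal window.

    conv_layers: callable returning a (C, H, W) feature map for the image
    spp_pool:    callable mapping a (C, h, w) window to a fixed-length vector
    proposals:   (x0, y0, x1, y1) boxes in image coordinates (e.g. EdgeBoxes)
    stride:      total subsampling factor of the conv layers (illustrative)

    Contrast with R-CNN, which re-runs the full CNN on every warped proposal.
    """
    fmap = conv_layers(image)  # convolutional layers run only once
    feats = []
    for (x0, y0, x1, y1) in proposals:
        # project the image-space window onto the conv feature map
        fx0, fy0 = x0 // stride, y0 // stride
        fx1 = max(fx0 + 1, -(-x1 // stride))  # ceil division, at least 1 cell
        fy1 = max(fy0 + 1, -(-y1 // stride))
        feats.append(spp_pool(fmap[:, fy0:fy1, fx0:fx1]))
    return feats
```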
2 Deep networks with SPP
2.1 feature map
The convolutional layers use sliding filters, and their outputs have roughly the same aspect ratio as the inputs, so the feature maps encode not only the strength of the responses but also their spatial positions.
2.2 training the network
2.2.1 single-size training
SPP layer: for a conv feature map of size a×a and a pyramid level with n×n bins, window size = ⌈a/n⌉ and stride = ⌊a/n⌋.
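A minimal numpy sketch of this pooling (the {4, 2, 1} pyramid and max pooling are illustrative choices, and the window/stride formula is applied per dimension so non-square maps also work):

```python
import math
import numpy as np

def spp_layer(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling over a (C, h, w) conv feature map.

    For a pyramid level with n x n bins, each dimension uses
    window = ceil(size / n) and stride = floor(size / n), so the bins
    tile the map. The output length, C * sum(n*n), is fixed and
    independent of h and w.
    """
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        wh, sh = math.ceil(h / n), math.floor(h / n)  # window/stride (rows)
        ww, sw = math.ceil(w / n), math.floor(w / n)  # window/stride (cols)
        for i in range(n):
            for j in range(n):
                bin_ = feature_map[:, i * sh: i * sh + wh,
                                      j * sw: j * sw + ww]
                pooled.append(bin_.max(axis=(1, 2)))  # max-pool each bin
    return np.concatenate(pooled)  # length: C * sum(n*n for n in levels)
```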
2.2.2 multi-size training
We consider a set of predefined sizes (180×180 and 224×224): rather than cropping a smaller region, we resize the 224×224 training region to 180×180, so the two networks share all parameters and differ only in their SPP bin sizes.
Note that the above single-/multi-size solutions are for training only; at the testing stage, it is straightforward to apply SPP-net to images of any size.
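Using the `spp_layer` sketch above: the output length depends only on the pyramid, not on the feature-map size. (The 13×13 and 10×10 map sizes are roughly what 224×224 and 180×180 inputs produce at conv5 in the paper's setting; 256 channels is an assumption.)

```python
import numpy as np

fm_224 = np.random.rand(256, 13, 13)  # conv5-style map from a 224x224 input
fm_180 = np.random.rand(256, 10, 10)  # conv5-style map from a 180x180 input

v224 = spp_layer(fm_224)
v180 = spp_layer(fm_180)
assert v224.shape == v180.shape == (256 * (16 + 4 + 1),)  # fixed length 5376
```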
3 SPP-net for image classification
3.1 Experiments on ImageNet 2012 classification
3.1.1 baseline architectures
The advantages of SPP are independent of the CNN architectures used; Table 1 shows that SPP improves the accuracy of all four architectures.
3.1.2 multi-level pooling improves accuracy
In Table 2 we show the results of single-size training, with the SPP layer replacing the regular pooling layer.
It is worth noticing that the gain of multi-level pooling comes from its robustness to the variance in object deformations and spatial layout.
3.1.3 multi-size training improves accuracy
Table 2 (c) shows the results of multi-size training.
3.1.4 full-image representations improve accuracy
full-image view: resize the image so that min(w, h) = 256
single view: center 224×224 crop
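A minimal Pillow sketch of these two test-time views (256 and 224 come from the definitions above; taking the crop from the 256-short-side resize is an assumption about the exact protocol):

```python
from PIL import Image

def full_image_view(img, short_side=256):
    """Resize so that min(w, h) == short_side, keeping the aspect ratio."""
    w, h = img.size
    scale = short_side / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))

def center_crop_view(img, short_side=256, size=224):
    """Center size x size crop of the short-side-resized image."""
    img = full_image_view(img, short_side)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

# usage: img = Image.open("photo.jpg"); center_crop_view(img)
```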
The comparisons are in Table 3, which shows that even though our network is trained using square images only, it generalizes well to other aspect ratios.
Comparing Tables 2 and 3, we find that the combination of multiple views is substantially better than the single full-image view.
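Combining views is typically plain score averaging; a hedged sketch (`model` is assumed to map one view to a vector of class scores):

```python
import numpy as np

def multi_view_score(model, views):
    """Average the model's class scores over a list of views
    (crops, horizontal flips, full-image views)."""
    return np.mean([model(view) for view in views], axis=0)
```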
Merits of full-image representations:
1. even when combining dozens of views, two additional full-image views (with flipping) can still boost the accuracy by about 0.2%
2. the full-image view is methodologically consistent with traditional methods
3. in other applications such as image retrieval, an image representation, rather than a classification score, is required for similarity ranking, so a full-image representation is preferable (a sketch follows this list)
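For merit 3, a sketch of similarity ranking with full-image features (the query/gallery feature vectors would come from the network up to the SPP or fc output; cosine similarity is an illustrative choice, not the paper's protocol):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by cosine similarity to the query."""
    def unit(v):
        return v / (np.linalg.norm(v) + 1e-12)
    q = unit(query_feat)
    sims = np.array([unit(g) @ q for g in gallery_feats])
    return np.argsort(-sims)
```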
# image retrieval
H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid, "Aggregating local image descriptors into compact codes," TPAMI, 2012.