R-CNN reading summary (pattern recognition)
R-CNN
[source code](http://www.cs.berkeley.edu/~rbg/rcnn)
Abstract
Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features.
Two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Introduction
Purpose
Before this paper, recognition was mainly based on SIFT and HOG.
Focus on two problems: localizing objects with a deep network, and training a high-capacity model (higher capacity than SIFT- or HOG-like features) with only a small quantity of annotated detection data.
Detection approaches before this paper:
1. Treat detection as a regression problem (poor results in practice).
2. Build a sliding-window detector (the window size is fixed, so localization cannot be accurate).
Method in this paper:
1. using the “recognition using regions” paradigm.
2. supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce.
R-CNN: Regions with CNN features:
1. Input image
2. Extract region proposals (~2K)
3. Compute CNN features(conv and pool)
4. Classify regions with SVM
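The four steps above can be sketched end to end. This is a toy sketch, not the paper's implementation: `propose_regions`, `extract_features`, and the identity SVM weights are all hypothetical stand-ins for selective search, the CNN, and the trained class-specific SVMs.

```python
import numpy as np

def propose_regions(image, n=4):
    # Stand-in for selective search: just tile the image into candidate boxes.
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
            (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)][:n]

def extract_features(image, box):
    # Stand-in for the CNN: crop the region, pool it into a fixed-length vector.
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    return np.array([crop.mean(), crop.std(), x2 - x1, y2 - y1], dtype=float)

def score_regions(features, weights):
    # Stand-in for the class-specific linear SVMs: one dot product per class.
    return features @ weights.T

image = np.arange(64 * 64, dtype=float).reshape(64, 64)
boxes = propose_regions(image)
feats = np.stack([extract_features(image, b) for b in boxes])
weights = np.eye(4)  # 4 hypothetical classes, identity weights for the demo
scores = score_regions(feats, weights)
print(scores.shape)  # one score per (proposal, class) pair
```

The key structural point the sketch preserves: feature extraction is class-agnostic and runs once per proposal, while classification is a cheap per-class linear scoring on top.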
Other features:
1. The computation is simpler than in previous approaches based on region features.
2. A detection analysis tool shows that a simple bounding-box regression method improves the model.
3. R-CNN works well on semantic segmentation problems.
Object detection with R-CNN
Procedure:
Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.
Module in details
- Region proposals: generate category-independent region proposals (selective search is used here to obtain them).
- Feature extraction: a 4096-dimensional feature vector is extracted from each region proposal using a CNN (5 conv layers and 2 fully connected layers). Before being fed into the net, each proposal region is warped from its tight bounding box to the fixed input size; before warping, the box is dilated to include some surrounding context (to keep the shape from being distorted too badly?).
- Scoring: score each extracted feature vector using the trained SVMs (softmax could also be used). Apply greedy non-maximum suppression: when a proposal has an intersection-over-union (IoU) overlap with a higher-scoring proposal larger than a threshold, that proposal is rejected. (This is essentially an initial filtering that removes proposals overlapping a higher-scoring one.)
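The greedy NMS rule above is short enough to write out directly. A minimal sketch (the threshold value and the box format `(x1, y1, x2, y2)` are illustrative choices, not taken from the paper):

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    # Visit boxes from highest score to lowest; keep a box only if its IoU
    # with every already-kept box is at or below the threshold.
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # the near-duplicate of box 0 is suppressed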
This method is faster than the alternatives because: 1. the weights are shared across all images; 2. the feature vectors output by the network's fully connected layers are low-dimensional (4096) compared with other detection methods of the time. (Even so, one pass of proposals plus feature extraction still takes over ten seconds on a GPU.)
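The warping step described above (dilate the tight box for context, then warp anisotropically to the CNN's fixed input size) can be sketched as follows. The 227-pixel input size is AlexNet's; the nearest-neighbour resize and the 16-pixel context amount here are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def warp_proposal(image, box, out_size=227, context=16):
    # Dilate the tight box by `context` pixels on each side (clipped to the
    # image bounds), crop, then warp anisotropically to out_size x out_size.
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1 - context), max(0, y1 - context)
    x2, y2 = min(w, x2 + context), min(h, y2 + context)
    crop = image[y1:y2, x1:x2]
    # Nearest-neighbour resize; a real pipeline would use proper interpolation.
    ys = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    xs = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[np.ix_(ys, xs)]

patch = warp_proposal(np.zeros((480, 640)), (100, 100, 200, 160))
print(patch.shape)  # every proposal, whatever its aspect ratio, becomes 227x227
```

Since the warp is anisotropic, a wide box and a tall box both end up as the same square input, which is why some distortion is unavoidable and context padding helps.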
Net training
Supervised pre-training
Pre-training is done on ILSVRC 2012, producing an ordinary 1000-class classification model.
Supervised Domain-specific fine-tuning
Training continues on top of the pre-trained weights using the VOC dataset, with SGD. The only architectural change is replacing the previous 1000-way prediction output with a randomly initialized 21-way output, since VOC has 20 classes plus 1 for background. The learning rate is set to 1/10 of the pre-training rate, to avoid completely disrupting the pre-trained result.
Each iteration takes as input a batch of size 128: 32 positive windows and 96 background windows.
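The biased minibatch construction above (32 positives, 96 backgrounds out of 128, because positives are rare relative to background) can be sketched like this; the `sample_minibatch` helper and the labeled-window lists are hypothetical illustrations:

```python
import random

def sample_minibatch(positives, backgrounds, n_pos=32, n_bg=96, rng=None):
    # Uniformly sample 32 positive and 96 background windows, then shuffle
    # them together into one SGD minibatch of 128.
    rng = rng or random.Random(0)
    batch = rng.sample(positives, n_pos) + rng.sample(backgrounds, n_bg)
    rng.shuffle(batch)
    return batch

# Hypothetical pools of labeled windows from the training images.
positives = [("pos", i) for i in range(500)]
backgrounds = [("bg", i) for i in range(5000)]
batch = sample_minibatch(positives, backgrounds)
print(len(batch), sum(1 for kind, _ in batch if kind == "pos"))  # 128 32
```

Without this bias, a uniformly sampled batch would be almost entirely background and the gradient signal from positives would be drowned out.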
* to be continued… *