Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Computing proposals with convolutional layers
On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid.
In contrast to earlier pyramid-based approaches, the authors introduce anchors, which serve as references at multiple scales and aspect ratios.
Training scheme: alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed.
Model
Faster R-CNN
Region Proposal Networks
An RPN takes an image as input and outputs a set of rectangular object proposals, each with an objectness score.
To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer.
This small network takes as input an n × n spatial window of the input convolutional feature map.
Each sliding window is mapped to a lower-dimensional feature.
This feature is fed into two sibling fc layers: a box-regression layer and a box-classification layer.
We use n=3 in this paper.
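The small network above can be sketched in pure Python. This is a minimal illustration, not the authors' implementation: it assumes a ReLU after the intermediate layer and 2k scores plus 4k box coordinates per position (both as described in the paper, not in these notes), uses zero padding at the borders, and omits biases. The function names and the toy dimensions are hypothetical.

```python
def sliding_windows(feat, n=3):
    """Yield (y, x, flat_window) for every position of an H x W x C map.

    The map is zero-padded so that every position gets a full n x n window.
    """
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    pad = n // 2
    for y in range(h):
        for x in range(w):
            window = []
            for dy in range(-pad, pad + 1):
                for dx in range(-pad, pad + 1):
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < h and 0 <= xx < w
                    window.extend(feat[yy][xx] if inside else [0.0] * c)
            yield y, x, window


def linear(vec, weights):
    """Dense layer without bias; weights holds one row per output unit."""
    return [sum(v * wt for v, wt in zip(vec, row)) for row in weights]


def rpn_head(feat, w_mid, w_cls, w_reg, n=3):
    """Apply the same small network at every sliding-window position.

    w_mid maps the n*n*C window to a lower-dimensional feature,
    w_cls maps that feature to 2k scores, w_reg maps it to 4k coordinates.
    """
    out = {}
    for y, x, window in sliding_windows(feat, n):
        hidden = [max(0.0, v) for v in linear(window, w_mid)]  # ReLU
        out[(y, x)] = (linear(hidden, w_cls), linear(hidden, w_reg))
    return out
```

Because the same weights are reused at every position, this is equivalent to an n × n convolution followed by two sibling 1 × 1 convolutions, which is how the layers are usually implemented in practice.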
Anchors
At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k.
An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position.
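A short sketch of how the k = 9 anchors at one sliding position could be generated. The default scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) follow the paper rather than these notes. The function name `make_anchors` and the convention that a ratio means height / width are assumptions made for illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centered at (cx, cy).

    Each anchor is (x1, y1, x2, y2). A ratio is height / width, and every
    anchor of scale s keeps an area of s * s regardless of its ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Calling `make_anchors` with a different center shifts all nine boxes by the same offset, so the anchor set looks identical at every sliding position.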
Translation-Invariant Anchors
Both the anchors and the functions that compute proposals relative to the anchors are translation-invariant.
Multi-Scale Anchors as Regression References
Our method is built on a pyramid of anchors.
Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios.
It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size.
Loss Function
For training RPNs, we assign a binary class label (of being an object or not) to each anchor.
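The label assignment can be sketched as follows. The rule used here is the one given in the paper, not in these notes: an anchor is positive if its IoU with any ground-truth box exceeds 0.7 or if it has the highest IoU for some ground-truth box, negative if its IoU is below 0.3 for all ground-truth boxes, and ignored otherwise. The function names and the use of -1 for ignored anchors are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_labels(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 = object, 0 = background, -1 = ignored."""
    overlaps = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in overlaps:
        best = max(row, default=0.0)
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # contributes nothing to the training objective
    # The anchor(s) with the highest IoU for each ground-truth box are
    # positive even when that IoU does not exceed pos_thresh.
    for j in range(len(gt_boxes)):
        column = [row[j] for row in overlaps]
        top = max(column, default=0.0)
        if top > 0:
            for i, value in enumerate(column):
                if value == top:
                    labels[i] = 1
    return labels
```

The second rule guarantees that every ground-truth box gets at least one positive anchor, even when no anchor overlaps it strongly.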
$t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is that of the ground-truth box associated with a positive anchor.
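The 4 parameterized coordinates are defined relative to an anchor. The sketch below uses the parameterization given in the paper (center offsets normalized by the anchor size, and log-space width and height ratios), which these notes do not reproduce. The helper names `to_center`, `encode`, and `decode` are hypothetical.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def encode(box, anchor):
    """Compute (t_x, t_y, t_w, t_h) of a box relative to an anchor."""
    x, y, w, h = to_center(box)
    xa, ya, wa, ha = to_center(anchor)
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(t, anchor):
    """Invert encode: recover (x1, y1, x2, y2) from t and the anchor."""
    xa, ya, wa, ha = to_center(anchor)
    x, y = t[0] * wa + xa, t[1] * ha + ya
    w, h = wa * math.exp(t[2]), ha * math.exp(t[3])
    return (x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h)
```

Applying `encode` to a ground-truth box gives the regression target for its positive anchor, while `decode` turns the network's predicted vector back into a proposal box.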
Training RPNs
It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate.
Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1.
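A minimal sketch of this sampling step, assuming labels of 1 (positive), 0 (negative), and -1 (ignored) per anchor. The padding with negatives when an image has fewer than 128 positives follows the paper. The function name `sample_minibatch` and the label encoding are illustrative assumptions.

```python
import random


def sample_minibatch(labels, batch_size=256, rng=random):
    """Pick anchor indices for one mini-batch from a single image.

    At most half of the batch is positive. If there are too few positives,
    the remainder is filled with negatives. Ignored anchors are never picked.
    """
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n_pos = min(len(pos), batch_size // 2)
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.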