Unlike image classification, object detection must handle an arbitrary number of targets of arbitrary sizes in a single image, which conflicts with fully connected layers accepting only fixed-size inputs. R-CNN first extracts region proposals (roughly 2000 per image) and crops and warps each one to a fixed size; every proposal is then fed through a convolutional network for classification and bounding-box regression. This is clearly slow and inefficient: the proposals overlap heavily, so the convolutional features extracted from them are largely redundant.
With RoI pooling, Fast R-CNN can reuse convolutional features. Since all proposals lie on the same image, Fast R-CNN convolves the whole image once, and RoI pooling then extracts a fixed-size feature map for each proposal. In other words, R-CNN crops on the original image, while Fast R-CNN crops on the feature map.
(Figure: the Fast R-CNN network architecture. Source: Region of interest pooling in TensorFlow – example.)
Fast R-CNN thus shares the feature map across proposals, so feature extraction is computed only once; RoI pooling then pools adaptively so that the downstream layers always see inputs of one fixed size.
RoI pooling
Region-of-interest pooling (RoI pooling) is an operation widely used in deep-learning object detectors. Its purpose is to max-pool inputs of unequal size into fixed-size feature maps.
RoI pooling crops the convolutional feature map according to each proposal, then resizes every crop to a fixed size (14×14×convdepth) via interpolation (usually bilinear). After cropping, max pooling with a 2×2 kernel yields the final 7×7×convdepth feature map for each proposal.
The RoI pooling layer takes two inputs:
- A feature map produced by the feature-extraction network (a stack of convolutional and max-pooling layers).
- A list of regions of interest (an $N \times 5$ matrix, where $N$ is the number of RoIs).
The first column of the list is the image index; the remaining four are the coordinates of the region's top-left and bottom-right corners:
$[i_{batch}, x_{min}, y_{min}, x_{max}, y_{max}]$
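For instance (an illustrative row, not from the source), $[0, 296, 192, 512, 404]$ selects, from the first image of the batch, the box with top-left corner (296, 192) and bottom-right corner (512, 404).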
What does RoI pooling actually do? For every region of interest in the input list, it takes the corresponding part of the feature map and scales it to a predefined size (e.g., 7×7). The scaling is done by:
- dividing the candidate region into sections of equal size (as many as the output has cells);
- finding the per-channel maximum in each section;
- copying these maxima to the output buffer.
As a result, from a set of rectangles of various sizes we quickly obtain correspondingly fixed-size feature maps. Notably, the output dimension of RoI pooling depends neither on the size of the input feature map nor on the size of the region proposal; it is determined solely by the number of sections we divide the region into. A minimal sketch of the procedure follows.
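Here is that procedure for a single channel in plain C++ (the function name and interface are made up for illustration, and the box is assumed to be given in feature-map coordinates already):

#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

// Max-pool one box of a height x width single-channel feature map
// (row-major) into a pooled_h x pooled_w grid.
std::vector<float> RoiPoolSingleChannel(
    const std::vector<float>& feature, int height, int width,
    int x1, int y1, int x2, int y2, int pooled_h, int pooled_w) {
  // Step 1: split the box into pooled_h x pooled_w sections.
  const int roi_h = std::max(y2 - y1 + 1, 1);
  const int roi_w = std::max(x2 - x1 + 1, 1);
  const float bin_h = static_cast<float>(roi_h) / pooled_h;
  const float bin_w = static_cast<float>(roi_w) / pooled_w;
  std::vector<float> out(pooled_h * pooled_w, 0.f);
  for (int ph = 0; ph < pooled_h; ++ph) {
    for (int pw = 0; pw < pooled_w; ++pw) {
      // Section boundaries are rounded outward, then clipped to the map.
      const int hs = std::min(std::max(y1 + static_cast<int>(std::floor(ph * bin_h)), 0), height);
      const int he = std::min(std::max(y1 + static_cast<int>(std::ceil((ph + 1) * bin_h)), 0), height);
      const int ws = std::min(std::max(x1 + static_cast<int>(std::floor(pw * bin_w)), 0), width);
      const int we = std::min(std::max(x1 + static_cast<int>(std::ceil((pw + 1) * bin_w)), 0), width);
      // Steps 2-3: take the section's maximum and write it to the output.
      float best = (he <= hs || we <= ws) ? 0.f : -FLT_MAX;
      for (int h = hs; h < he; ++h)
        for (int w = ws; w < we; ++w)
          best = std::max(best, feature[h * width + w]);
      out[ph * pooled_w + pw] = best;
    }
  }
  return out;
}

A real detector runs this for every channel of every RoI, which is exactly what the Caffe2 operator analyzed below does.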
RoIPoolOp in Caffe2
The constructor reads the arguments is_test_, order_, pooled_height_, pooled_width_, and spatial_scale_ using StringToStorageOrder and GetSingleArgument:
RoIPoolOp(const OperatorDef& operator_def, Workspace* ws)
: Operator<Context>(operator_def, ws),
is_test_(OperatorBase::GetSingleArgument<int>(OpSchema::Arg_IsTest, 0)),
order_(StringToStorageOrder(
OperatorBase::GetSingleArgument<string>("order", "NCHW"))),
pooled_height_(OperatorBase::GetSingleArgument<int>("pooled_h", 1)),
pooled_width_(OperatorBase::GetSingleArgument<int>("pooled_w", 1)),
spatial_scale_(
OperatorBase::GetSingleArgument<float>("spatial_scale", 1.)) {
CAFFE_ENFORCE(
(is_test_ && OutputSize() == 1) || (!is_test_ && OutputSize() == 2),
"Output size mismatch.");
CAFFE_ENFORCE_GT(spatial_scale_, 0);
CAFFE_ENFORCE_GT(pooled_height_, 0);
CAFFE_ENFORCE_GT(pooled_width_, 0);
CAFFE_ENFORCE_EQ(
order_, StorageOrder::NCHW, "Only NCHW order is supported right now.");
}
RoIPoolOp<float, CPUContext>::RunOnDevice()
The inputs are the feature map and the RoIs. A records the index of each selected maximum; it is only needed during training.
const auto& X = Input(0); // Input data to pool
const auto& R = Input(1); // RoIs
auto* Y = Output(0); // RoI pooled data
auto* A = is_test_ ? nullptr : Output(1); // argmaxes
Each RoI has the format [batch_index, x1, y1, x2, y2], so the RoI blob has shape $N \times 5$.
// Each ROI is of the form [batch_index x1 y1 x2 y2]
CAFFE_ENFORCE_EQ(R.dim32(1), 5);
Read off the feature-map dimensions and the number of RoIs.
// TODO: Handle the storage_order properly to get the NCWH.
int batch_size = X.dim32(0);
int channels = X.dim32(1);
int height = X.dim32(2);
int width = X.dim32(3);
int num_rois = R.dim32(0);
Set the output shapes.
Y->Resize(num_rois, channels, pooled_height_, pooled_width_);
if (!is_test_) {
A->Resize(Y->dims());
}
Get the raw data pointers.
const float* Xdata = X.data<float>();
const float* rois = R.data<float>();
float* Ydata = Y->mutable_data<float>();
int* argmax_data = is_test_ ? nullptr : A->mutable_data<int>();
A for loop then processes each RoI, first recovering its batch index and its box extent scaled to feature-map coordinates.
// For each ROI R = [batch_index y1 x1 y2 x2]: max pool over R
for (int n = 0; n < num_rois; ++n) {
int roi_batch_id = rois[0];
int roi_start_w = round(rois[1] * spatial_scale_);
int roi_start_h = round(rois[2] * spatial_scale_);
int roi_end_w = round(rois[3] * spatial_scale_);
int roi_end_h = round(rois[4] * spatial_scale_);
CAFFE_ENFORCE_GE(roi_batch_id, 0);
CAFFE_ENFORCE_LT(roi_batch_id, batch_size);
Force malformed RoIs to be at least 1×1.
// Force malformed ROIs to be 1x1
int roi_height = max(roi_end_h - roi_start_h + 1, 1);
int roi_width = max(roi_end_w - roi_start_w + 1, 1);
Compute the bin size; floats are used here so the division does not truncate.
const float bin_size_h =
static_cast<float>(roi_height) / static_cast<float>(pooled_height_);
const float bin_size_w =
static_cast<float>(roi_width) / static_cast<float>(pooled_width_);
Locate this RoI's image within the batch of feature maps.
const float* batch_data = Xdata + roi_batch_id * X.size_from_dim(1);
The RoI is applied to every channel of the feature map. When selecting the pixels of a bin, the boundaries are rounded outward (floor for the start, ceil for the end), so adjacent bins may overlap: with bin_size_h = 2.5, for example, bin 0 covers rows [0, 3) while bin 1 covers rows [2, 5).
for (int c = 0; c < channels; ++c) {
for (int ph = 0; ph < pooled_height_; ++ph) {
for (int pw = 0; pw < pooled_width_; ++pw) {
// Compute pooling region for this output unit:
// start (included) = floor(ph * roi_height / pooled_height_)
// end (excluded) = ceil((ph + 1) * roi_height / pooled_height_)
int hstart =
static_cast<int>(floor(static_cast<float>(ph) * bin_size_h));
int wstart =
static_cast<int>(floor(static_cast<float>(pw) * bin_size_w));
int hend =
static_cast<int>(ceil(static_cast<float>(ph + 1) * bin_size_h));
int wend =
static_cast<int>(ceil(static_cast<float>(pw + 1) * bin_size_w));
Add the RoI offsets and clip to the input boundaries.
// Add roi offsets and clip to input boundaries
hstart = min(max(hstart + roi_start_h, 0), height);
hend = min(max(hend + roi_start_h, 0), height);
wstart = min(max(wstart + roi_start_w, 0), width);
wend = min(max(wend + roi_start_w, 0), width);
pool_index is the index of this cell within the pooled output.
const int pool_index = ph * pooled_width_ + pw;
If the pooling region is empty, the result is set to 0; otherwise it is initialized to -FLT_MAX so that any real value compares greater. argmax_data defaults to -1.
// Define an empty pooling region to be zero
bool is_empty = (hend <= hstart) || (wend <= wstart);
Ydata[pool_index] = is_empty ? 0 : -FLT_MAX;
if (!is_test_) {
// If nothing is pooled, argmax = -1 causes nothing to be backprop'd
argmax_data[pool_index] = -1;
}
Take the maximum over the pooling region and record its index.
for (int h = hstart; h < hend; ++h) {
for (int w = wstart; w < wend; ++w) {
const int index = h * width + w;
if (batch_data[index] > Ydata[pool_index]) {
Ydata[pool_index] = batch_data[index];
if (!is_test_) {
argmax_data[pool_index] = index;
}
}
}
}
}
}
Advance all data pointers by one channel: X.size_from_dim(2) is height × width for the input, while Y->size_from_dim(2) is pooled_height_ × pooled_width_ for the output.
// Increment all data pointers by one channel
batch_data += X.size_from_dim(2);
Ydata += Y->size_from_dim(2);
if (!is_test_) {
argmax_data += A->size_from_dim(2);
}
}
Advance to the next RoI; each row holds R.size_from_dim(1) = 5 values.
// Increment ROI data pointer
rois += R.size_from_dim(1);
}
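To make the arithmetic concrete, here is a worked pass over the illustrative RoI from earlier (all numbers are made up):

// Hypothetical setup: spatial_scale_ = 1/16 (feature map downsampled 16x),
// pooled_height_ = pooled_width_ = 7, RoI row = [0, 296, 192, 512, 404].
// roi_start_w = round(296 / 16) = round(18.5)  = 19
// roi_start_h = round(192 / 16) = round(12.0)  = 12
// roi_end_w   = round(512 / 16) = round(32.0)  = 32
// roi_end_h   = round(404 / 16) = round(25.25) = 25
// roi_width   = 32 - 19 + 1 = 14;  roi_height = 25 - 12 + 1 = 14
// bin_size_h  = bin_size_w = 14 / 7 = 2.0, so each output cell max-pools an
// exact 2x2 patch; with sizes that do not divide evenly, the floor/ceil
// rounding would make neighboring patches overlap instead.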
RoIPoolOp<float, CUDAContext>::RunOnDevice()
Fetch the input and output blobs.
template <>
bool RoIPoolOp<float, CUDAContext>::RunOnDevice() {
auto& X = Input(0); // Input data to pool
auto& R = Input(1); // RoIs
auto* Y = Output(0); // RoI pooled data
auto* A = is_test_ ? nullptr : Output(1); // argmaxes
Handle the case where the RoI list is empty.
// Handle empty rois
if (R.size() == 0) {
Y->Resize(0, X.dim32(1), pooled_height_, pooled_width_);
// mutable_data calls are needed to allocate the tensors
Y->mutable_data<float>();
if (!is_test_) {
A->Resize(Y->dims());
A->mutable_data<int>();
}
return true;
}
Set the shapes of the outputs Y and A.
Y->Resize(R.dim32(0), X.dim32(1), pooled_height_, pooled_width_);
if (!is_test_) {
A->Resize(Y->dims());
}
int output_size = Y->size();
int* argmax_data = is_test_ ? nullptr : A->mutable_data<int>();
Launch the kernel ROIPoolForward; CAFFE_GET_BLOCKS chooses enough blocks to cover output_size elements at CAFFE_CUDA_NUM_THREADS threads per block (a sketch of it follows the launch code).
ROIPoolForward<float><<<
CAFFE_GET_BLOCKS(output_size),
CAFFE_CUDA_NUM_THREADS,
0,
context_.cuda_stream()>>>(
output_size,
X.data<float>(),
spatial_scale_,
X.dim32(1),
X.dim32(2),
X.dim32(3),
pooled_height_,
pooled_width_,
R.data<float>(),
Y->mutable_data<float>(),
argmax_data);
return true;
}
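For reference, CAFFE_GET_BLOCKS is essentially a round-up division capped at a maximum grid size. Paraphrased from memory (not a verbatim quote of caffe2/core/common_gpu.h):

// Enough blocks to cover N elements at CAFFE_CUDA_NUM_THREADS threads per
// block, clamped to at least 1 and at most a fixed maximum; the grid-stride
// loop inside the kernel covers whatever a capped grid cannot.
inline int CAFFE_GET_BLOCKS(const int N) {
  return std::max(
      std::min(
          (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS,
          CAFFE_MAXIMUM_NUM_BLOCKS),
      1);
}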
ROIPoolForward
Each thread first derives which output element it is responsible for; that is, one thread computes one point of the pooled output.
template <typename T>
__global__ void ROIPoolForward(
const int nthreads,
const T* bottom_data,
const T spatial_scale,
const int channels,
const int height,
const int width,
const int pooled_height,
const int pooled_width,
const T* bottom_rois,
T* top_data,
int* argmax_data) {
CUDA_1D_KERNEL_LOOP(index, nthreads) {
// (n, c, ph, pw) is an element in the pooled output
int pw = index % pooled_width;
int ph = (index / pooled_width) % pooled_height;
int c = (index / pooled_width / pooled_height) % channels;
int n = index / pooled_width / pooled_height / channels;
Compute the RoI's coordinate range on the feature map.
const T* offset_bottom_rois = bottom_rois + n * 5;
int roi_batch_ind = offset_bottom_rois[0];
int roi_start_w = roundf(offset_bottom_rois[1] * spatial_scale);
int roi_start_h = roundf(offset_bottom_rois[2] * spatial_scale);
int roi_end_w = roundf(offset_bottom_rois[3] * spatial_scale);
int roi_end_h = roundf(offset_bottom_rois[4] * spatial_scale);
Compute the size of each bin. Note that this step and the previous one are recomputed by every concurrent thread, but that redundancy seems hard to avoid.
// Force malformed ROIs to be 1x1
int roi_width = max(roi_end_w - roi_start_w + 1, 1);
int roi_height = max(roi_end_h - roi_start_h + 1, 1);
T bin_size_h = static_cast<T>(roi_height) / static_cast<T>(pooled_height);
T bin_size_w = static_cast<T>(roi_width) / static_cast<T>(pooled_width);
Get the bin's start and end coordinates relative to the RoI.
int hstart = static_cast<int>(floor(static_cast<T>(ph) * bin_size_h));
int wstart = static_cast<int>(floor(static_cast<T>(pw) * bin_size_w));
int hend = static_cast<int>(ceil(static_cast<T>(ph + 1) * bin_size_h));
int wend = static_cast<int>(ceil(static_cast<T>(pw + 1) * bin_size_w));
Add the RoI offsets and clamp to the input boundaries.
// Add roi offsets and clip to input boundaries
hstart = min(max(hstart + roi_start_h, 0), height);
hend = min(max(hend + roi_start_h, 0), height);
wstart = min(max(wstart + roi_start_w, 0), width);
wend = min(max(wend + roi_start_w, 0), width);
bool is_empty = (hend <= hstart) || (wend <= wstart);
Find the maximum element and its index.
// Define an empty pooling region to be zero
T maxval = is_empty ? 0 : -FLT_MAX;
// If nothing is pooled, argmax = -1 causes nothing to be backprop'd
int maxidx = -1;
const T* offset_bottom_data =
bottom_data + (roi_batch_ind * channels + c) * height * width;
for (int h = hstart; h < hend; ++h) {
for (int w = wstart; w < wend; ++w) {
int bottom_index = h * width + w;
if (offset_bottom_data[bottom_index] > maxval) {
maxval = offset_bottom_data[bottom_index];
maxidx = bottom_index;
}
}
}
Write the results to the output.
top_data[index] = maxval;
if (argmax_data) {
argmax_data[index] = maxidx;
}
}
}
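The (n, c, ph, pw) decomposition at the top of the kernel is simply the inverse of NCHW flattening; a host-side sanity check (a sketch, not part of the source):

#include <cassert>

// index = ((n * channels + c) * pooled_height + ph) * pooled_width + pw
int FlattenNCHW(int n, int c, int ph, int pw,
                int channels, int pooled_height, int pooled_width) {
  return ((n * channels + c) * pooled_height + ph) * pooled_width + pw;
}

void CheckDecomposition(int index, int channels, int pooled_height, int pooled_width) {
  // Same arithmetic as the kernel.
  const int pw = index % pooled_width;
  const int ph = (index / pooled_width) % pooled_height;
  const int c = (index / pooled_width / pooled_height) % channels;
  const int n = index / pooled_width / pooled_height / channels;
  assert(FlattenNCHW(n, c, ph, pw, channels, pooled_height, pooled_width) == index);
}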
OPERATOR_SCHEMA(RoIPool)
NumOutputs can specify the set of allowed output counts.
// Input: X, rois
// Output case #1: Y, argmaxes (train mode)
// Output case #2: Y (test mode)
OPERATOR_SCHEMA(RoIPool)
.NumInputs(2)
.NumOutputs({1, 2})
TensorInferenceFunction stores the given lambda in tensor_inference_function_, which InferTensor later invokes; the lambda derives the output shapes from the input shapes.
.TensorInferenceFunction([](const OperatorDef& def,
const vector<TensorShape>& in) {
ArgumentHelper helper(def);
const StorageOrder order = StringToStorageOrder(
helper.GetSingleArgument<string>("order", "NCHW"));
const TensorShape& X = in[0];
const int num_channels =
(order == StorageOrder::NCHW ? X.dims(1) : X.dims(3));
const TensorShape& R = in[1];
const int num_rois = R.dims(0);
const int pooled_height = helper.GetSingleArgument<int>("pooled_h", 1);
const int pooled_width = helper.GetSingleArgument<int>("pooled_w", 1);
TensorShape Y = CreateTensorShape(
vector<int>({num_rois, num_channels, pooled_height, pooled_width}),
X.data_type());
bool is_test = helper.GetSingleArgument<int>(OpSchema::Arg_IsTest, 0);
if (!is_test) {
TensorShape argmaxes = Y;
argmaxes.set_data_type(TensorProto_DataType_INT32);
return vector<TensorShape>({Y, argmaxes});
} else {
return vector<TensorShape>({Y});
}
})
.SetDoc(R"DOC(
Carries out ROI Pooling for Faster-RCNN.
Depending on the mode, there are multiple output cases:
Output case #1: Y, argmaxes (train mode)
Output case #2: Y (test mode)
)DOC")
.Arg(
"is_test",
"If set, run in test mode and skip computation of argmaxes (used for "
"gradient computation). Only one output tensor is produced. "
"(Default: false).")
.Arg("order", "A StorageOrder string (Default: \"NCHW\").")
.Arg("pooled_h", "The pooled output height (Default: 1).")
.Arg("pooled_w", "The pooled output width (Default: 1).")
.Arg(
"spatial_scale",
"Multiplicative spatial scale factor to translate ROI coords from "
"their input scale to the scale used when pooling (Default: 1.0).")
.Input(
0,
"X",
"The input 4-D tensor of data. Only NCHW order is currently supported.")
.Input(
1,
"rois",
"RoIs (Regions of Interest) to pool over. Should be a 2-D tensor of "
"shape (num_rois, 5) given as [[batch_id, x1, y1, x2, y2], ...].")
.Output(
0,
"Y",
"RoI pooled output 4-D tensor of shape "
"(num_rois, channels, pooled_h, pooled_w).")
.Output(
1,
"argmaxes",
"Argmaxes corresponding to indices in X used for gradient computation. "
"Only output if arg \"is_test\" is false.");
RoIPoolGradientOp<float, CUDAContext>::RunOnDevice()
Fetch the inputs and the output.
auto& X = Input(0); // Input data to pool
auto& R = Input(1); // RoIs
auto& A = Input(2); // argmaxes
auto& dY = Input(3); // Gradient of net w.r.t. output of "forward" op
// (aka "gradOutput")
auto* dX = Output(0); // Gradient of net w.r.t. input to "forward" op
// (aka "gradInput")
dX->ResizeLike(X);
Zero out the gradient dX.
// Must zero-out dX before accumulating gradients
math::Set<float, CUDAContext>(
dX->size(), 0.f, dX->mutable_data<float>(), &context_);
If the output gradient is non-empty (i.e., there were RoIs), launch ROIPoolBackward.
if (dY.size() > 0) { // Handle possibly empty gradient if there were no rois
ROIPoolBackward<float><<<
CAFFE_GET_BLOCKS(dY.size()),
CAFFE_CUDA_NUM_THREADS,
0,
context_.cuda_stream()>>>(
dY.size(),
dY.data<float>(),
A.data<int>(),
R.dim32(0),
spatial_scale_,
X.dim32(1),
X.dim32(2),
X.dim32(3),
pooled_height_,
pooled_width_,
dX->mutable_data<float>(),
R.data<float>());
}
return true;
}
ROIPoolBackward
CUDA_1D_KERNEL_LOOP supplies the loop's control structure.
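From memory, the macro in caffe2/core/common_gpu.h is essentially a grid-stride loop, along these lines (a paraphrase, not the verbatim definition):

#define CUDA_1D_KERNEL_LOOP(i, n)                                \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n);   \
       i += blockDim.x * gridDim.x)

Each thread starts at its global index and strides by the total number of launched threads, so any nthreads is covered even when the grid size is capped. The backward kernel: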
template <typename T>
__global__ void ROIPoolBackward(
const int nthreads,
const T* top_diff,
const int* argmax_data,
const int num_rois,
const T spatial_scale,
const int channels,
const int height,
const int width,
const int pooled_height,
const int pooled_width,
T* bottom_diff,
const T* bottom_rois) {
CUDA_1D_KERNEL_LOOP(index, nthreads) {
Derive the pooled-output element this thread handles.
// (n, c, ph, pw) is an element in the pooled output
int pw = index % pooled_width;
int ph = (index / pooled_width) % pooled_height;
int c = (index / pooled_width / pooled_height) % channels;
int n = index / pooled_width / pooled_height / channels;
Locate the gradients and the pooled argmax indices.
const T* offset_bottom_rois = bottom_rois + n * 5;
int roi_batch_ind = offset_bottom_rois[0];
int bottom_offset = (roi_batch_ind * channels + c) * height * width;
int top_offset = (n * channels + c) * pooled_height * pooled_width;
const T* offset_top_diff = top_diff + top_offset;
T* offset_bottom_diff = bottom_diff + bottom_offset;
const int* offset_argmax_data = argmax_data + top_offset;
Accumulate the bottom gradient. The atomic add is needed because overlapping RoIs (and overlapping bins within a single RoI) can select the same input element. One can also see that the Caffe2 implementation still bears clear traces of Caffe.
int argmax = offset_argmax_data[ph * pooled_width + pw];
if (argmax != -1) {
gpu_atomic_add(
static_cast<T>(offset_top_diff[ph * pooled_width + pw]),
offset_bottom_diff + argmax);
}
}
}
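In equation form, max pooling routes each output gradient back to the single input element that produced the maximum; summing over every pooled cell whose argmax is input element $k$:

$$\frac{\partial L}{\partial x_k} = \sum_{(n,c,ph,pw)\,:\,\mathrm{argmax}(n,c,ph,pw)=k} \frac{\partial L}{\partial y_{n,c,ph,pw}}$$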
References
- ROI Pooling层解析_代码原理详细解释和存在目的
- RCNN学习笔记(4):fast rcnn
- ROI Pooling层解析
- rbgirshick/fast-rcnn
- 详解 roi-pooling 层的实现
- Region of interest pooling explained
- Region of interest pooling in TensorFlow – example
- 深度 | 像玩乐高一样拆解Faster R-CNN:详解目标检测的实现过程
- roi_pooling_layer.cpp
- What is the difference between R-CNN and Fast R-CNN?
- Why is Fast-R-CNN faster than R-CNN?