Object Detection (Part 1)
RCNN series
RCNN is the abbreviation of Region with CNN features; the author is Ross Girshick. This method can be seen as the first to use deep learning to solve the object detection task. Details of the method are available in the paper "Rich feature hierarchies for accurate object detection and semantic segmentation".
RCNN
RCNN algorithm process:
step | operation |
---|---|
1 | 1K~2K proposal regions are generated per image using Selective Search. |
2 | For each proposal region, a deep network is used to extract features. |
3 | The features are then fed into per-class SVM classifiers to determine whether each region belongs to that category. |
4 | Regressors are used to fine-tune each proposal region's position. |
<1>Generation of proposal regions
The Selective Search algorithm first obtains initial regions by image segmentation, then applies merging strategies to combine these regions into a hierarchical region structure that may contain possible objects.
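As an illustration of this step, here is a minimal sketch using OpenCV's contrib implementation of Selective Search (it assumes `opencv-contrib-python` is installed; the helper name `selective_search_proposals` is made up for this example):

```python
# A minimal sketch of generating proposal regions with Selective Search,
# assuming OpenCV is built with the contrib (ximgproc) module.
import cv2

def selective_search_proposals(image_path, max_regions=2000):
    img = cv2.imread(image_path)                       # BGR image
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()                   # fast mode trades recall for speed
    rects = ss.process()                               # array of (x, y, w, h) boxes
    return rects[:max_regions]                         # keep ~2K proposals as in RCNN

# proposals = selective_search_proposals("dog.jpg")
```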
<2>Extract features from proposal regions by deep network
Resize (warp) all ~2000 proposal regions to 227×227 pixels, then send these resized regions into a pre-trained model such as AlexNet. We extract a 4096-dimensional feature vector from each proposal region.
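A hedged sketch of this step using torchvision (the original RCNN used a Caffe AlexNet; here a torchvision AlexNet with ImageNet weights stands in for the pre-trained model, and the `weights` API assumes a recent torchvision):

```python
# Warp each proposal to 227x227 and extract a 4096-d fc7 feature with a
# pre-trained AlexNet (torchvision stand-in, for illustration only).
import torch
import torchvision.models as models
import torchvision.transforms as T

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
# Drop the last Linear layer so the output is the 4096-d fc7 activation.
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((227, 227)),          # warp the cropped proposal region
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature(crop):
    """crop: HxWx3 uint8 array of one proposal region -> (4096,) feature vector."""
    with torch.no_grad():
        x = preprocess(crop).unsqueeze(0)   # (1, 3, 227, 227)
        return alexnet(x).squeeze(0)        # (4096,)
```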
<3>Score each vector using SVM trained for each class
The feature matrix is typically 2000×4096, and all the SVM classifiers together form a weight matrix of 4096×N, where N is the number of classes. After multiplying the feature matrix by the SVM weight matrix, we obtain a 2000×N score matrix. Non-maximum suppression is then applied to each column of this matrix to eliminate overlapping proposal regions, keeping only the highest-scoring proposal regions for each class.
Non-maximum suppression process (a code sketch follows the table):
step | operation |
---|---|
1 | Find the proposal region with the highest score. |
2 | Compute the IoU of this region with all remaining proposal regions. |
3 | Remove all proposal regions whose IoU with the highest-scoring one exceeds the threshold, then repeat with the remaining regions. |
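A minimal NumPy sketch of this per-class NMS procedure (boxes are assumed to be given as (x1, y1, x2, y2) and the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # indices sorted by score, high to low
    keep = []
    while order.size > 0:
        i = order[0]                          # step 1: highest-scoring box
        keep.append(i)
        # step 2: IoU of this box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # step 3: drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep
```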
<4>Fine-tune the positions of proposal regions using the regressors
After non-maximum suppression, we need a further operation to fine-tune the positions of the remaining proposal regions so that we can obtain the bounding box with the highest score for each class.
RCNN algorithm limitations:
1. Slow test speed
Testing one image takes about 53 seconds on a CPU. Of this, Selective Search takes roughly 2 seconds to extract the proposal regions, and because the proposal regions of one image overlap heavily, the extracted features are highly redundant.
2. Slow and tedious training
The training pipeline is multi-stage (extract CNN features, train the SVMs, then train the bounding-box regressors), which makes the process extremely slow and tedious.
3. Huge storage space required for training
During training of the SVMs and the bbox regressors, features must be extracted from every proposal region of every image and written to disk. With a deep backbone such as VGG16, the features extracted from the 5,000 training images of VOC07 require hundreds of GB of storage.
fastRCNN
FastRCNN is Ross Girshick's follow-up to RCNN, and it also uses VGG16 as the backbone. Training and inference of fastRCNN are 9 times and 213 times faster than RCNN, respectively, and the accuracy on the Pascal VOC dataset increases from 62% to 66%.
fastRCNN algorithm process:
step | operation |
---|---|
1 | 1K~2K regions of interest (ROIs) are generated per image using Selective Search. |
2 | Feed the whole image into the neural network to get the corresponding feature map, then map the ROIs generated by the Selective Search algorithm onto the feature map to obtain the corresponding feature matrices. |
3 | Each feature matrix is pooled into a 7×7 feature map by the ROI pooling layer and then mapped to a feature vector by fully connected layers. The network has two output vectors per ROI: softmax probabilities and per-class bounding-box regression offsets. |
Key points
<1>Extract the feature map from the original image only once
fastRCNN takes the entire image as input to produce a feature map, then projects the ROIs onto this feature map to obtain the corresponding region features. After the ROI pooling layer and a flatten layer, each region becomes a fixed-length feature vector. All of these vectors are derived from the shared feature map, which is computed only once.
<2>Sampling of training data
In order to keep the training data balanced, IoU values are used to divide all samples into positive samples (objects) and negative samples (background). The sampling ratio is close to 1:1.
<3>ROI pooling layer:
Thanks to the ROI pooling layer, the size of the input image is not restricted: each ROI, whatever its size, is pooled to a fixed 7×7 grid (a code sketch follows).
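For illustration, a minimal sketch using `torchvision.ops.roi_pool`; the feature-map size, the 1/16 stride (VGG16-like backbone), and the ROI coordinates are assumptions for this example:

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 512, 38, 50)                 # conv feature map of a ~608x800 image
rois = torch.tensor([[0, 40.0, 60.0, 300.0, 220.0],    # (batch_idx, x1, y1, x2, y2) in image coords
                     [0, 10.0, 10.0, 500.0, 580.0]])
# Proposals of very different sizes are all pooled to a fixed 7x7 grid.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                                    # torch.Size([2, 512, 7, 7])
```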
<4>Classifier:
The classifier outputs probabilities for N+1 categories, i.e. the N object categories plus background, as a softmax over N+1 scores per ROI.
<5>Bounding box regressor:
The regressor outputs 4 parameters to correct the bounding-box position for each category; in total there are $4\times(N+1)$ parameters, including the background class.
$$\hat{G}_x = P_w \, d_x(P) + P_x,\qquad \hat{G}_y = P_h \, d_y(P) + P_y$$
$$\hat{G}_w = P_w \exp(d_w(P)),\qquad \hat{G}_h = P_h \exp(d_h(P))$$
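As a worked example of these decoding formulas, a small NumPy sketch (the proposal and offsets are made up, and boxes are in center/width/height form):

```python
import numpy as np

def decode_box(P, d):
    """Apply predicted offsets d = (dx, dy, dw, dh) to proposal P = (Px, Py, Pw, Ph)."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    Gx = Pw * dx + Px
    Gy = Ph * dy + Py
    Gw = Pw * np.exp(dw)
    Gh = Ph * np.exp(dh)
    return Gx, Gy, Gw, Gh

# decode_box((100, 100, 60, 80), (0.1, -0.05, 0.2, 0.0))
# -> (106.0, 96.0, ~73.3, 80.0)
```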
<6>Multi-task loss:
$$L(p,u,t^u,v) = L_{cls}(p,u) + \lambda\,[u \geq 1]\, L_{loc}(t^u,v)$$
where $p$ is the softmax distribution predicted by the classifier and $u$ is the true class label of the target.
The classification loss is the log loss for the true class $u$:
$$L_{cls}(p,u) = -\log p_u$$
$$L_{loc}(t^u,v) = \sum_{i \in \{x,y,w,h\}} \text{smooth}_{L_1}(t^u_i - v_i)$$
$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
The Iverson bracket $[u \geq 1]$ is defined by
$$[u] = \begin{cases} 1 & \text{if } u \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
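To make the multi-task loss concrete, here is a minimal PyTorch sketch (the function name `fast_rcnn_loss` and the tensor shapes are assumptions for this example, not the reference implementation):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_scores, box_deltas, labels, box_targets, lam=1.0):
    """
    class_scores: (R, N+1) raw class scores per ROI
    box_deltas:   (R, N+1, 4) predicted offsets t for every class
    labels:       (R,) with 0 = background, 1..N = object classes (u)
    box_targets:  (R, 4) regression targets v for the true class
    """
    cls_loss = F.cross_entropy(class_scores, labels)           # L_cls = -log p_u
    fg_idx = torch.nonzero(labels >= 1, as_tuple=True)[0]      # Iverson bracket [u >= 1]
    if fg_idx.numel() > 0:
        t_u = box_deltas[fg_idx, labels[fg_idx]]               # offsets t^u of each ROI's true class
        loc_loss = F.smooth_l1_loss(t_u, box_targets[fg_idx])  # smooth-L1 with threshold 1
    else:
        loc_loss = class_scores.new_zeros(())
    return cls_loss + lam * loc_loss

# Example shapes: 64 ROIs, N = 20 classes (e.g. Pascal VOC)
scores = torch.randn(64, 21)
deltas = torch.randn(64, 21, 4)
labels = torch.randint(0, 21, (64,))
targets = torch.randn(64, 4)
print(fast_rcnn_loss(scores, deltas, labels, targets))
```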
fasterRCNN
fasterRCNN algorithm process: RPN + fastRCNN
step | operation |
---|---|
1 | Feed the image into the neural network to get the corresponding feature map. |
2 | A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, which are then mapped onto the feature map to get the corresponding feature matrices. |
3 | Each feature matrix is pooled into a 7×7 feature map and then mapped to a feature vector by fully connected layers. The network has two output vectors per ROI: softmax probabilities and per-class bounding-box regression offsets. |
Key points
<1>RPN
It is important to distinguish between an anchor and a region proposal. An anchor is an area of the original image that will be fine-tuned by the position parameters generated by the RPN. Each anchor point carries 9 anchor boxes, combining the sizes 128×128, 256×256, 512×512 with the aspect ratios 1:1, 1:2, 2:1. For a 1000×600×3 image, about 20k anchors are obtained. After removing anchors that cross the image border, about 6k anchors remain. These anchors are then fine-tuned with the position parameters generated by the RPN, yielding about 6k region proposals. However, these proposals overlap heavily, so non-maximum suppression is performed to reduce their number to about 2k, comparable to the number produced by Selective Search. A sketch of anchor generation follows.
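For illustration, a minimal NumPy sketch of anchor generation under the sizes and ratios above (the function names and the stride-16 assumption are for this example):

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """9 anchors centered at (0, 0): areas 128^2, 256^2, 512^2 at ratios 1:1, 1:2, 2:1."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)                 # keep the area s*s while changing the ratio
            h = s / np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)                   # (9, 4) as (x1, y1, x2, y2)

def all_anchors(feat_h, feat_w, stride=16):
    """Tile the 9 base anchors over every feature-map location."""
    base = base_anchors()
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    shifts = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4) * stride
    return (shifts + base).reshape(-1, 4)      # (feat_h * feat_w * 9, 4)

# For a ~1000x600 image with stride 16 (feature map ~63x38):
# all_anchors(38, 63).shape -> (21546, 4), i.e. roughly the 20k anchors mentioned above.
```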
<2>RPN Multi-task loss:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
Softmax cross entropy:
$$L_{cls} = -\log(p_i)$$
where $p_i$ is the probability the classifier assigns to the true label of the $i$-th anchor, and $p_i^*$ is the ground-truth label: $p_i^* = 1$ for positive samples and $p_i^* = 0$ for negative samples. Note that with softmax cross entropy the cls layer predicts 2k scores, where k is the number of anchors.
prediction | 1 | 0 | 1 | 0 | 1 |
---|---|---|---|---|---|
probability {negative, positive} | {0.3, 0.7} | {0.2, 0.8} | {0.7, 0.3} | {0.7, 0.3} | {0.8, 0.2} |
softmax cross entropy | $-\log 0.7$ | $-\log 0.2$ | $-\log 0.3$ | $-\log 0.7$ | $-\log 0.2$ |
Binary cross entropy:
$$L_{cls} = -[\,p_i^* \log(p_i) + (1-p_i^*) \log(1-p_i)\,]$$
where $p_i$ is the predicted probability that the $i$-th anchor contains an object, and $p_i^* = 1$ for positive samples and $p_i^* = 0$ for negative samples. Note that with binary cross entropy the cls layer predicts only k scores, where k is the number of anchors (a code sketch comparing the two formulations follows the table below).
prediction | 1 | 0 | 1 | 0 | 1 |
---|---|---|---|---|---|
probability $p_i$ | 0.3 | 0.7 | 0.3 | 0.8 | 0.7 |
binary cross entropy | $-[1\times\log 0.3+(1-1)\times\log(1-0.3)]$ | $-[0\times\log 0.7+(1-0)\times\log(1-0.7)]$ | $-[1\times\log 0.3+(1-1)\times\log(1-0.3)]$ | $-[0\times\log 0.8+(1-0)\times\log(1-0.8)]$ | $-[1\times\log 0.7+(1-1)\times\log(1-0.7)]$ |
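A small PyTorch sketch contrasting the two objectness formulations (random logits; the shapes are illustrative only):

```python
import torch
import torch.nn.functional as F

k = 5                                           # number of anchors at one location
labels = torch.tensor([1, 0, 1, 0, 1])          # p_i^*: 1 = object, 0 = background

# Softmax variant: the cls layer outputs 2k scores (background/object per anchor).
logits_2k = torch.randn(k, 2)
loss_softmax = F.cross_entropy(logits_2k, labels)

# BCE variant: the cls layer outputs k scores, one objectness logit per anchor.
logits_k = torch.randn(k)
loss_bce = F.binary_cross_entropy_with_logits(logits_k, labels.float())

print(loss_softmax.item(), loss_bce.item())
```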
Anchor box regression loss:
$$L_{reg}(t_i, t_i^*) = \sum_i \text{smooth}_{L_1}(t_i - t_i^*)$$
$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
$$t_i = [t_x, t_y, t_w, t_h], \qquad t_i^* = [t_x^*, t_y^*, t_w^*, t_h^*]$$
where $t_i$ denotes the predicted regression parameters of the $i$-th anchor and $t_i^*$ denotes the regression targets of the corresponding ground-truth (GT) box. $t_i$ is obtained from the network's predictions, while $t_i^*$ is computed from the anchor and the GT box. They are defined as follows:
$$t_x = \frac{x - x_a}{w_a}, \qquad t_y = \frac{y - y_a}{h_a}$$
$$t_w = \log\frac{w}{w_a}, \qquad t_h = \log\frac{h}{h_a}$$
$$t_x^* = \frac{x^* - x_a}{w_a}, \qquad t_y^* = \frac{y^* - y_a}{h_a}$$
$$t_w^* = \log\frac{w^*}{w_a}, \qquad t_h^* = \log\frac{h^*}{h_a}$$
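As a worked example of the target formulas, a minimal NumPy sketch (boxes are assumed to be in center/width/height form; the numbers are made up):

```python
import numpy as np

def encode_targets(anchor, gt):
    """Compute t* from an anchor (x_a, y_a, w_a, h_a) and its matched GT box (x*, y*, w*, h*)."""
    xa, ya, wa, ha = anchor
    xg, yg, wg, hg = gt
    tx = (xg - xa) / wa
    ty = (yg - ya) / ha
    tw = np.log(wg / wa)
    th = np.log(hg / ha)
    return np.array([tx, ty, tw, th])

# encode_targets((100, 100, 128, 128), (110, 95, 150, 140))
# -> [0.078125, -0.0390625, ~0.1586, ~0.0896]
```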
References:
1. https://www.bilibili.com/video/BV1af4y1m7iL?p=1
2. https://www.bilibili.com/video/BV1af4y1m7iL?p=2
3. https://www.bilibili.com/video/BV1af4y1m7iL?p=3
4. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
5. R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015.
6. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," arXiv:1506.01497, 2016.