Object Detection (Part 1)
RCNN series
RCNN is the abbreviation of Region with CNN features; the author is Ross Girshick. This method can be seen as the first to use deep learning to solve the object detection task. Details of the method are available in the paper "Rich feature hierarchies for accurate object detection and semantic segmentation".
RCNN
RCNN algorithm process:
step | operation |
---|---|
1 | 1K~2K proposal regions are generated per image using Selective Search. |
2 | For each proposal region, a deep network is used to extract features. |
3 | The features are then fed into per-class SVM classifiers to determine whether each region belongs to that category. |
4 | Regressors are used to fine-tune each proposal region's position. |
<1>Generation of proposal regions
The Selective Search algorithm first obtains initial regions by image segmentation, then applies merging strategies to combine these regions into a hierarchical region structure that may contain possible objects.
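As an illustration of this step, here is a minimal sketch using OpenCV's contrib implementation of Selective Search (it assumes `opencv-contrib-python` is installed; the helper name `selective_search_proposals` is made up for this example):

```python
# A minimal sketch of generating proposal regions with Selective Search,
# assuming OpenCV is built with the contrib (ximgproc) module.
import cv2

def selective_search_proposals(image_path, max_regions=2000):
    img = cv2.imread(image_path)                       # BGR image
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()                   # fast mode trades recall for speed
    rects = ss.process()                               # array of (x, y, w, h) boxes
    return rects[:max_regions]                         # keep ~2K proposals as in RCNN

# proposals = selective_search_proposals("dog.jpg")
```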
<2>Extract features from proposal regions by deep network
Resize (warp) all ~2000 proposal regions to 227×227 pixels, then send these resized regions into a pre-trained model such as AlexNet. We extract a 4096-dimensional feature vector from each proposal region.
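A hedged sketch of this step using torchvision (the original RCNN used a Caffe AlexNet; here a torchvision AlexNet with ImageNet weights stands in for the pre-trained model, and the `weights` API assumes a recent torchvision):

```python
# Warp each proposal to 227x227 and extract a 4096-d fc7 feature with a
# pre-trained AlexNet (torchvision stand-in, for illustration only).
import torch
import torchvision.models as models
import torchvision.transforms as T

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
# Drop the last Linear layer so the output is the 4096-d fc7 activation.
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((227, 227)),          # warp the cropped proposal region
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature(crop):
    """crop: HxWx3 uint8 array of one proposal region -> (4096,) feature vector."""
    with torch.no_grad():
        x = preprocess(crop).unsqueeze(0)   # (1, 3, 227, 227)
        return alexnet(x).squeeze(0)        # (4096,)
```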
<3>Score each vector using SVM trained for each class
The feature matrix is typically 2000×4096, and all the SVM classifiers together form a weight matrix of 4096×N, where N is the number of classes. After multiplying the feature matrix by the SVM weight matrix, we obtain a 2000×N score matrix. Non-maximum suppression is then applied to each column of this matrix to eliminate overlapping proposal regions, keeping only the highest-scoring proposal regions for each class.
Non-maximum suppression process (a code sketch follows the table):
step | operation |
---|---|
1 | Find the proposal region with the highest score. |
2 | Compute the IoU of this region with all remaining proposal regions. |
3 | Remove all proposal regions whose IoU with the highest-scoring one exceeds the threshold, then repeat with the remaining regions. |
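A minimal NumPy sketch of this per-class NMS procedure (boxes are assumed to be given as (x1, y1, x2, y2) and the threshold value is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # indices sorted by score, high to low
    keep = []
    while order.size > 0:
        i = order[0]                          # step 1: highest-scoring box
        keep.append(i)
        # step 2: IoU of this box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # step 3: drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_threshold]
    return keep
```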
<4>Fine-tune the positions of proposal regions using the regressors
After non-maximum suppression, we need a further operation to fine-tune the positions of the remaining proposal regions so that we can obtain the bounding box with the highest score for each class.
RCNN algorithm limitations:
1. Slow test speed
Testing one image takes about 53 seconds on a CPU. Of this, Selective Search takes roughly 2 seconds to extract the proposal regions, and because the proposal regions of one image overlap heavily, the extracted features are highly redundant.
2. Slow and tedious training
The training pipeline is multi-stage (extract CNN features, train the SVMs, then train the bounding-box regressors), which makes the process extremely slow and tedious.
3. Huge storage space required for training
During training of the SVMs and the bbox regressors, features must be extracted from every proposal region of every image and written to disk. With a deep backbone such as VGG16, the features extracted from the 5,000 training images of VOC07 require hundreds of GB of storage.
fastRCNN
FastRCNN is Ross Girshick's follow-up to RCNN, and it also uses VGG16 as the backbone. Training and inference of fastRCNN are 9 times and 213 times faster than RCNN, respectively, and the accuracy on the Pascal VOC dataset increases from 62% to 66%.
fastRCNN algorithm process:
step | operation |
---|---|
1 | 1K~2K regions of interest (ROIs) are generated per image using Selective Search. |
2 | Feed the whole image into the neural network to get the corresponding feature map, then map the ROIs generated by the Selective Search algorithm onto the feature map to obtain the corresponding feature matrices. |
3 | Each feature matrix is pooled into a 7×7 feature map by the ROI pooling layer and then mapped to a feature vector by fully connected layers. The network has two output vectors per ROI: softmax probabilities and per-class bounding-box regression offsets. |
Key points
<1>Extract the feature map from the original image only once
fastRCNN takes the entire image as input to produce a feature map, then projects the ROIs onto this feature map to obtain the corresponding region features. After the ROI pooling layer and a flatten layer, each region becomes a fixed-length feature vector. All of these vectors are derived from the shared feature map, which is computed only once.
<2>Sampling of training data
In order to keep the training data balanced, IoU values are used to divide all samples into positive samples (objects) and negative samples (background). The sampling ratio is close to 1:1.
<3>ROI pooling layer:
Thanks to the ROI pooling layer, the size of the input image is not restricted: each ROI, whatever its size, is pooled to a fixed 7×7 grid (a code sketch follows).
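For illustration, a minimal sketch using `torchvision.ops.roi_pool`; the feature-map size, the 1/16 stride (VGG16-like backbone), and the ROI coordinates are assumptions for this example:

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 512, 38, 50)                 # conv feature map of a ~608x800 image
rois = torch.tensor([[0, 40.0, 60.0, 300.0, 220.0],    # (batch_idx, x1, y1, x2, y2) in image coords
                     [0, 10.0, 10.0, 500.0, 580.0]])
# Proposals of very different sizes are all pooled to a fixed 7x7 grid.
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                                    # torch.Size([2, 512, 7, 7])
```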
<4>Classifier:
The classifier outputs probabilities for N+1 categories, i.e. the N object categories plus background, as a softmax over N+1 scores per ROI.
<5>Bounding box regressor:
The regressor outputs 4 parameters to correct the bounding-box position for each category; in total there are $4\times(N+1)$ parameters, including the background class.
$$\hat{G}_x = P_w \, d_x(P) + P_x,\qquad \hat{G}_y = P_h \, d_y(P) + P_y$$
$$\hat{G}_w = P_w \exp(d_w(P)),\qquad \hat{G}_h = P_h \exp(d_h(P))$$
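As a worked example of these decoding formulas, a small NumPy sketch (the proposal and offsets are made up, and boxes are in center/width/height form):

```python
import numpy as np

def decode_box(P, d):
    """Apply predicted offsets d = (dx, dy, dw, dh) to proposal P = (Px, Py, Pw, Ph)."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    Gx = Pw * dx + Px
    Gy = Ph * dy + Py
    Gw = Pw * np.exp(dw)
    Gh = Ph * np.exp(dh)
    return Gx, Gy, Gw, Gh

# decode_box((100, 100, 60, 80), (0.1, -0.05, 0.2, 0.0))
# -> (106.0, 96.0, ~73.3, 80.0)
```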
<6>Multi-task loss:
$$L(p,u,t^u,v) = L_{cls}(p,u) + \lambda\,[u \geq 1]\, L_{loc}(t^u,v)$$
where $p$ is the softmax distribution predicted by the classifier and $u$ is the true class label of the target.
The classification loss is the log loss for the true class $u$:
$$L_{cls}(p,u) = -\log p_u$$
$$L_{loc}(t^u,v) = \sum_{i \in \{x,y,w,h\}} \text{smooth}_{L_1}(t^u_i - v_i)$$
$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
The Iverson bracket $[u \geq 1]$ is defined by
$$[u] = \begin{cases} 1 & \text{if } u \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
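To make the multi-task loss concrete, here is a minimal PyTorch sketch (the function name `fast_rcnn_loss` and the tensor shapes are assumptions for this example, not the reference implementation):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_scores, box_deltas, labels, box_targets, lam=1.0):
    """
    class_scores: (R, N+1) raw class scores per ROI
    box_deltas:   (R, N+1, 4) predicted offsets t for every class
    labels:       (R,) with 0 = background, 1..N = object classes (u)
    box_targets:  (R, 4) regression targets v for the true class
    """
    cls_loss = F.cross_entropy(class_scores, labels)           # L_cls = -log p_u
    fg_idx = torch.nonzero(labels >= 1, as_tuple=True)[0]      # Iverson bracket [u >= 1]
    if fg_idx.numel() > 0:
        t_u = box_deltas[fg_idx, labels[fg_idx]]               # offsets t^u of each ROI's true class
        loc_loss = F.smooth_l1_loss(t_u, box_targets[fg_idx])  # smooth-L1 with threshold 1
    else:
        loc_loss = class_scores.new_zeros(())
    return cls_loss + lam * loc_loss

# Example shapes: 64 ROIs, N = 20 classes (e.g. Pascal VOC)
scores = torch.randn(64, 21)
deltas = torch.randn(64, 21, 4)
labels = torch.randint(0, 21, (64,))
targets = torch.randn(64, 4)
print(fast_rcnn_loss(scores, deltas, labels, targets))
```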
fasterRCNN
fasterRCNN algorithm process: RPN + fastRCNN
step | operation |
---|---|
1 | Feed the image into the neural network to get the corresponding feature map. |
2 | A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, which are then mapped onto the feature map to get the corresponding feature matrices. |
3 | Each feature matrix is pooled into a 7×7 feature map and then mapped to a feature vector by fully connected layers. The network has two output vectors per ROI: softmax probabilities and per-class bounding-box regression offsets. |
Key points
<1>RPN
It is important to distinguish between an anchor and a region proposal. An anchor is an area of the original image that will be fine-tuned by the position parameters generated by the RPN. Each anchor point carries 9 anchor boxes, combining the sizes 128×128, 256×256, 512×512 with the aspect ratios 1:1, 1:2, 2:1. For a 1000×600×3 image, about 20k anchors are obtained. After removing anchors that cross the image border, about 6k anchors remain. These anchors are then fine-tuned with the position parameters generated by the RPN, yielding about 6k region proposals. However, these proposals overlap heavily, so non-maximum suppression is performed to reduce their number to about 2k, comparable to the number produced by Selective Search. A sketch of anchor generation follows.
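For illustration, a minimal NumPy sketch of anchor generation under the sizes and ratios above (the function names and the stride-16 assumption are for this example):

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """9 anchors centered at (0, 0): areas 128^2, 256^2, 512^2 at ratios 1:1, 1:2, 2:1."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)                 # keep the area s*s while changing the ratio
            h = s / np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)                   # (9, 4) as (x1, y1, x2, y2)

def all_anchors(feat_h, feat_w, stride=16):
    """Tile the 9 base anchors over every feature-map location."""
    base = base_anchors()
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    shifts = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4) * stride
    return (shifts + base).reshape(-1, 4)      # (feat_h * feat_w * 9, 4)

# For a ~1000x600 image with stride 16 (feature map ~63x38):
# all_anchors(38, 63).shape -> (21546, 4), i.e. roughly the 20k anchors mentioned above.
```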
<2>RPN Multi-task loss:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
Softmax cross entropy:
$$L_{cls} = -\log(p_i)$$
where $p_i$ is the probability the classifier assigns to the true label of the $i$-th anchor, and $p_i^*$ is the ground-truth label: $p_i^* = 1$ for positive samples and $p_i^* = 0$ for negative samples. Note that with softmax cross entropy the cls layer predicts 2k scores, where k is the number of anchors.
prediction | 1 | 0 | 1 | 0 | 1 |
---|---|---|---|---|---|
probability {negative, positive} | {0.3, 0.7} | {0.2, 0.8} | {0.7, 0.3} | {0.7, 0.3} | {0.8, 0.2} |
softmax cross entropy | $-\log 0.7$ | $-\log 0.2$ | $-\log 0.3$ | $-\log 0.7$ | $-\log 0.2$ |
Binary cross entropy:
$$L_{cls} = -[\,p_i^* \log(p_i) + (1-p_i^*) \log(1-p_i)\,]$$
where $p_i$ is the predicted probability that the $i$-th anchor contains an object, and $p_i^* = 1$ for positive samples and $p_i^* = 0$ for negative samples. Note that with binary cross entropy the cls layer predicts only k scores, where k is the number of anchors (a code sketch comparing the two formulations follows the table below).
prediction | 1 | 0 | 1 | 0 | 1 |
---|---|---|---|---|---|
probability $p_i$ | 0.3 | 0.7 | 0.3 | 0.8 | 0.7 |
binary cross entropy | $-[1\times\log 0.3+(1-1)\times\log(1-0.3)]$ | $-[0\times\log 0.7+(1-0)\times\log(1-0.7)]$ | $-[1\times\log 0.3+(1-1)\times\log(1-0.3)]$ | $-[0\times\log 0.8+(1-0)\times\log(1-0.8)]$ | $-[1\times\log 0.7+(1-1)\times\log(1-0.7)]$ |
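A small PyTorch sketch contrasting the two objectness formulations (random logits; the shapes are illustrative only):

```python
import torch
import torch.nn.functional as F

k = 5                                           # number of anchors at one location
labels = torch.tensor([1, 0, 1, 0, 1])          # p_i^*: 1 = object, 0 = background

# Softmax variant: the cls layer outputs 2k scores (background/object per anchor).
logits_2k = torch.randn(k, 2)
loss_softmax = F.cross_entropy(logits_2k, labels)

# BCE variant: the cls layer outputs k scores, one objectness logit per anchor.
logits_k = torch.randn(k)
loss_bce = F.binary_cross_entropy_with_logits(logits_k, labels.float())

print(loss_softmax.item(), loss_bce.item())
```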
Anchor box regression loss:
$$L_{reg}(t_i, t_i^*) = \sum_i \text{smooth}_{L_1}(t_i - t_i^*)$$
$$\text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
$$t_i = [t_x, t_y, t_w, t_h], \qquad t_i^* = [t_x^*, t_y^*, t_w^*, t_h^*]$$
where $t_i$ denotes the predicted regression parameters of the $i$-th anchor and $t_i^*$ denotes the regression targets of the corresponding ground-truth (GT) box. $t_i$ is obtained from the network's predictions, while $t_i^*$ is computed from the anchor and the GT box. They are defined as follows:
$$t_x = \frac{x - x_a}{w_a}, \qquad t_y = \frac{y - y_a}{h_a}$$
$$t_w = \log\frac{w}{w_a}, \qquad t_h = \log\frac{h}{h_a}$$
$$t_x^* = \frac{x^* - x_a}{w_a}, \qquad t_y^* = \frac{y^* - y_a}{h_a}$$
$$t_w^* = \log\frac{w^*}{w_a}, \qquad t_h^* = \log\frac{h^*}{h_a}$$
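As a worked example of the target formulas, a minimal NumPy sketch (boxes are assumed to be in center/width/height form; the numbers are made up):

```python
import numpy as np

def encode_targets(anchor, gt):
    """Compute t* from an anchor (x_a, y_a, w_a, h_a) and its matched GT box (x*, y*, w*, h*)."""
    xa, ya, wa, ha = anchor
    xg, yg, wg, hg = gt
    tx = (xg - xa) / wa
    ty = (yg - ya) / ha
    tw = np.log(wg / wa)
    th = np.log(hg / ha)
    return np.array([tx, ty, tw, th])

# encode_targets((100, 100, 128, 128), (110, 95, 150, 140))
# -> [0.078125, -0.0390625, ~0.1586, ~0.0896]
```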
References:
1. https://www.bilibili.com/video/BV1af4y1m7iL?p=1
2. https://www.bilibili.com/video/BV1af4y1m7iL?p=2
3. https://www.bilibili.com/video/BV1af4y1m7iL?p=3
4. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
5. R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015.
6. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," arXiv:1506.01497, 2016.