[paper reading] FCOS
GitHub: Notes of Classic Detection Papers
Update 2020.11.09: added the Use Yourself part (my own understanding of and thoughts on this paper); see GitHub: Notes-of-Classic-Detection-Papers for details.
I originally wanted to host these notes on GitHub, but GitHub does not render the formulas.
So they ended up on CSDN, where the formatting is also somewhat messy.
I strongly recommend downloading the source files from GitHub: Notes-of-Classic-Detection-Papers and reading them there; that gives the best reading experience!
And if you find the notes useful, a star would be appreciated!
| topic | motivation | technique | key element | math | use yourself | relativity |
| --- | --- | --- | --- | --- | --- | --- |
| FCOS | Idea<br>Contribution | FCOS Architecture<br>Center-ness<br>Multi-Level FPN Prediction | Prediction Head<br>Training Sample & Label<br>Model Output<br>Feature Pyramid<br>Inference<br>Ablation Study<br>FCN & Detection<br>FCOS vs. YOLO v1 | Symbol Definition<br>Loss Function<br>Center-ness<br>Remap of Feature & Image | …… | Related Work |
Motivation
Idea
Perform object detection by per-pixel prediction (implemented as a fully convolutional network).
Contribution
- Reformulate detection as per-pixel prediction.
- Use multi-level prediction to:
  - improve recall
  - resolve the ambiguity caused by overlapping bounding boxes
- Add a center-ness branch to suppress low-quality predicted bounding boxes.
Techniques
FCOS Architecture
The default choice of backbone is ResNet-50.
Advantage
See [Drawbacks of Anchor](#Drawbacks of Anchor).
- Unifies detection with FCN-solvable tasks (e.g. semantic segmentation), so ideas from those tasks can be transferred to detection (re-use of ideas).
- Anchor-free and proposal-free.
- Eliminates the complex computation associated with anchors (e.g. IoU), yielding faster training and testing and a smaller training memory footprint.
- Achieves SOTA among one-stage detectors and can be used to replace the RPN.
- Can be quickly transferred to other vision tasks (e.g. instance segmentation, key-point detection).
Center-ness
Center-ness is predicted for every location, and it improves performance considerably.
Idea
Locations far away from an object's center produce a large number of low-quality predicted bounding boxes.
FCOS introduces center-ness to suppress (i.e. down-weight) these low-quality bounding boxes far from the center.
Implement
A center-ness branch is added to predict the center-ness of each location.
At test time:
- Compute the score as $\text{final score} = \text{classification score} \times \text{center-ness}$.
- Use NMS to filter out the suppressed bounding boxes.
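A minimal sketch of this test-time combination (the function name and tensor shapes are my own assumptions):

```python
import torch

def final_score(cls_score: torch.Tensor, centerness: torch.Tensor) -> torch.Tensor:
    """Combine the two branches at test time.

    cls_score:  (N, C) per-location classification scores (after sigmoid)
    centerness: (N, 1) per-location center-ness predictions (after sigmoid)
    """
    # Off-center locations get a small center-ness, so the final score of
    # their (usually low-quality) boxes drops and NMS can filter them out.
    return cls_score * centerness
```

The predicted boxes are then ranked by this score and passed to standard NMS.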
Multi-Level FPN Prediction
Multi-Level FPN Prediction solves two problems:
Best Possible Recall
It raises the Best Possible Recall of FCOS to the SOTA level.
Ambiguity of Ground-Truth Box Overlap
It resolves the ambiguity caused by overlapping ground-truth boxes, matching the level of anchor-based detectors.
The reason this works: in the vast majority of cases, overlapping objects differ greatly in scale.
The idea is to dispatch the locations to be regressed to different feature levels according to their regression distances (see the sketch after this list).
Concretely:
- Compute the regression targets.
- Filter positive samples by each feature level's maximum regression distance:
  $m_{i-1} < \max(l^*, t^*, r^*, b^*) < m_i$
  where $m_i$ is the maximum distance that feature level $i$ is required to regress, with $\{m_2, m_3, m_4, m_5, m_6, m_7\} = \{0, 64, 128, 256, 512, \infty\}$.
  Compared with the original use of feature pyramids (e.g. SSD), FCOS "implicitly" assigns objects of different scales to different feature levels (small objects to shallow levels, large objects to deep levels). I see this as a more refined kind of hand-crafted design.
- If a location falls into two ground-truth boxes (i.e. an ambiguity), it regresses the smaller box (a bias toward small objects).
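A sketch of this level-assignment rule under the thresholds above (the constant and function names are illustrative):

```python
import torch

# m_2 .. m_7 from the paper: level i keeps targets whose maximum
# regression distance falls inside (m_{i-1}, m_i).
REGRESS_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, float("inf"))]

def level_mask(reg_targets: torch.Tensor, level: int) -> torch.Tensor:
    """reg_targets: (N, 4) per-location (l*, t*, r*, b*).
    Returns a bool mask of the locations that feature level `level`
    (0 for P3 ... 4 for P7) keeps as positive samples."""
    lo, hi = REGRESS_RANGES[level]
    max_dist = reg_targets.max(dim=1).values  # max(l*, t*, r*, b*)
    return (max_dist > lo) & (max_dist < hi)
```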
Key Elements
Prediction Head
Classification Branch
Regression Branch
Since the regression targets are always positive, an $\text{exp}(s_i x)$ is appended on top of the regression branch (see [Shared Head](#Shared Head)).
Shared Head
The head is shared across the different feature levels.
Advantages:
- parameter efficient
- improved performance
Drawback:
Because of [Multi-Level FPN Prediction](#Multi-Level FPN Prediction), different feature levels produce different output ranges (e.g. $[0, 64]$ for $P_3$ and $[64, 128]$ for $P_4$).
To allow identical heads to be used on different feature levels:
$$\text{exp}(x) \rightarrow \text{exp}(s_i x)$$
- $s_i$: a trainable scalar used to automatically adjust the base of the exponential function
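A minimal PyTorch sketch of this scaled exponent (the class name is illustrative, but the pattern follows the paper's description of one learnable scalar per level):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable scalar s_i, one instance per feature level, so that
    identical heads can produce different output ranges per level."""
    def __init__(self, init_value: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # exp(s_i * x) also guarantees a positive regression output.
        return torch.exp(self.scale * x)
```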
Training Sample & Label
Training Sample
Each location is used directly as a training sample (similar to FCN for semantic segmentation).
Label Pos/Neg
A location $(x, y)$ is labeled positive iff:
- location $(x, y)$ falls inside a ground-truth box, and
- the class label of location $(x, y)$ equals the class of that ground-truth box.
FCOS trains on as many foreground samples as possible (i.e. all locations inside a ground-truth box), as sketched below.
It is unlike anchor-based detectors, which take only anchors with a high IoU with the ground-truth box as positives,
and unlike [CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), which treats only the geometric center as positive.
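A sketch of the inside-box test behind this labeling rule (function name and tensor layout are assumptions):

```python
import torch

def inside_box(points: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    """points: (N, 2) location coordinates (x, y); box: (4,) = (x0, y0, x1, y1).
    Every location inside the ground-truth box is a candidate positive;
    the class-label condition is checked separately."""
    x, y = points[:, 0], points[:, 1]
    x0, y0, x1, y1 = box
    return (x > x0) & (x < x1) & (y > y0) & (y < y1)
```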
Model Output
For every location of each level's feature map, the model outputs the following:
4D Vector $\pmb{t}^*$
$\pmb{t}^* = (l^*, t^*, r^*, b^*)$
It describes the relative offsets from the location to the four sides of the bounding box.
Concretely:
$$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y$$
Note:
FCOS computes this target at every location inside the ground-truth box (not just the geometric center), so it must predict four quantities to recover the boundary.
[CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), by contrast, predicts only at the geometric center, where two quantities are enough.
Note: overlap between objects is largely resolved by [Multi-Level FPN Prediction](#Multi-Level FPN Prediction). If an overlap still occurs, the small object takes priority (the location regresses the bounding box with the smallest area).
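A direct translation of the target above into code (names are illustrative):

```python
def regression_target(x, y, box):
    """Distances from location (x, y) to the four sides of ground-truth
    box (x0, y0, x1, y1). All four values are positive exactly when the
    location lies inside the box."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)  # (l*, t*, r*, b*)
```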
C-Dimensional Vector $\pmb{p}$
The experiments use $C$ binary classifiers rather than one $C$-class classifier.
Feature Pyramid
Five levels of feature maps are defined: $\{P_3, P_4, P_5, P_6, P_7\}$, with strides $\{8, 16, 32, 64, 128\}$.
- $\{P_3, P_4, P_5\}$: backbone feature maps $\{C_3, C_4, C_5\}$ followed by 1×1 convolutions
- $\{P_6, P_7\}$: obtained by applying a stride-2 convolutional layer on $P_5$ and $P_6$, respectively
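A simplified sketch of this pyramid top (the class name and channel counts are assumptions; the FPN top-down pathway is omitted for brevity):

```python
import torch.nn as nn

class PyramidTop(nn.Module):
    """P3-P5 from 1x1 convs on C3-C5; P6 and P7 from stride-2 convs on
    P5 and P6 respectively. Resulting strides: 8, 16, 32, 64, 128."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.p6 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p3, p4, p5 = (conv(c) for conv, c in zip(self.lateral, (c3, c4, c5)))
        p6 = self.p6(p5)
        p7 = self.p7(p6)
        return p3, p4, p5, p6, p7
```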
Inference
- Feed the image through the network and obtain, at every location of feature map $F_i$:
  - the classification score $\pmb{p}_{x,y}$
  - the regression prediction $\pmb{t}_{x,y}$
- Select the locations with $p_{x,y} > 0.05$ as positive samples (the paper's threshold).
- Decode them to obtain the bounding-box coordinates, as sketched below.
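Decoding simply inverts the regression target (names are illustrative):

```python
def decode_box(x, y, l, t, r, b):
    """Invert the prediction (l, t, r, b) at location (x, y) back to
    (x0, y0, x1, y1) corner coordinates."""
    return (x - l, y - t, x + r, y + b)
```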
Ablation Study
Multi-Level FPN Prediction
Conclusions:
- Best Possible Recall is not a problem for FCOS.
- Multi-Level FPN Prediction improves the Best Possible Recall further.
Ambiguous Samples
Conclusion:
- Multi-Level FPN Prediction resolves the problem of ambiguous samples: most of the overlap ambiguity is dispatched to different feature levels, and only very few ambiguous locations remain.
With or Without Center-ness
- Center-ness suppresses the low-quality bounding boxes produced far from the center, which improves AP substantially.
- Center-ness must have its own separate branch.
FCN & Detection
FCN is mainly used for dense prediction.
In fact, the fundamental vision tasks can all be unified into one single framework.
The use of anchors is what pulled detection away from the neat fully convolutional per-pixel prediction framework.
FCOS vs. YOLO v1
Whereas YOLO v1 predicts only from points near the center, FCOS predicts from all points inside the ground-truth box.
The resulting low-quality bounding boxes are then suppressed by center-ness.
This allows FCOS to reach a recall comparable to that of anchor-based detectors.
Math
Symbol Definition
- $F_i \in \mathbb{R}^{H \times W \times C}$: the feature map at layer $i$ of the backbone
- $s$: the total stride up to that layer
- $\{B_i\}$: the ground-truth boxes, where
  $B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \dots, C\}$
  - $(x_0^{(i)}, y_0^{(i)})$: top-left corner coordinate
  - $(x_1^{(i)}, y_1^{(i)})$: bottom-right corner coordinate
  - $c^{(i)}$: the class of the object in the bounding box
  - $C$: the number of classes
Loss Function
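For reference, the overall loss defined in the paper ($L_{\text{cls}}$ is the focal loss, $L_{\text{reg}}$ the IoU loss, and $N_{\text{pos}}$ the number of positive samples):

$$L\big(\{\pmb{p}_{x,y}\}, \{\pmb{t}_{x,y}\}\big) = \frac{1}{N_{\text{pos}}} \sum_{x,y} L_{\text{cls}}\big(\pmb{p}_{x,y}, c^*_{x,y}\big) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{\text{reg}}\big(\pmb{t}_{x,y}, \pmb{t}^*_{x,y}\big)$$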
- $\lambda = 1$
On top of this there is also a center-ness loss, which is a binary cross-entropy.
The loss is summed over all locations of the feature map. Specifically:
- the classification loss is computed on all locations (positive & negative);
- the regression loss is computed only on positive locations, via the indicator $\mathbb{1}_{\{c_{x,y}^* > 0\}}$, which is $1$ if $c_{x,y}^* > 0$ and $0$ otherwise.
Center-ness
$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$
- Center-ness reflects the normalized distance from a location to the center of its object.
- The square root is used to slow down the decay of center-ness.
- Center-ness ranges over $[0, 1]$.
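A direct translation of this target into code (the function name is illustrative):

```python
import math

def centerness_target(l, t, r, b):
    """Center-ness of a location given its regression target (l*, t*, r*, b*):
    1 at the exact center of the box, decaying toward the sides; the square
    root slows that decay."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```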
Remap of Feature & Image
A location $(x, y)$ on the feature map is mapped back onto the input image as
$$\big( \lfloor \tfrac{s}{2} \rfloor + xs ,\ \lfloor \tfrac{s}{2} \rfloor + ys \big)$$
which lies close to the center of the receptive field of location $(x, y)$.
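The remapping as a one-liner (with $s$ the total stride of the level; the function name is illustrative):

```python
def feature_to_image(x, y, s):
    """Map feature-map location (x, y) at total stride s back onto the
    input image, near the center of its receptive field."""
    return (s // 2 + x * s, s // 2 + y * s)
```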
Use Yourself
Related Work
Drawbacks of Anchor
- Detection performance is sensitive to hyper-parameters such as the size, aspect ratio, and number of anchors.
  In other words, anchors require careful manual design.
- A large number of anchors is needed to obtain a high recall rate,
  which causes an extreme imbalance between positive and negative samples during training.
- Anchors come with complex computation,
  such as computing IoU.
- Anchor sizes and aspect ratios are pre-defined, so they cannot handle shape variations (especially for small objects).
  Moreover, this "pre-defined" form also hurts the model's generalization ability; in other words, the designed anchors are task-specific.
DenseBox-Based
- It crops and resizes the image to handle bounding boxes of different sizes,
  so DenseBox has to run detection on an image pyramid,
  which contradicts the FCN philosophy of computing all convolutions only once.
- It is only used in specific domains and has difficulty with overlapping objects,
  because it cannot decide which object a given pixel should regress to.
- Its recall is relatively low.
Anchor-Based Detector
- Origin:
  sliding-window and proposal-based detectors
- Essence of anchors:
  pre-defined sliding windows (proposals) + offset regression
- Role of anchors:
  training samples for the detector
- Typical models:
  - Faster R-CNN
  - SSD
  - YOLO v2
YOLO v1
YOLO v1 is a typical anchor-free detector.
Idea
YOLO v1 uses only points near the center to predict bounding boxes:
whichever grid cell an object's center falls into is responsible for predicting that object's bounding box.
The rationale is that points near the center produce higher-quality detections.
Drawbacks of Points near Center
Using only the points near the center leads to low recall.
This is exactly why YOLO v2 went back to using anchors.
CornerNet
CornerNet is a typical anchor-free detector.
Steps
- corner detection
- corner grouping
- post-processing
Drawbacks of Corner
The post-processing is complex, requiring an additional distance metric.