[paper reading] FCOS
GitHub: Notes of Classic Detection Papers
Update 2020.11.09: added the Use Yourself part (my own understanding of and thoughts on this paper); see GitHub: Notes-of-Classic-Detection-Papers for details.
I originally wanted to host these notes on GitHub, but GitHub does not render the formulas.
So they ended up on CSDN, where the formatting is also somewhat messy.
I strongly recommend downloading the source files from GitHub: Notes-of-Classic-Detection-Papers and reading them there; that gives the best reading experience!
And if you find the notes useful, a star would be appreciated!
| topic | motivation | technique | key element | math | use yourself | relativity |
| --- | --- | --- | --- | --- | --- | --- |
| FCOS | Idea<br>Contribution | FCOS Architecture<br>Center-ness<br>Multi-Level FPN Prediction | Prediction Head<br>Training Sample & Label<br>Model Output<br>Feature Pyramid<br>Inference<br>Ablation Study<br>FCN & Detection<br>FCOS vs. YOLO v1 | Symbol Definition<br>Loss Function<br>Center-ness<br>Remap of Feature & Image | …… | Related Work |
Motivation
Idea
Perform object detection by per-pixel prediction (implemented as a fully convolutional network).
Contribution
- Reformulate detection as per-pixel prediction.
- Use multi-level prediction to:
  - improve recall
  - resolve the ambiguity caused by overlapping bounding boxes
- Add a center-ness branch to suppress low-quality predicted bounding boxes.
Techniques
FCOS Architecture
The default choice of backbone is ResNet-50.
Advantage
See [Drawbacks of Anchor](#Drawbacks of Anchor).
- Unifies detection with FCN-solvable tasks (e.g. semantic segmentation), so ideas from those tasks can be transferred to detection (re-use of ideas).
- Anchor-free and proposal-free.
- Eliminates the complex computation associated with anchors (e.g. IoU), yielding faster training and testing and a smaller training memory footprint.
- Achieves SOTA among one-stage detectors and can be used to replace the RPN.
- Can be quickly transferred to other vision tasks (e.g. instance segmentation, key-point detection).
Center-ness
Center-ness is predicted for every location, and it improves performance considerably.
Idea
Locations far away from an object's center produce a large number of low-quality predicted bounding boxes.
FCOS introduces center-ness to suppress (i.e. down-weight) these low-quality bounding boxes far from the center.
Implement
A center-ness branch is added to predict the center-ness of each location.
At test time:
- Compute the score as $\text{final score} = \text{classification score} \times \text{center-ness}$.
- Use NMS to filter out the suppressed bounding boxes.
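A minimal sketch of this test-time combination (the function name and tensor shapes are my own assumptions):

```python
import torch

def final_score(cls_score: torch.Tensor, centerness: torch.Tensor) -> torch.Tensor:
    """Combine the two branches at test time.

    cls_score:  (N, C) per-location classification scores (after sigmoid)
    centerness: (N, 1) per-location center-ness predictions (after sigmoid)
    """
    # Off-center locations get a small center-ness, so the final score of
    # their (usually low-quality) boxes drops and NMS can filter them out.
    return cls_score * centerness
```

The predicted boxes are then ranked by this score and passed to standard NMS.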
Multi-Level FPN Prediction
Multi-Level FPN Prediction solves two problems:
Best Possible Recall
It raises the Best Possible Recall of FCOS to the SOTA level.
Ambiguity of Ground-Truth Box Overlap
It resolves the ambiguity caused by overlapping ground-truth boxes, matching the level of anchor-based detectors.
The reason this works: in the vast majority of cases, overlapping objects differ greatly in scale.
The idea is to dispatch the locations to be regressed to different feature levels according to their regression distances (see the sketch after this list).
Concretely:
- Compute the regression targets.
- Filter positive samples by each feature level's maximum regression distance:
  $m_{i-1} < \max(l^*, t^*, r^*, b^*) < m_i$
  where $m_i$ is the maximum distance that feature level $i$ is required to regress, with $\{m_2, m_3, m_4, m_5, m_6, m_7\} = \{0, 64, 128, 256, 512, \infty\}$.
  Compared with the original use of feature pyramids (e.g. SSD), FCOS "implicitly" assigns objects of different scales to different feature levels (small objects to shallow levels, large objects to deep levels). I see this as a more refined kind of hand-crafted design.
- If a location falls into two ground-truth boxes (i.e. an ambiguity), it regresses the smaller box (a bias toward small objects).
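A sketch of this level-assignment rule under the thresholds above (the constant and function names are illustrative):

```python
import torch

# m_2 .. m_7 from the paper: level i keeps targets whose maximum
# regression distance falls inside (m_{i-1}, m_i).
REGRESS_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, float("inf"))]

def level_mask(reg_targets: torch.Tensor, level: int) -> torch.Tensor:
    """reg_targets: (N, 4) per-location (l*, t*, r*, b*).
    Returns a bool mask of the locations that feature level `level`
    (0 for P3 ... 4 for P7) keeps as positive samples."""
    lo, hi = REGRESS_RANGES[level]
    max_dist = reg_targets.max(dim=1).values  # max(l*, t*, r*, b*)
    return (max_dist > lo) & (max_dist < hi)
```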
Key Elements
Prediction Head
Classification Branch
Regression Branch
Since the regression targets are always positive, an $\text{exp}(s_i x)$ is appended on top of the regression branch (see [Shared Head](#Shared Head)).
Shared Head
The head is shared across the different feature levels.
Advantages:
- parameter efficient
- improved performance
Drawback:
Because of [Multi-Level FPN Prediction](#Multi-Level FPN Prediction), different feature levels produce different output ranges (e.g. $[0, 64]$ for $P_3$ and $[64, 128]$ for $P_4$).
To allow identical heads to be used on different feature levels:
$$\text{exp}(x) \rightarrow \text{exp}(s_i x)$$
- $s_i$: a trainable scalar used to automatically adjust the base of the exponential function
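A minimal PyTorch sketch of this scaled exponent (the class name is illustrative, but the pattern follows the paper's description of one learnable scalar per level):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable scalar s_i, one instance per feature level, so that
    identical heads can produce different output ranges per level."""
    def __init__(self, init_value: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # exp(s_i * x) also guarantees a positive regression output.
        return torch.exp(self.scale * x)
```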
Training Sample & Label
Training Sample
Each location is used directly as a training sample (similar to FCN for semantic segmentation).
Label Pos/Neg
A location $(x, y)$ is labeled positive iff:
- location $(x, y)$ falls inside a ground-truth box, and
- the class label of location $(x, y)$ equals the class of that ground-truth box.
FCOS trains on as many foreground samples as possible (i.e. all locations inside a ground-truth box), as sketched below.
It is unlike anchor-based detectors, which take only anchors with a high IoU with the ground-truth box as positives,
and unlike [CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), which treats only the geometric center as positive.
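A sketch of the inside-box test behind this labeling rule (function name and tensor layout are assumptions):

```python
import torch

def inside_box(points: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    """points: (N, 2) location coordinates (x, y); box: (4,) = (x0, y0, x1, y1).
    Every location inside the ground-truth box is a candidate positive;
    the class-label condition is checked separately."""
    x, y = points[:, 0], points[:, 1]
    x0, y0, x1, y1 = box
    return (x > x0) & (x < x1) & (y > y0) & (y < y1)
```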
Model Output
For every location of each level's feature map, the model outputs the following:
4D Vector $\pmb{t}^*$
$\pmb{t}^* = (l^*, t^*, r^*, b^*)$
It describes the relative offsets from the location to the four sides of the bounding box.
Concretely:
$$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y$$
Note:
FCOS computes this target at every location inside the ground-truth box (not just the geometric center), so it must predict four quantities to recover the boundary.
[CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), by contrast, predicts only at the geometric center, where two quantities are enough.
Note: overlap between objects is largely resolved by [Multi-Level FPN Prediction](#Multi-Level FPN Prediction). If an overlap still occurs, the small object takes priority (the location regresses the bounding box with the smallest area).
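A direct translation of the target above into code (names are illustrative):

```python
def regression_target(x, y, box):
    """Distances from location (x, y) to the four sides of ground-truth
    box (x0, y0, x1, y1). All four values are positive exactly when the
    location lies inside the box."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)  # (l*, t*, r*, b*)
```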
C-Dimensional Vector $\pmb{p}$
The experiments use $C$ binary classifiers rather than one $C$-class classifier.
Feature Pyramid
Five levels of feature maps are defined: $\{P_3, P_4, P_5, P_6, P_7\}$, with strides $\{8, 16, 32, 64, 128\}$.
- $\{P_3, P_4, P_5\}$: backbone feature maps $\{C_3, C_4, C_5\}$ followed by 1×1 convolutions
- $\{P_6, P_7\}$: obtained by applying a stride-2 convolutional layer on $P_5$ and $P_6$, respectively
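A simplified sketch of this pyramid top (the class name and channel counts are assumptions; the FPN top-down pathway is omitted for brevity):

```python
import torch.nn as nn

class PyramidTop(nn.Module):
    """P3-P5 from 1x1 convs on C3-C5; P6 and P7 from stride-2 convs on
    P5 and P6 respectively. Resulting strides: 8, 16, 32, 64, 128."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.p6 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p3, p4, p5 = (conv(c) for conv, c in zip(self.lateral, (c3, c4, c5)))
        p6 = self.p6(p5)
        p7 = self.p7(p6)
        return p3, p4, p5, p6, p7
```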
Inference
- Feed the image through the network and obtain, at every location of feature map $F_i$:
  - the classification score $\pmb{p}_{x,y}$
  - the regression prediction $\pmb{t}_{x,y}$
- Select the locations with $p_{x,y} > 0.05$ as positive samples (the paper's threshold).
- Decode them to obtain the bounding-box coordinates, as sketched below.
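Decoding simply inverts the regression target (names are illustrative):

```python
def decode_box(x, y, l, t, r, b):
    """Invert the prediction (l, t, r, b) at location (x, y) back to
    (x0, y0, x1, y1) corner coordinates."""
    return (x - l, y - t, x + r, y + b)
```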
Ablation Study
Multi-Level FPN Prediction
Conclusions:
- Best Possible Recall is not a problem for FCOS.
- Multi-Level FPN Prediction improves the Best Possible Recall further.
Ambiguous Samples
Conclusion:
- Multi-Level FPN Prediction resolves the problem of ambiguous samples: most of the overlap ambiguity is dispatched to different feature levels, and only very few ambiguous locations remain.
With or Without Center-ness
- Center-ness suppresses the low-quality bounding boxes produced far from the center, which improves AP substantially.
- Center-ness must have its own separate branch.
FCN & Detection
FCN is mainly used for dense prediction.
In fact, the fundamental vision tasks can all be unified into one single framework.
The use of anchors is what pulled detection away from the neat fully convolutional per-pixel prediction framework.
FCOS vs. YOLO v1
Whereas YOLO v1 predicts only from points near the center, FCOS predicts from all points inside the ground-truth box.
The resulting low-quality bounding boxes are then suppressed by center-ness.
This allows FCOS to reach a recall comparable to that of anchor-based detectors.
Math
Symbol Definition
- $F_i \in \mathbb{R}^{H \times W \times C}$: the feature map at layer $i$ of the backbone
- $s$: the total stride up to that layer
- $\{B_i\}$: the ground-truth boxes, where
  $B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \dots, C\}$
  - $(x_0^{(i)}, y_0^{(i)})$: top-left corner coordinate
  - $(x_1^{(i)}, y_1^{(i)})$: bottom-right corner coordinate
  - $c^{(i)}$: the class of the object in the bounding box
  - $C$: the number of classes
Loss Function
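For reference, the overall loss defined in the paper ($L_{\text{cls}}$ is the focal loss, $L_{\text{reg}}$ the IoU loss, and $N_{\text{pos}}$ the number of positive samples):

$$L\big(\{\pmb{p}_{x,y}\}, \{\pmb{t}_{x,y}\}\big) = \frac{1}{N_{\text{pos}}} \sum_{x,y} L_{\text{cls}}\big(\pmb{p}_{x,y}, c^*_{x,y}\big) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{\text{reg}}\big(\pmb{t}_{x,y}, \pmb{t}^*_{x,y}\big)$$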
- $\lambda = 1$
On top of this there is also a center-ness loss, which is a binary cross-entropy.
The loss is summed over all locations of the feature map. Specifically:
- the classification loss is computed on all locations (positive & negative);
- the regression loss is computed only on positive locations, via the indicator $\mathbb{1}_{\{c_{x,y}^* > 0\}}$, which is $1$ if $c_{x,y}^* > 0$ and $0$ otherwise.
Center-ness
$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$
- Center-ness reflects the normalized distance from a location to the center of its object.
- The square root is used to slow down the decay of center-ness.
- Center-ness ranges over $[0, 1]$.
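A direct translation of this target into code (the function name is illustrative):

```python
import math

def centerness_target(l, t, r, b):
    """Center-ness of a location given its regression target (l*, t*, r*, b*):
    1 at the exact center of the box, decaying toward the sides; the square
    root slows that decay."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```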
Remap of Feature & Image
A location $(x, y)$ on the feature map is mapped back onto the input image as
$$\big( \lfloor \tfrac{s}{2} \rfloor + xs ,\ \lfloor \tfrac{s}{2} \rfloor + ys \big)$$
which lies close to the center of the receptive field of location $(x, y)$.
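The remapping as a one-liner (with $s$ the total stride of the level; the function name is illustrative):

```python
def feature_to_image(x, y, s):
    """Map feature-map location (x, y) at total stride s back onto the
    input image, near the center of its receptive field."""
    return (s // 2 + x * s, s // 2 + y * s)
```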
Use Yourself
Related Work
Drawbacks of Anchor
- Detection performance is sensitive to hyper-parameters such as the size, aspect ratio, and number of anchors.
  In other words, anchors require careful manual design.
- A large number of anchors is needed to obtain a high recall rate,
  which causes an extreme imbalance between positive and negative samples during training.
- Anchors come with complex computation,
  such as computing IoU.
- Anchor sizes and aspect ratios are pre-defined, so they cannot handle shape variations (especially for small objects).
  Moreover, this "pre-defined" form also hurts the model's generalization ability; in other words, the designed anchors are task-specific.
DenseBox-Based
- It crops and resizes the image to handle bounding boxes of different sizes,
  so DenseBox has to run detection on an image pyramid,
  which contradicts the FCN philosophy of computing all convolutions only once.
- It is only used in specific domains and has difficulty with overlapping objects,
  because it cannot decide which object a given pixel should regress to.
- Its recall is relatively low.
Anchor-Based Detector
- Origin:
  sliding-window and proposal-based detectors
- Essence of anchors:
  pre-defined sliding windows (proposals) + offset regression
- Role of anchors:
  training samples for the detector
- Typical models:
  - Faster R-CNN
  - SSD
  - YOLO v2
YOLO v1
YOLO v1 is a typical anchor-free detector.
Idea
YOLO v1 uses only points near the center to predict bounding boxes:
whichever grid cell an object's center falls into is responsible for predicting that object's bounding box.
The rationale is that points near the center produce higher-quality detections.
Drawbacks of Points near Center
Using only the points near the center leads to low recall.
This is exactly why YOLO v2 went back to using anchors.
CornerNet
CornerNet is a typical anchor-free detector.
Steps
- corner detection
- corner grouping
- post-processing
Drawbacks of Corner
The post-processing is complex, requiring an additional distance metric.