Motivation:
Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPs)?
【CC】Straight to the point: build a family of networks for different compute budgets.
We systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time.
【CC】The motivation is very pure: push network efficiency as far as possible without losing (and even while improving) accuracy. First, a BiFPN structure is proposed for the multi-scale feature fusion stage (this is the paper's biggest contribution!); second, building on the authors' own EfficientNet, a family of networks is derived to cover different compute budgets.
Approach:
Challenge 1: efficient multi-scale feature fusion
Since these different input features are at different resolutions, we observe they usually contribute to the fused output feature unequally. We propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features.
【CC】The observation is that features at different scales contribute unequally to the final output; based on this, a bidirectional pyramid structure with learnable weights is designed for feature fusion. Could an MLP also be used to learn the weights? Likewise, could self-attention work? People have already done this; see the sketch below.
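【CC】For reference, a minimal PyTorch sketch of the paper's fast normalized fusion, O = Σ_i (w_i / (ε + Σ_j w_j)) · I_i; the class name and the example tensors are hypothetical, and all inputs are assumed to be already resized to a common resolution:

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """Fast normalized fusion as described in the EfficientDet paper:
    O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i >= 0 via ReLU.
    Class name and defaults are illustrative, not the official code."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        # One learnable scalar weight per input feature map
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        # ReLU keeps the weights non-negative, so the normalized sum is stable
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)
        # Weighted sum of the (already same-sized) input feature maps
        return sum(wi * x for wi, x in zip(w, inputs))

# Usage: fuse a lateral feature with an upsampled higher-level feature
p4_in = torch.randn(1, 64, 32, 32)   # hypothetical lateral feature
p5_up = torch.randn(1, 64, 32, 32)   # hypothetical upsampled feature
p4_td = FastNormalizedFusion(num_inputs=2)([p4_in, p5_up])
```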
Challenge 2: model scaling
Recently, [36] demonstrates remarkable model efficiency for image classification by jointly scaling up network width, depth, and resolution. We observe that scaling up the feature network and box/class prediction network is also critical when taking into account both accuracy and efficiency. We propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width for all backbone, feature network, and box/class prediction networks.
【CC】This actually builds on prior work: jointly scaling backbone/head/resolution has a large impact on final accuracy. Based on this idea, the authors scale EfficientNet + BiFPN + head + resolution at different ratios, forming their own network family called EfficientDet; the scaling rules are sketched below.
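【CC】A small sketch of the paper's compound scaling equations, all driven by a single coefficient φ. Note the official configs round channel counts to hardware-friendly values and hand-adjust the largest models (e.g., the D5–D7 resolutions), so this reproduces the equations rather than the exact published table:

```python
def efficientdet_scaling(phi: int):
    """Compound scaling equations from the EfficientDet paper:
    one coefficient phi jointly scales the BiFPN width/depth,
    the box/class head depth, and the input resolution."""
    w_bifpn = 64 * (1.35 ** phi)   # BiFPN channels: W_bifpn = 64 * 1.35^phi
    d_bifpn = 3 + phi              # BiFPN layers:   D_bifpn = 3 + phi
    d_head = 3 + phi // 3          # head depth:     D_box = D_class = 3 + floor(phi/3)
    r_input = 512 + phi * 128      # input size:     R_input = 512 + 128 * phi
    return round(w_bifpn), d_bifpn, d_head, r_input

# EfficientDet-D0 ... D6 roughly follow these equations
for phi in range(7):
    print(f"D{phi}:", efficientdet_scaling(phi))
```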
BiFPN
Multi-scale feature fusion aims to aggregate features at different resolutions.
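【CC】The paper formalizes this as finding a transformation f over a list of multi-scale features, with conventional top-down FPN being one instance of f:

```latex
% Formalization from the EfficientDet paper: given features at levels
% l_1, l_2, ..., find a transformation f that aggregates them.
\vec{P}^{\,in} = \left(P^{in}_{l_1}, P^{in}_{l_2}, \ldots\right),
\qquad \vec{P}^{\,out} = f\!\left(\vec{P}^{\,in}\right)
% Example: conventional top-down FPN (levels 3--7) instantiates f as
% P_7^{out} = Conv(P_7^{in}),
% P_6^{out} = Conv(P_6^{in} + Resize(P_7^{out})), ..., down to P_3^{out}.
```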