SSD paper note

yanghaoplus

License: CC 4.0 BY-SA

Category: object detection. Tags: object detection

Article link: https://blog.youkuaiyun.com/yanghao201607030101/article/details/109531990

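SSD is a single-stage object detection algorithm. It uses multi-scale feature maps to predict objects of different sizes and aspect ratios, and applies hard negative mining to improve detection accuracy, achieving fast and accurate object detection.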

SSD: Single Shot MultiBox Detector

Authors: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

Abstract:

SSD has the following features:

It discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.
At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape.
The network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.
SSD is simple: it completely eliminates proposal generation and the subsequent pixel or feature resampling stages (making it one-stage) and encapsulates all computation in a single network.
It achieves competitive accuracy and is much faster, while providing a unified framework for both training and inference.
It has much better accuracy than other single-stage methods, even with a smaller input image size.
Code: https://github.com/weiliu89/caffe/tree/ssd

1 Introduction

Current state-of-the-art object detection systems (e.g. Faster R-CNN) follow this pipeline: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This prevailing approach is too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications.
So far, significantly increased speed has come only at the cost of significantly decreased detection accuracy.
SSD follows OverFeat and YOLO in eliminating bounding box proposals and the subsequent pixel or feature resampling stage, which makes it fast while remaining accurate. It outperforms YOLO through a series of improvements:

Using a small convolutional filter to predict object categories and offsets in bounding box locations.
Using separate predictors (filters) for different aspect ratio detections.
Applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales (especially important).

The authors summarized their contributions as follows:

We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.
These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs accuracy trade-off.
Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.

2 The Single Shot Detector (SSD)

2.1 Model

The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections.
Auxiliary structure is added to the base network to produce detections with the following key features:

Multi-scale feature maps for detection.
Convolutional predictors for detection instead of fully-connected layers.
Default boxes and aspect ratios: with k default boxes per location, each predicting c class scores and 4 box offsets, this yields (c + 4)kmn outputs for an m × n feature map.
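
To make the (c + 4)kmn count concrete, here is a minimal sketch that sums it over the six prediction layers of SSD300. The layer sizes and boxes per location follow the SSD300 configuration reported in the paper; the name `SSD300_LAYERS`, the helper `count_outputs`, and the choice to count background as one of the c classes are assumptions made for this illustration.

```python
# SSD300 prediction layers: (feature map side length, default boxes per location k).
# Values follow the SSD300 setup in the paper; the names here are illustrative only.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_outputs(layers, num_classes):
    """Return (total default boxes, total predicted values) summed over all layers."""
    total_boxes = 0
    total_outputs = 0
    for side, k in layers:
        m = n = side  # the feature maps are square
        total_boxes += k * m * n
        # Each default box gets c class scores plus 4 box offsets.
        total_outputs += (num_classes + 4) * k * m * n
    return total_boxes, total_outputs


# Assumption: c = 21, i.e. 20 PASCAL VOC classes plus one background class.
boxes, outputs = count_outputs(SSD300_LAYERS, num_classes=21)
print(boxes, outputs)
```

The box total is the 8732 default boxes per image that the paper quotes for SSD300.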

2.2 Training

The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs.

Matching strategy:
Begin by matching each ground truth box to the default box with the best jaccard overlap, then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). This allows the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.
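
The two matching steps can be sketched in plain Python as follows. This is only an illustration, not the authors' implementation: boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, the helper names `iou` and `match` are hypothetical, and tie-breaking (a default box that clears the threshold for several ground truths takes the one with the highest overlap) is an assumption.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(defaults, truths, threshold=0.5):
    """For each default box, return the index of its matched ground truth, or -1."""
    assigned = [-1] * len(defaults)
    # Threshold step: a default box matches a ground truth whose overlap exceeds
    # the threshold (the best such ground truth, by assumption).
    for i, d in enumerate(defaults):
        best_j, best_overlap = -1, threshold
        for j, g in enumerate(truths):
            overlap = iou(d, g)
            if overlap > best_overlap:
                best_j, best_overlap = j, overlap
        assigned[i] = best_j
    # Best-overlap step: every ground truth claims its single best default box,
    # even if that overlap is below the threshold. Applied last so it wins.
    for j, g in enumerate(truths):
        best_i = max(range(len(defaults)), key=lambda i: iou(defaults[i], g))
        assigned[best_i] = j
    return assigned


defaults = [(0, 0, 10, 10), (0, 0, 9, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
truths = [(0, 0, 10, 10), (22, 22, 34, 34)]
matches = match(defaults, truths)
print(matches)
```

In this toy example the first ground truth is matched by two default boxes (both above 0.5), while the second ground truth only gets a match through the best-overlap step.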
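Training objective:
The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):
$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$
N is the number of matched default boxes; if N = 0, the loss is set to 0.
Localization loss: a Smooth L1 loss between the predicted box (l) and the ground truth box (g) parameters, regressing offsets for the center (cx, cy), width (w), and height (h) of the default bounding box (d).

Confidence loss: the softmax loss over the class confidences (c). The weight term α is set to 1 by cross validation.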
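
As a rough illustration of the localization term, the sketch below encodes a ground truth box relative to a default box (center offsets divided by the default box size, log ratios for width and height, as defined in the paper) and applies Smooth L1 to the difference from the prediction. The function names `encode`, `smooth_l1`, `loc_loss`, and `total_loss` are made up for this note, boxes are assumed to be (cx, cy, w, h) tuples, and the handling of a single matched pair is a simplification.

```python
import math


def encode(g, d):
    """Regression target for ground truth g relative to default box d (both cx, cy, w, h)."""
    return (
        (g[0] - d[0]) / d[2],   # center x offset, scaled by default width
        (g[1] - d[1]) / d[3],   # center y offset, scaled by default height
        math.log(g[2] / d[2]),  # log width ratio
        math.log(g[3] / d[3]),  # log height ratio
    )


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def loc_loss(pred, g, d):
    """Localization loss for one matched (default box, ground truth) pair."""
    target = encode(g, d)
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def total_loss(conf_loss_sum, loc_loss_sum, num_matched, alpha=1.0):
    """Combine summed confidence and localization losses; zero if nothing matched."""
    if num_matched == 0:
        return 0.0
    return (conf_loss_sum + alpha * loc_loss_sum) / num_matched


d = (0.5, 0.5, 0.2, 0.2)
g = (0.6, 0.5, 0.2, 0.4)
target = encode(g, d)
print(target, loc_loss((0.0, 0.0, 0.0, 0.0), g, d))
```

A prediction equal to the encoded target gives a localization loss of zero, which is what the regression is trained towards.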

Choosing scales and aspect ratios for default boxes: To handle different object scales, SSD uses feature maps from several different layers in a single network, while also sharing parameters across all object scales.
Previous work has shown that feature maps from the lower layers can improve semantic segmentation quality, because lower layers capture finer details of the input, and that global context pooled from a feature map can help smooth the segmentation results.
Feature maps from different levels within a network have different (empirical) receptive field sizes. Within the SSD framework, the default boxes do not necessarily need to correspond to the actual receptive fields of each layer. The tiling of default boxes is designed so that specific feature maps learn to be responsive to particular scales of the objects.
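The scale of the default boxes for each feature map is computed as:
$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$
where m is the number of feature maps used for prediction, s_min is 0.2, and s_max is 0.9, so the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9. The paper also gives details on the center of each default box and on the different aspect ratios.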
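
A small sketch of how the scales and box shapes could be generated, assuming the paper's settings (s_min = 0.2, s_max = 0.9, aspect ratios 1, 2, 3, 1/2, 1/3, width s_k·sqrt(a_r), height s_k/sqrt(a_r), plus one extra box of scale sqrt(s_k · s_{k+1}) for aspect ratio 1). The function names are hypothetical, and all sizes are relative to the image.

```python
import math


def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced scales s_1..s_m for the m prediction feature maps."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def default_box_shapes(s_k, s_next, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) of each default box at one location, relative to the image."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box between this layer's scale and the next layer's scale.
    s_extra = math.sqrt(s_k * s_next)
    shapes.append((s_extra, s_extra))
    return shapes


def default_box_centers(f_k):
    """Centers ((i + 0.5) / f_k, (j + 0.5) / f_k) for a square f_k x f_k feature map."""
    return [((i + 0.5) / f_k, (j + 0.5) / f_k) for j in range(f_k) for i in range(f_k)]


scales = default_box_scales(6)
shapes = default_box_shapes(scales[0], scales[1])
centers = default_box_centers(3)
print([round(s, 2) for s in scales])
```

With six feature maps the scales step evenly from 0.2 to 0.9, and each location gets up to six box shapes of roughly equal area but different aspect ratios.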

Hard negative mining: After the matching step, most of the default boxes are negatives. Instead of using all the negative examples, we sort them by the highest confidence loss for each default box and pick the top ones so that the ratio between negatives and positives is at most 3:1.
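
The selection rule can be sketched as below. It assumes the per-box confidence losses have already been computed; the function name `hard_negative_mining` and its arguments are illustrative, not taken from the reference code.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the indices of the negatives kept for the loss.

    conf_losses: confidence loss of every default box.
    is_positive: True for default boxes matched to a ground truth.
    """
    num_pos = sum(1 for p in is_positive if p)
    negatives = [i for i, p in enumerate(is_positive) if not p]
    # Hardest negatives first: those with the highest confidence loss.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    # Keep at most neg_pos_ratio negatives per positive.
    return sorted(negatives[: neg_pos_ratio * num_pos])


conf_losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.9, 0.3, 0.7]
is_positive = [True, False, False, False, False, False, False, False]
kept = hard_negative_mining(conf_losses, is_positive)
print(kept)
```

With one positive and seven negatives, only the three negatives with the largest loss contribute to training.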
Data augmentation: To make the model more robust to various object sizes and shapes, training patches are sampled with a size in [0.1, 1] of the original image size and an aspect ratio between 0.5 and 2. Each sampled patch is then resized to a fixed size and horizontally flipped with probability 0.5, in addition to applying some photometric distortions.
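
The patch sampling described here might look roughly like the following. This sketch covers only the size, aspect ratio, and flip decisions mentioned in this note; it works in normalized image coordinates, and reading "size" as a fraction of each image side is an assumption, as are the name `sample_patch` and the retry limit.

```python
import math
import random


def sample_patch(rng, min_scale=0.1, max_scale=1.0, min_ratio=0.5, max_ratio=2.0,
                 max_tries=50):
    """Sample a crop in normalized coordinates plus a horizontal flip decision.

    Returns ((xmin, ymin, xmax, ymax), flip). Falls back to the whole image
    if no valid patch is found within max_tries.
    """
    flip = rng.random() < 0.5  # horizontal flip with probability 0.5
    for _ in range(max_tries):
        scale = rng.uniform(min_scale, max_scale)
        ratio = rng.uniform(min_ratio, max_ratio)
        w = scale * math.sqrt(ratio)
        h = scale / math.sqrt(ratio)
        if w > 1.0 or h > 1.0:
            continue  # patch would not fit inside the image, try again
        x = rng.uniform(0.0, 1.0 - w)
        y = rng.uniform(0.0, 1.0 - h)
        return (x, y, x + w, y + h), flip
    return (0.0, 0.0, 1.0, 1.0), flip


rng = random.Random(0)
patches = [sample_patch(rng) for _ in range(200)]
```

After cropping, the patch would be resized to the fixed network input size and passed through the photometric distortions, which are not shown here.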

3 Experimental Results

All experiments are based on VGG16.
Pros and cons:
Pros: SSD has less localization error, performs really well on large objects, and is very robust to different object aspect ratios.
Cons: SSD has more confusions with similar object categories and much worse performance on smaller objects than on bigger objects.

3.2 Model analysis

Data augmentation is crucial.
More default box shapes are better.
The atrous (dilated convolution) version of VGG16 is faster.
Multiple output layers at different resolutions are better.

Results on VOC, COCO, and ILSVRC

3.6 Data Augmentation for Small Object Accuracy

3.7 Inference time

A confidence threshold of 0.01 filters out most boxes. NMS is then applied per class with a jaccard overlap of 0.45, and the top 200 detections per image are kept.
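
The post-processing steps can be sketched in plain Python as follows. This is a simple greedy NMS written for illustration, not the optimized implementation used in the paper; boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and the names `nms_single_class` and `detect` and the dict-based score layout are hypothetical.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_single_class(boxes, scores, conf_thresh=0.01, overlap_thresh=0.45):
    """Greedy NMS for one class; returns kept box indices, best score first."""
    # Drop low-confidence boxes first, then visit the rest by descending score.
    order = sorted((i for i, s in enumerate(scores) if s > conf_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= overlap_thresh for j in keep):
            keep.append(i)
    return keep


def detect(boxes, class_scores, top_k=200):
    """class_scores maps class name -> per-box scores; returns (score, class, box)."""
    detections = []
    for cls, scores in class_scores.items():
        for i in nms_single_class(boxes, scores):
            detections.append((scores[i], cls, boxes[i]))
    detections.sort(key=lambda det: det[0], reverse=True)
    return detections[:top_k]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.005]
kept = nms_single_class(boxes, scores)
results = detect(boxes, {"car": scores})
print(kept)
```

Here the second box is suppressed because it overlaps the highest-scoring box by more than 0.45, and the last box is dropped by the 0.01 confidence threshold.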

Conclusions

A key feature of our model is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network. This representation allows us to efficiently model the space of possible box shapes.

Our SSD512 model significantly outperforms the state-of-the-art Faster R-CNN [2] in terms of accuracy on PASCAL VOC and COCO, while being 3× faster. Our real-time SSD300 model runs at 59 FPS, which is faster than the current real-time YOLO [5] alternative, while producing markedly superior detection accuracy.

QA:
what is tiling?