As we move towards large-scale object detection, it is unrealistic to expect annotated training data, in the form of bounding box annotations around objects, for all object classes at sufficient scale, so methods capable of unseen object detection are required. We propose a novel zero-shot method based on training an end-to-end model that fuses semantic attribute prediction with visual features to propose object bounding boxes for seen and unseen classes. While we utilize semantic features during training, our method is agnostic to semantic information for unseen classes at test time. Our method retains the efficiency and effectiveness of YOLOv2 [1] for objects seen during training, while improving its performance for novel and unseen objects. The ability of state-of-the-art detection methods to learn discriminative object features that reject background proposals also limits their performance on unseen objects. We posit that, to detect unseen objects, we must incorporate semantic information into the visual domain so that the learned visual features reflect this information, leading to improved recall rates for unseen objects. We test our method on the PASCAL VOC and MS COCO datasets and observe significant improvements in the average precision of unseen classes.
//
In generalized zero-shot learning (gZSL), both seen and unseen classes must be recognized.
The classical ZSL setting has a limitation: it assumes the object's location is given precisely, so that the only task is recognition. In reality, many potentially unseen objects appear elsewhere in the scene. An intelligent system should be able not only to classify and recognize objects but also to localize them. This paper therefore considers this additional source of complexity and introduces zero-shot detection (ZSD).
ZSD = zero-shot + recognition + localization
Because of the scale required for training data, we need a framework that can detect both objects seen during training and classes never seen before.
//
Motivated by these challenges, we develop a novel zero-shot detection architecture (ZS-YOLO) for detection of unseen object classes. Our method is based on a seamless integration of semantic attribute predictors with YOLOv2's visual detection architecture. Specifically, we train an end-to-end model for zero-shot detection based on a novel multi-task loss objective, which incorporates semantic and visual information. Nevertheless, at test time, our method is agnostic to semantic information of unseen objects, and the semantic component of our network functions as a system for identifying semantic components that resemble trained classes. We choose YOLOv2 as the base detector for zero-shot detection because it is the state-of-the-art single-stage detector on existing benchmark datasets [1]. By changing the confidence loss and network backbone, our method can easily be applied to other single-stage detectors such as SSD [11] and RetinaNet [18]. In addition, ZS-YOLO can be viewed as a variation of a region proposal network (RPN), and thus can be integrated seamlessly with two-stage detectors such as Faster-RCNN [9]. Ultimately, our choice of YOLOv2 is partly incidental, based on the ease with which we can integrate other side information; the fundamental focus of the paper is on understanding and quantifying the utility of semantic attributes for zero-shot detection.
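To make the multi-task objective concrete, here is a minimal PyTorch sketch of how such a combined loss could be composed. The individual terms (MSE for boxes and attributes, binary cross-entropy for objectness) and the weights `w_box`, `w_attr`, `w_obj` are illustrative assumptions, not the paper's exact formulation, which is in the omitted middle sections.

```python
import torch
import torch.nn.functional as F

def zsd_multitask_loss(pred_boxes, gt_boxes,
                       pred_attrs, gt_attrs,
                       pred_obj, gt_obj,
                       w_box=1.0, w_attr=1.0, w_obj=1.0):
    """Sketch of a ZSD-style multi-task objective combining
    (1) box regression in the visual domain,
    (2) attribute regression in the semantic domain, and
    (3) objectness confidence prediction."""
    loss_box = F.mse_loss(pred_boxes, gt_boxes)            # visual-domain box loss
    loss_attr = F.mse_loss(pred_attrs, gt_attrs)           # semantic-domain attribute loss
    loss_obj = F.binary_cross_entropy(pred_obj, gt_obj)    # confidence loss
    return w_box * loss_box + w_attr * loss_attr + w_obj * loss_obj
```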
… (middle sections omitted; see the PDF)
Recognition vs. Object Detection: We focus on the problem of proposing object bounding boxes in the presence of seen and unseen classes, although our method can be readily extended to recognition if semantic information for unseen objects were available. We focus on object detection out of necessity (limitations of available datasets) as well as for practical reasons (for objects in in-the-wild scenarios, semantic information for unseen classes is not available).
To train ZS-YOLO, we learn an end-to-end detection network with a hierarchical architecture (Fig. 2): in the first level, we train the network with a multi-task loss to perform (1) bounding box prediction in the visual domain and (2) attribute prediction in the semantic domain. Next, the visual features of each bounding box proposal and its semantic attributes are combined as a multi-modal input to the final layer, which produces an objectness confidence score. This setup must be contrasted with existing detection frameworks, which predict the confidence score based solely on the visual space.
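A minimal PyTorch sketch of this two-level head is shown below. The layer widths, the use of 1×1 convolutions, and the module names are my assumptions for illustration, not the paper's exact implementation; the key point is that the objectness score is computed from the fused visual + semantic representation rather than from visual features alone.

```python
import torch
import torch.nn as nn

class ZeroShotDetectionHead(nn.Module):
    """Two-branch detection head: per-cell visual features feed a
    box-regression branch and a semantic-attribute branch; the fused
    visual + attribute tensor then yields the objectness confidence."""

    def __init__(self, feat_dim=1024, attr_dim=64, num_box_params=4):
        super().__init__()
        # (1) bounding box prediction in the visual domain
        self.box_branch = nn.Conv2d(feat_dim, num_box_params, kernel_size=1)
        # (2) attribute prediction in the semantic domain
        self.attr_branch = nn.Conv2d(feat_dim, attr_dim, kernel_size=1)
        # final layer: multi-modal (visual + semantic) -> objectness score
        self.confidence = nn.Conv2d(feat_dim + attr_dim, 1, kernel_size=1)

    def forward(self, visual_feat):
        boxes = self.box_branch(visual_feat)            # box offsets per cell
        attrs = self.attr_branch(visual_feat)           # predicted attributes
        fused = torch.cat([visual_feat, attrs], dim=1)  # multi-modal input
        objectness = torch.sigmoid(self.confidence(fused))
        return boxes, attrs, objectness
```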
Our contributions in this paper are: (1) a novel method for the zero-shot detection problem that seamlessly integrates semantic attribute predictors with visual features during training; (2) a dataset: we construct a new ZSD dataset with multiple seen/unseen class splits based on the existing PASCAL VOC and MS COCO datasets, and new performance metrics are also introduced and discussed; (3) a new ZSD detector, based on the visual network structure of YOLOv2 [1]. In contrast to state-of-the-art detectors, ZS-YOLO learns to predict semantic attributes as a side task during training, and produces object bounding boxes using both visual and semantic information. We observe significant improvements on both PASCAL VOC and MS COCO for unseen classes.