论文:https://arxiv.org/abs/2412.20645
代码:https://github.com/THU-MIG/YOLO-UniOW
YOLO-UniOW: Efficient Universal Open-World Object Detection
YOLO UniOW:高效的通用开放世界对象检测
Abstract 摘要
Traditional object detection models are constrained by the limitations of closed-set datasets, detecting only categories encountered during training. While multimodal models have extended category recognition by aligning text and image modalities, they introduce significant inference overhead due to cross-modality fusion and still remain restricted by predefined vocabulary, leaving them ineffective at handling unknown objects in open-world scenarios. In this work, we introduce Universal Open-World Object Detection (Uni-OWD), a new paradigm that unifies open-vocabulary and open-world object detection tasks. To address the challenges of this setting, we propose YOLO-UniOW, a novel model that advances the boundaries of efficiency, versatility, and performance. YOLO-UniOW incorporates Adaptive Decision Learning to replace computationally expensive cross-modality fusion with lightweight alignment in the CLIP latent space, achieving efficient detection without compromising generalization. Additionally, we design a Wildcard Learning strategy that detects out-of-distribution objects as “unknown” while enabling dynamic vocabulary expansion without the need for incremental learning. This design empowers YOLO-UniOW to seamlessly adapt to new categories in open-world environments. Extensive experiments validate the superiority of YOLO-UniOW, achieving 34.6 AP and 30.0 APr on LVIS with an inference speed of 69.6 FPS. The model also sets benchmarks on M-OWODB, S-OWODB, and nuScenes datasets, showcasing its unmatched performance in open-world object detection. Code and models are available at https://github.com/THU-MIG/YOLO-UniOW .
传统的目标检测模型受到闭集数据集的限制,仅能检测训练过程中遇到的类别。虽然多模态模型通过对齐文本和图像模态扩展了类别识别,但由于跨模态融合,它们引入了显著的推理开销,并且仍然受到预定义词汇的限制,使得它们在处理开放世界场景中的未知对象时失效。在这项工作中,我们引入了通用开放世界对象检测(Uni-OWD),这是一种将开放词汇和开放世界对象检测任务相统一的新范式。为了解决这种设置带来的挑战,我们提出了YOLO-UniOW,这是一种在效率、多功能性和性能方面不断突破边界的新模型。YOLO-UniOW结合了自适应决策学习,以CLIP潜在空间中的轻量级对齐代替计算开销较大的跨模态融合,在不影响泛化能力的情况下实现高效检测。此外,我们设计了一种通配符学习策略,该策略将分布外的对象检测为“未知”,同时支持动态词汇表扩展,而不需要增量学习。这种设计使YOLO-UniOW能够无缝地适应开放世界环境中的新类别。大量实验验证了YOLO-UniOW的优越性,在LVIS上实现了34.6 AP和30.0 APr,推理速度为69.6 FPS。该模型还在M-OWODB、S-OWODB和nuScenes数据集上树立了基准,展示了其在开放世界对象检测中无与伦比的性能。代码和模型可在 https://github.com/THU-MIG/YOLO-UniOW 上获得。
图 1:速度-精度权衡曲线。YOLO-UniOW 和最近的方法在 LVIS minival 数据集上的速度和准确性的比较。推理速度是在没有TensorRT的单个NVIDIA V100 GPU上测量的。圆圈大小表示模型大小
1.Introduction 引言
Object detection has long been one of the most fundamental and widely applied techniques in the field of computer vision, with extensive applications in security 1, autonomous driving 2, and medical imaging 3. Many remarkable works have achieved breakthroughs for object detection, such as Faster R-CNN 4, SSD 5, RetinaNet 6, etc.
物体检测长期以来一直是计算机视觉领域最基本、应用最广泛的技术之一,在安全1、自动驾驶2和医学成像3方面有着广泛的应用。许多杰出的工作在目标检测方面取得了突破,如Faster R-CNN4、SSD5、RetinaNet6等。
In recent years, the YOLO (You Only Look Once) 7, 8, 9, 10 series of models has gained widespread attention for its outstanding detection performance and real-time efficiency. The recent YOLOv10 10 establishes a new standard for object detection by employing a consistent dual assignment strategy, achieving efficient NMS-free training and inference.
近年来,YOLO (You Only Look Once) 7, 8, 9, 10 系列模型因其出色的检测性能和实时性而受到广泛关注。最近的 YOLOv10 10 通过使用一致的双重分配策略为目标检测建立了一个新的标准,实现了高效的无 NMS 训练和推理。
However, traditional YOLO-based object detection models are often confined to a closed set definition, where objects of interest belong to a predefined set of categories.
然而,传统的基于YOLO的目标检测模型往往局限于闭集定义,其中感兴趣的对象属于一组预定义的类别。
In practical open-world scenarios, when encountering unknown categories that have not been seen in the training datasets, these objects are often misclassified as background. This inability of models to recognize novel objects can also negatively impact the accuracy of known categories, limiting their robust application in real-world scenarios.
在实际的开放世界场景中,当遇到训练数据集中没有遇到的未知类别时,这些对象通常会被误分类为背景。模型无法识别新对象也会对已知类别的准确性产生负面影响,限制了它们在现实世界场景中的稳健应用。
图 2. 检测框架的比较。(a) 具有跨模态融合的开放词汇检测器。(b) 我们具有自适应决策学习的高效开放词汇检测器。( c) 开放世界和开放词汇检测器。(d) 我们的用于开放词汇和开放世界任务的 Uni-OWD 检测器。
Thanks to the development of vision-language models, such as 11, 12, 13, 14, combining their open-vocabulary capabilities with the efficient object detection of YOLO presents an appealing and promising approach for real-time open-world object detection. YOLO-World 15 is a pioneering attempt, where YOLOv8 8 is used as the object detector, and CLIP’s text encoder is integrated as an open-vocabulary classifier for region proposals (i.e., anchors in YOLOv8). The decision boundary for object recognition is derived from representations of class names generated by CLIP’s text encoder. Additionally, a vision-language path aggregation network (RepVL-PAN) using reparameterization 16, 17 is introduced to comprehensively aggregate text and image features for better cross-modality fusion.
由于视觉语言模型的发展,例如11, 12, 13, 14,将它们的开放词汇能力与 YOLO 的高效对象检测相结合,为实时开放世界对象检测提供了一种很有吸引力且有前途的方法。YOLO-World 15 是一个开创性的尝试,其中 YOLOv8 8 被用作对象检测器,CLIP的文本编码器被集成为区域建议的开放词汇分类器(即YOLOv8中的锚点)。物体识别的决策边界来源于CLIP文本编码器生成的类名的表示。此外,引入了使用重新参数化 16, 17 的视觉语言路径聚合网络 (RepVL-PAN),以全面聚合文本和图像特征以获得更好的跨模态融合。
Although YOLO-World is effective for open-vocabulary object detection (OVD), it still relies on a predefined vocabulary of class names, which must include all categories that are expected to be detected. This reliance significantly limits its ability to dynamically adapt to newly emerging categories, as determining unseen class names in advance is inherently challenging, preventing it from being truly open-world. Moreover, the inclusion of RepVL-PAN introduces additional computational costs, especially with large vocabulary sizes, making it less efficient for real-world applications.
尽管 YOLO-World 对开放词汇对象检测 (OVD) 有效,但它仍然依赖于预定义的类名称词汇表,其中必须包含所有预期要检测的类别。这种依赖极大地限制了它动态适应新出现类别的能力,因为提前确定未见过的类名称本身就具有挑战性,使其无法成为真正的开放世界模型。此外,引入 RepVL-PAN 带来了额外的计算成本,尤其是在词汇量较大的情况下,这使得它在实际应用中效率较低。
In this work, we first advocate a new setting of Universal Open-World Object Detection (Uni-OWD), in which we encourage realizing open-world object detection (OWOD) and open-vocabulary object detection (OVD) with one unified model. Specifically, it emphasizes that the model can not only recognize categories unseen during training but also effectively classify unknown objects as “unknown”. Additionally, we call for an efficient solution following YOLO-World to meet the efficiency requirement in real-world applications. To achieve these goals, we propose the YOLO-UniOW model, which achieves effective universal open-world detection while also enjoying greater efficiency.
在这项工作中,我们首先提倡一种新的通用开放世界对象检测设置(Uni-OWD),鼓励使用一个统一的模型实现开放世界对象检测(OWOD)和开放词汇对象检测(OVD)。具体来说,它强调模型不仅可以识别训练期间未见过的类别,还可以有效地将未知对象分类为“未知”。此外,我们呼吁在 YOLO-World 的基础上提供高效的解决方案,以满足实际应用中的效率要求。为了实现这些目标,我们提出了 YOLO-UniOW 模型,在实现有效的通用开放世界检测的同时享有更高的效率。
Our YOLO-UniOW emphasizes several insights for efficient Uni-OWD. (1) Efficiency. Besides using the recent YOLOv10 10 as a more efficient object detector, we introduce a novel adaptive decision learning strategy, dubbed AdaDL, to wipe out the expensive cross-modality vision-language aggregation in RepVL-PAN, as illustrated in Fig. 2 (b). The goal of AdaDL is to adaptively capture task-related decision representations for object detection without sacrificing the generalization ability of CLIP. Therefore, we can align the image features and class features directly in the latent CLIP space without any heavy cross-modality fusion operations, achieving efficient and outstanding detection performance (see Fig. 1). (2) Versatility. The challenge of open-world object detection (OWOD) lies in differentiating all unseen objects with only one “unknown” category, without any supervision about unknown objects. To solve this issue, we design a wildcard learning method that uses a wildcard embedding to unlock the generic power of the open-vocabulary model. This wildcard embedding is optimized through simple self-supervised learning, which seamlessly adapts to dynamic real-world scenarios. As shown in Fig. 2 (d), our YOLO-UniOW can not only benefit from the dynamic expansion of the known category set like YOLO-World, i.e., open-vocabulary detection, but can also highlight any out-of-distribution objects with the “unknown” category for open-world detection. (3) High performance. We evaluated our zero-shot open-vocabulary capability on LVIS 18, and the open-world approach on benchmarks such as M-OWODB 19, S-OWODB 20, and nuScenes 21. Experimental results show that our method can significantly outperform existing state-of-the-art methods for efficient OVD, achieving 34.6 AP and 30.0 APr on the LVIS dataset with a speed of 69.6 FPS. Besides, YOLO-UniOW also performs well in both zero-shot and task-incremental learning for open-world evaluation. These results well demonstrate the effectiveness of the proposed YOLO-UniOW.
我们的 YOLO-UniOW 强调了实现高效 Uni-OWD 的几个见解。(1) 效率。除了使用最近的 YOLOv10 10 作为更高效的对象检测器外,我们还引入了一种新的自适应决策学习策略 AdaDL,以消除 RepVL-PAN 中昂贵的跨模态视觉语言聚合,如图 2 (b) 所示。AdaDL 的目标是在不牺牲 CLIP 泛化能力的情况下,自适应地捕获用于目标检测的任务相关决策表示。因此,我们可以直接在 CLIP 潜在空间中对齐图像特征和类别特征,而不需要任何繁重的跨模态融合操作,实现了高效、出色的检测性能(见图 1)。(2) 多功能性。开放世界对象检测 (OWOD) 的挑战在于,在没有任何关于未知对象监督的情况下,仅用一个“未知”类别来区分所有未见过的对象。为了解决这个问题,我们设计了一种通配符学习方法,该方法使用通配符嵌入来解锁开放词汇模型的通用能力。这种通配符嵌入通过简单的自监督学习进行优化,能够无缝适应动态的现实世界场景。如图 2 (d) 所示,我们的 YOLO-UniOW 不仅可以像 YOLO-World 一样受益于已知类别集的动态扩展,即开放词汇检测,还可以将任何分布外对象标记为“未知”类别,用于开放世界检测。(3) 高性能。我们在 LVIS 18 上评估了零样本开放词汇能力,并在 M-OWODB 19、S-OWODB 20 和 nuScenes 21 等基准上评估了开放世界方法。实验结果表明,该方法的性能明显优于现有的高效 OVD 方法,在 LVIS 数据集上实现了 34.6 AP、30.0 APr,速度为 69.6 FPS。此外,YOLO-UniOW 在开放世界评估的零样本和任务增量学习中也表现良好。这些结果很好地证明了所提出的 YOLO-UniOW 的有效性。
The contributions of this work are as follows:
这项工作的贡献如下:
• We advocate a new setting of Universal Open-World Object Detection, dubbed Uni-OWD to solve the challenges of dynamic object categories and unknown target recognition with one unified model. We provide an efficient solution based on YOLO detector, ending up with our YOLO-UniOW.
我们提倡一种新的通用开放世界对象检测设置,称为 Uni-OWD,以用一个统一的模型解决动态对象类别和未知目标识别的挑战。我们提供了一个基于 YOLO 检测器的高效解决方案,最终得到我们的 YOLO-UniOW。
• We design a novel adaptive decision learning (AdaDL) strategy to adapt the representation of the decision boundary to the task of Uni-OWD without sacrificing the generalization ability of CLIP. Thanks to AdaDL, we can leave out the heavy computation of the cross-modality fusion operations used in previous works.
我们设计了一种新的自适应决策学习(AdaDL)策略,在不牺牲CLIP泛化能力的情况下,使决策边界的表示适应Uni-OWD任务。得益于 AdaDL,我们可以省去以前工作中使用的跨模态融合操作的繁重计算。
• We introduce wildcard learning to detect unknown objects, enabling iterative vocabulary expansion and seamless adaptation to dynamic real-world scenarios. This strategy eliminates the reliance on incremental learning strategies.
我们引入了通配符学习来检测未知对象,从而实现迭代词汇扩展和无缝适应动态现实场景。该策略消除了对增量学习策略的依赖。
• Extensive experiments across benchmarks for both open-vocabulary object detection and open-world object detection show that YOLO-UniOW can significantly outperform existing methods, well demonstrating its versatility and superiority.
在开放词汇对象检测和开放世界对象检测的基准上进行的大量实验表明,YOLO-UniOW 可以显着优于现有方法,很好地展示了其多功能性和优越性。
2. Related Work 相关工作
2.1. Open-Vocabulary Object Detection 开放词汇表对象检测
Open-Vocabulary Object Detection (OVD) has emerged as a prominent research direction in computer vision in recent years. Unlike traditional object detection, OVD enables detection to dynamically expand categories without relying heavily on the fixed set of categories defined in the training dataset. Several works have explored leveraging Vision-Language Models (VLMs) for enhancing object detection. For instance, 15, 22, 23, 24, 25, 26, 60, 27, 28 utilize large-scale, easily accessible text-image pairs for pretraining, resulting in more robust and generalizable detectors, which are subsequently fine-tuned on specific target datasets. In parallel, 29, 30, 31, 32 focus on distilling the alignment of visual-text knowledge from VLMs into object detection, emphasizing the design of distillation losses and the generation of object proposals. Additionally, 33, 34, 54 investigate various prompt modeling techniques to more effectively transfer VLM knowledge to the detector, enhancing its performance in open-vocabulary and unseen category tasks.
近年来,开放词汇对象检测 (OVD) 已成为计算机视觉的一个重要研究方向。与传统的目标检测不同,OVD 能够动态扩展检测类别,而不严重依赖于训练数据集中定义的固定类别集。一些工作探索了利用视觉语言模型 (VLM) 来增强目标检测。例如,15, 22, 23, 24, 25, 26, 60, 27, 28 利用大规模、易于获取的文本-图像对进行预训练,从而得到更鲁棒、更具泛化性的检测器,随后在特定目标数据集上进行微调。与此同时,29, 30, 31, 32 专注于将 VLM 中视觉-文本知识的对齐蒸馏到目标检测中,强调蒸馏损失的设计和对象提议的生成。此外,33, 34, 54 研究了各种提示建模技术,以更有效地将 VLM 知识迁移到检测器中,提高其在开放词汇和未见类别任务中的表现。
2.2. Open-World Object Detection 开放世界对象检测
Open-World Object Detection (OWOD) is an emerging direction in object detection, aiming to address the challenge of dynamic category detection. The goal is to enable detection models to identify known categories while recognizing unknown categories, and to incrementally adapt to new categories over time. Through methods such as manual annotation or active learning 35, 36, 37, unknown categories can be progressively converted into known categories, facilitating continuous learning and adaptation.
开放世界目标检测(OWOD)是目标检测的一个新兴方向,旨在解决动态类别检测的挑战。目标是使检测模型能够识别已知类别,同时识别未知类别,并随着时间的推移逐步适应新的类别。通过手动注释或主动学习等方法35,36,37,可以将未知类别逐步转换为已知类别,促进持续学习和适应。
The concept of OWOD was first introduced by Joseph et al. 38, whose framework relies on incremental learning. By incorporating an energy-based object recognizer into the detection head, the model gains the ability to identify unknown categories. However, this method depends on replay mechanisms, requiring access to historical task data to update the model. Additionally, it often exhibits a bias toward known categories when handling unknown objects, limiting its generalization capabilities. To address these limitations, many subsequent studies have been proposed. For instance, [35, 67] improved the experimental setup for OWOD by introducing more comprehensive benchmark datasets and stricter evaluation metrics, enhancing the robustness of unknown category detection. While these improvements achieved promising results in controlled experimental settings, their adaptability to complex scenarios and dynamic category changes remains inadequate. Recent research has shifted focus toward optimizing the feature space to better separate known and unknown categories. Methods such as 39, 40, 41, 42 propose advancements in feature space extraction, enabling models to more effectively extract feature information for the localization and identification of unknown objects. Recently, several methods 43, 44, 45 have emerged, leveraging pretrained models for open-world object detection and achieving significant improvements.
Joseph等人38首先介绍了OWOD的概念,该框架依赖于增量学习。通过将基于能量的对象识别器合并到检测头中,该模型获得了识别未知类别的能力。然而,这种方法依赖于重放机制,需要访问历史任务数据来更新模型。此外,在处理未知对象时,它通常对已知类别表现出偏差,从而限制了其泛化能力。为了解决这些限制,已经提出了许多后续研究。例如,46, 47通过引入更全面的基准数据集和更严格的评估指标改进了OWOD的实验设置,增强了未知类别检测的鲁棒性。虽然这些改进在受控实验设置中取得了有希望的结果,但它们对复杂场景的适应性和动态类别变化仍然不足。最近的研究已转向优化特征空间以更好地分离已知和未知类别。39, 40, 41, 42等方法提出了特征空间提取的进步,使模型能够更有效地提取特征信息,用于未知对象的定位和识别。最近,出现了几种方法 43, 44, 45,利用预训练模型进行开放世界对象检测并实现显着的改进。
2.3. Parameter Efficient Learning 参数高效学习
Prompt learning has emerged as a significant research direction in both natural language processing (NLP) and computer vision. By providing carefully designed prompts to pre-trained large models such as 13, prompt learning enables models to perform specific tasks in unsupervised or semi-supervised settings efficiently. Methods such as 48, 49, 50, 51, 52, 53 introduce learnable prompt embeddings, moving beyond fixed, handcrafted prompts to enhance flexibility across various visual downstream tasks. And DetPro 33 is the first to apply it to open-vocabulary object detection, achieving significant improvements using learnable prompts derived from text inputs.
提示学习已成为自然语言处理 (NLP) 和计算机视觉的重要研究方向。通过为预训练大模型(例如 13)提供精心设计的提示,提示学习使模型能够在无监督或半监督设置中高效地执行特定任务。48, 49, 50, 51, 52, 53 等方法引入了可学习的提示嵌入,超越了固定的手工提示,以增强各种视觉下游任务的灵活性。DetPro 33 是第一个将其应用于开放词汇对象检测的工作,使用从文本输入派生的可学习提示取得了显著的改进。
Low-Rank Adaptation (LoRA) 54 and its derivatives 29, 55, 56, as parameter-efficient fine-tuning techniques, have demonstrated outstanding performance in adapting large models. By inserting trainable low-rank decomposition modules into the weight matrices of pre-trained models without altering the original weights, LoRA significantly reduces the number of trainable parameters. CLIPLoRA 55 introduces LoRA into VLM models as a replacement for adapters and prompts, enabling fine-tuning for downstream tasks with faster training speeds and improved performance.
低秩自适应(LoRA)54 及其衍生方法 29, 55, 56 作为一种参数高效的微调技术,在适配大型模型方面表现出色。通过在不改变原始权重的情况下将可训练的低秩分解模块插入预训练模型的权重矩阵中,LoRA显著减少了可训练参数的数量。CLIPLoRA 55 将LoRA引入VLM模型,作为适配器和提示的替代品,能够以更快的训练速度和更高的性能对下游任务进行微调。
图 3. 我们提出的高效通用开放世界对象检测管道。Open-Vocabulary Pretraining(左):使用多模态双头匹配进行有效的端到端目标检测,文本编码器中的 AdaDL 进行自适应决策边界学习。开放世界微调(右):利用校准的文本嵌入和检测器在通配符的帮助下自适应地检测已知和未知对象。采用过滤策略去除重复的未知预测,确保高效有效的开放世界目标检测。
3. Efficient Universal Open-World Object Detection 高效的通用开放世界目标检测
3.1. Problem Definition 问题定义
Universal Open-World Object Detection (Uni-OWD) extends the challenges of Open Vocabulary Detection (OVD) and Open-World Object Detection (OWOD), aiming to create a unified framework that not only detects known objects in the vocabulary but also dynamically adapts to unknown objects while maintaining scalability and efficiency in realworld scenarios.
通用开放世界对象检测 (Uni-OWD) 扩展了开放词汇检测 (OVD) 和开放世界对象检测 (OWOD) 的挑战,旨在创建一个统一框架,不仅可以检测词汇表中的已知对象,还可以动态适应未知对象,同时保持现实场景的可扩展性和效率。
Define the object category set as $C = C_k \cup C_{unk}$, where $C_k$ represents the set of known categories, $C_{unk}$ represents the set of unknown categories, and $C_k \cap C_{unk} = \emptyset$. Given an input image $\mathcal{I}$ and a vocabulary $\mathcal{V}$, the goal of Uni-OWD is to design a detector $\mathcal{D}$ that satisfies the following objectives:
定义对象类别集为 $C = C_k \cup C_{unk}$,其中 $C_k$ 表示已知类别集,$C_{unk}$ 表示未知类别集,且 $C_k \cap C_{unk} = \emptyset$。给定一个输入图像 $\mathcal{I}$ 和一个词汇表 $\mathcal{V}$,Uni-OWD 的目标是设计一个满足以下目标的检测器 $\mathcal{D}$:
- For each category $c_k \in C_k$, represented by its text $\mathcal{T}_{c_k} \in \mathcal{V}$, the detector $\mathcal{D}$ should accurately predict the bounding boxes $\mathcal{B}_{c_k}$ and their associated category labels $c_k$ by $\mathcal{D}(\mathcal{I}, \mathcal{V}) \rightarrow \{(b, c_k) \mid b \in \mathcal{B}_{c_k}, c_k \in C_k\}$.
  对于每个类别 $c_k \in C_k$,由其文本 $\mathcal{T}_{c_k} \in \mathcal{V}$ 表示,检测器 $\mathcal{D}$ 应通过 $\mathcal{D}(\mathcal{I}, \mathcal{V}) \rightarrow \{(b, c_k) \mid b \in \mathcal{B}_{c_k}, c_k \in C_k\}$ 准确预测边界框 $\mathcal{B}_{c_k}$ 及其对应的类别标签 $c_k$。
- For objects belonging to $C_{unk}$, the detector should identify their bounding boxes $\mathcal{B}_{unk}$ and assign them the generic label “unknown” with a wildcard $\mathcal{T}_w$, such that $\mathcal{D}(\mathcal{I}, \mathcal{T}_w) \rightarrow \{(b, \text{unknown}) \mid b \in \mathcal{B}_{unk}\}$.
  对于属于 $C_{unk}$ 的对象,检测器应识别其边界框 $\mathcal{B}_{unk}$,并借助通配符 $\mathcal{T}_w$ 为其分配通用标签“unknown”,即 $\mathcal{D}(\mathcal{I}, \mathcal{T}_w) \rightarrow \{(b, \text{unknown}) \mid b \in \mathcal{B}_{unk}\}$。
- The detector can iteratively expand the known category set $C_k$ and vocabulary $\mathcal{V}$ by discovering new categories $C_{new}$ from $C_{unk}$, represented as $C_k^{t+1} = C_k^t \cup C_{new}$.
  检测器可以通过从 $C_{unk}$ 中发现新类别 $C_{new}$ 来迭代扩展已知类别集 $C_k$ 和词汇表 $\mathcal{V}$,表示为 $C_k^{t+1} = C_k^t \cup C_{new}$。
The Uni-OWD framework is designed to develop a detector that leverages a textual vocabulary and a wildcard to identify both known and unknown object categories within an image, combining the strengths of open-vocabulary and open-world detection tasks. It ensures precise detection and classification for known categories while assigning a generic “unknown” label to unidentified objects. This design promotes adaptability and scalability, making it wellsuited for dynamic and real-world applications.
Uni-OWD框架旨在开发一种检测器,利用文本词汇表和通配符来识别图像中已知和未知的物体类别,结合了开放词汇和开放世界检测任务的优点。它确保对已知类别进行精确检测和分类,同时为未识别物体分配通用的"未知"标签。这种设计增强了适应性和可扩展性,使其非常适合动态的真实世界应用场景。
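To make the Uni-OWD objectives above concrete, here is a minimal interface sketch in Python; the class name, method names, and the dummy return values are illustrative assumptions and do not come from the released implementation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates


@dataclass
class UniOWDDetector:
    """Hypothetical sketch of the Uni-OWD interface described above."""
    vocabulary: List[str] = field(default_factory=list)  # known class texts T_ck in V
    wildcard: str = "unknown"                            # generic wildcard label T_w

    def detect(self, image) -> List[Tuple[Box, str]]:
        """Return boxes labeled with a known class from the vocabulary, or with the
        generic wildcard label for objects outside the vocabulary (C_unk)."""
        return []  # placeholder: the real model produces these in a single forward pass

    def expand_vocabulary(self, new_classes: List[str]) -> None:
        """C_k^{t+1} = C_k^t ∪ C_new: newly discovered class names simply become
        additional text prompts; no incremental re-training of the detector is needed."""
        self.vocabulary.extend(c for c in new_classes if c not in self.vocabulary)


# usage sketch
detector = UniOWDDetector(vocabulary=["person", "car"])
detector.expand_vocabulary(["traffic cone"])   # classes discovered from "unknown" boxes
print(detector.vocabulary)                     # ['person', 'car', 'traffic cone']
```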
3.2. Efficient Adaptive Decision Learning 高效自适应决策学习
Designing a universal open-world object detection model suitable for deployment on edge and mobile devices demands a strong emphasis on efficiency. Traditional open-vocabulary detection models 15, 23, 25, 27 align text and image modalities by introducing fine-grained fusion operations in the early layers. Then they rely on contrastive learning for both modalities to establish decision boundaries for object classification, enabling the model to adapt dynamically to novel classes during inference by leveraging new textual inputs.
设计一个适用于边缘和移动设备的通用开放世界目标检测模型需要高度重视效率。传统的开放词汇检测模型 15, 23, 25, 27 通过在早期层引入细粒度融合操作来对齐文本和图像模态。然后它们依赖两种模态的对比学习来建立目标分类的决策边界,使模型能够通过利用新的文本输入在推理过程中动态适应新类别。
YOLO-World 15 proposed an efficient architecture, RepVL-PAN, to perform image-text fusion through reparameterization. Despite its advancements, the model’s inference speed is still heavily influenced by the number of textual class inputs. This poses a challenge for low-compute devices, where performance degrades sharply as the number of text inputs increases, making it unsuitable for real-time detection tasks in complex, multi-class scenarios. To address this, we propose an adaptive decision learning strategy (AdaDL) to eliminate the heavy early-layer fusion operation.
YOLO-World 15提出了一种高效架构RepVL-PAN,通过重参数化实现图文融合。尽管取得了进展,该模型的推理速度仍受文本类别输入数量的严重影响。这对低算力设备构成挑战:随着文本输入量增加,性能急剧下降,使其难以胜任复杂多类别场景的实时检测任务。为此,我们提出自适应决策学习策略(AdaDL)来消除繁重的早期层融合操作。
During the construction of decision boundaries, most existing methods freeze the text encoder and rely on pretrained models, such as BERT 57 or CLIP 13, to extract textual features for interaction with visual features. Without a fusion structure, the text features struggle to capture image-related information dynamically, leading to suboptimal multimodal decision boundary construction when adjustments are made solely to the image features. To overcome this, our AdaDL strategy aims to enhance the decision representation during training for the Uni-OWD scenario. Specifically, during training, we introduce efficient parameters into the text encoder by incorporating Low-Rank Adaptation (LoRA) into all query, key, value and output projection layers, which can be described as:
在构建决策边界的过程中,现有方法大多会冻结文本编码器并依赖预训练模型(如 BERT 57 或 CLIP 13)提取文本特征以与视觉特征交互。由于缺乏融合结构,文本特征难以动态捕捉图像相关信息,导致仅调整图像特征时多模态决策边界的构建效果欠佳。为解决这一问题,我们提出的AdaDL策略旨在增强Uni-OWD场景下训练时的决策表征。具体而言,在训练阶段,我们通过将低秩自适应(LoRA)引入所有查询、键、值和输出投影层,向文本编码器注入高效参数,该过程可表述为:
$$h = W'x = W_0 x + \Delta W x \tag{1}$$
where $W_0$ represents the pretrained weights of the CLIP text encoder, and $\Delta W$ is the product of two low-rank matrices. The model's input and output are $x$ and $h$. The rank is set to a value much smaller than the model's feature dimension. This strategy ensures that the pre-trained parameters of the text encoder remain unchanged, while the low-rank matrices dynamically store information related to cross-modality interactions during training. By continuously calibrating the outputs of the text encoder, this method allows the decision boundaries constructed by both modalities to adapt more effectively to each other. In practice, the calibrated text embeddings can be precomputed and stored offline for inference, thereby avoiding the computational cost of the text encoder.
其中,$W_0$ 表示CLIP文本编码器的预训练权重,$\Delta W$ 由两个低秩矩阵的乘积构成。模型的输入和输出分别为 $x$ 与 $h$。秩的取值远小于模型特征维度。该策略确保文本编码器的预训练参数保持不变,同时低秩矩阵在训练期间动态存储跨模态交互相关信息。通过持续校准文本编码器的输出,此方法使双模态构建的决策边界能更有效地相互适配。实际应用中,经过校准的文本嵌入可在推理阶段预先计算并离线存储,从而规避文本编码器的计算开销。
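As a rough illustration of Eq. (1), the PyTorch sketch below wraps a frozen linear projection $W_0$ with a trainable low-rank update $\Delta W = BA$; it is written from the equation itself rather than taken from the released code, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """h = W0 x + ΔW x, with ΔW the product of two low-rank matrices (Eq. 1)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # pretrained CLIP weights W0 stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # ΔW = 0 at init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


# wrap a (toy) projection layer, e.g. a query/key/value/output projection of the text encoder
proj = LoRALinear(nn.Linear(512, 512), rank=16)
tokens = torch.randn(2, 77, 512)   # e.g. a batch of CLIP text token features
print(proj(tokens).shape)          # torch.Size([2, 77, 512]); only A and B are trainable
```

After training, the calibrated class embeddings can be computed once and cached, so neither the text encoder nor the LoRA branches are needed at inference time.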
YOLOv10 as the efficient object detector. To improve efficiency, we incorporate the proposed adaptive decision learning strategy into the recent advanced YOLOv10 10 as the efficient object detector. We employ a multimodal dual-head match to adapt the decision boundary for both classification heads in YOLOv10. Specifically, during the region-text contrastive learning between the region anchors and the class texts, we refine the region embeddings from the two heads by aligning them with shared, semantically rich text representations, enabling seamless end-to-end training and inference. Furthermore, we integrate a consistent dual alignment strategy for region contrastive learning, where the dual-head matching process is formalized as:
YOLOv10作为高效目标检测器。为提升效率,我们将提出的自适应决策学习策略融入当前先进的YOLOv10框架中作为高效目标检测器。我们采用多模态双头匹配机制来适配YOLOv10中两个分类头的决策边界。具体而言,在区域锚点与类别文本之间进行区域-文本对比学习时,通过将双头输出的区域嵌入与共享的语义丰富文本表征对齐,从而精炼区域嵌入表示,实现无缝的端到端训练与推理。此外,我们还整合了区域对比学习的双头一致性对齐策略,其双头匹配过程可形式化表示为:
$$m(\alpha, \beta) = s^{\alpha} \times u^{\beta} \tag{2}$$
where $u$ represents the IoU value between the predicted box and the ground-truth box, and $s$ is the classification score obtained from multi-modal information, which is derived as:
其中 $u$ 代表预测框与真实框之间的IoU值,$s$ 是通过多模态信息获得的分类分数,其计算公式为:
$$s = \mathrm{sim}(I, T) \tag{3}$$
where $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity, $T$ is the embedding of the text $\mathcal{T} \in \mathcal{V}$, and $I$ is the pixel-level feature from the image $\mathcal{I}$. To ensure a minimal supervision gap between the two heads during multimodal dual-head matching, we adopt consistent settings, where $\alpha_{o2o} = \alpha_{o2m}$ and $\beta_{o2o} = \beta_{o2m}$. This allows the one-to-one head to effectively learn consistent supervisory signals with the one-to-many head.
其中 $\mathrm{sim}(\cdot, \cdot)$ 表示余弦相似度,$T$ 代表文本 $\mathcal{T} \in \mathcal{V}$ 的嵌入向量,$I$ 表示图像 $\mathcal{I}$ 的像素级特征。为确保多模态双头匹配过程中两个头之间的监督差距最小化,我们采用一致性设置,令 $\alpha_{o2o} = \alpha_{o2m}$ 且 $\beta_{o2o} = \beta_{o2m}$。这使得一对一匹配头能够有效地学习与一对多匹配头相一致的监督信号。
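For intuition, the sketch below spells out the matching metric of Eqs. (2)-(3) in PyTorch. Tensor shapes, the clamping of the cosine score, and the particular $\alpha$, $\beta$ values are illustrative assumptions rather than the authors' implementation; the key point is that the same metric, with shared $\alpha$ and $\beta$, scores both the one-to-one and one-to-many assignments.

```python
import torch
import torch.nn.functional as F


def matching_metric(region_emb, text_emb, iou, alpha=0.5, beta=6.0):
    """m(α, β) = s^α · u^β (Eq. 2), with s = sim(I, T) the cosine similarity (Eq. 3).

    region_emb: (N, D) region features from a detection head
    text_emb:   (C, D) calibrated class text embeddings (precomputed offline)
    iou:        (N, C) or broadcastable IoU values u with the ground-truth boxes
    """
    s = F.cosine_similarity(region_emb.unsqueeze(1), text_emb.unsqueeze(0), dim=-1)
    s = s.clamp(min=0)                   # keep the score non-negative before exponentiation
    return s.pow(alpha) * iou.pow(beta)  # shared α, β keep both heads' supervision consistent


# toy usage: 4 region anchors, 3 text classes, 512-d embeddings
m = matching_metric(torch.randn(4, 512), torch.randn(3, 512), torch.rand(4, 1))
print(m.shape)  # torch.Size([4, 3])
```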
图4. 已知/通配类别学习流程。先前已知类别的文本嵌入保持冻结,而当前已知类别的嵌入通过真实标签进行微调。"未知"通配类别由经过良好调优的通配预测生成的伪标签进行监督。图中显示经过调优的通配预测分数,以及与已知类别真实标注框具有低置信度分数或高IoU值(虚线框)的预测框会被过滤掉。
As a result, the calibrated text encoder and YOLO structure can operate entirely independently in the early stages, eliminating the need for fusion operations while efficiently adapting to better multimodal decision boundaries.
因此,校准后的文本编码器和YOLO结构在早期阶段可以完全独立运行,无需融合操作,就能高效适应更优的多模态决策边界。
3.3. Open-World Wildcard Learning 开放世界通配符学习
In the previous section, we introduced AdaDL to improve the efficiency of open-vocabulary object detection, mitigating the impact of large class-text inputs on inference latency while improving performance. This strategy enables real-world applications to expand the vocabulary while maintaining high efficiency, covering as many objects as possible. However, open-vocabulary models inherently rely on predefined vocabularies to detect and classify objects, which limits their capability in real-world scenarios. Some objects are difficult to predict or describe using textual inputs, making it challenging for open-vocabulary models to detect these out-of-vocabulary instances.
在前一部分中,我们介绍了AdaDL以提高开放词汇目标检测的效率,在减轻大输入类别文本对推理延迟影响的同时提升了性能。该策略使实际应用能够在保持高效率的同时扩展词汇量,尽可能覆盖更多目标。然而,开放词汇模型本质上依赖于预定义词汇来检测和分类目标,这限制了其在现实场景中的能力。某些目标难以通过文本输入进行预测或描述,使得开放词汇模型检测这些词汇外实例具有挑战性。
To address this, we propose a wildcard learning approach that enables the model to detect objects not present in the vocabulary and label them as “unknown” rather than ignoring them. Specifically, we directly leverage a wildcard embedding to unlock the generic power of the open-vocabulary model. As shown in Tab. 4, after the decision adaptation, the wildcard $\mathcal{T}_w$ (e.g., “object”) demonstrates remarkable capability in capturing unknown objects within a scene in a zero-shot manner. To further enhance its effectiveness, we fine-tune its text embedding on the pretraining dataset for a few epochs. During this process, all ground-truth instances are treated as belonging to the same “object” class. This fine-tuning enables the embedding to capture richer semantics, empowering the model to identify objects that might have been overlooked by the predefined specific classes.
为此,我们提出了一种通配符学习方法,使模型能够检测词汇表中未出现的对象,并将其标记为“未知”而非忽略。具体而言,我们直接利用通配符嵌入来释放开放词汇模型的泛化能力。如表4所示,经过决策适配后,通配符 $\mathcal{T}_w$(如“object”)以零样本方式展现出捕捉场景中未知物体的卓越能力。为进一步提升效果,我们在预训练数据集上对其文本嵌入进行了少量轮次的微调。在此过程中,所有真实实例均被视为属于同一个“object”类别。这种微调使嵌入能够捕获更丰富的语义,从而让模型可以识别那些可能被预定义特定类别所遗漏的物体。
To avoid duplicate predictions for known classes, we utilize this well-tuned wildcard embedding $T_{obj}$ to teach an “unknown” wildcard embedding $T_{unk}$. The “unknown” wildcard is trained in a self-supervised manner without relying on ground-truth labels of the “unknown” class. As shown in Fig. 4, predictions that have the highest similarity score with $T_{obj}$ across all known class embeddings are used as pseudo-label candidates. To further refine these candidates, we introduce a simple selection process:
为了避免对已知类别进行重复预测,我们利用这个经过良好调整的通配符嵌入 $T_{obj}$ 来指导“未知”通配符嵌入 $T_{unk}$ 的训练。该“未知”通配符以自监督的方式进行训练,无需依赖“未知”类别的真实标注标签。如图4所示,在所有已知类别嵌入中与 $T_{obj}$ 相似度得分最高的预测被用作伪标签候选。为了进一步优化这些候选结果,我们引入了一个简单的筛选流程:
$$\Phi(s, u) = \begin{cases} 1, & \text{if } (u < \sigma_1) \wedge (s > \sigma_2) \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
where $u$ is the maximum IoU between predictions and known-class ground-truth boxes. Predictions with $u$ below a threshold $\sigma_1$ and classification score $s$ above a threshold $\sigma_2$ are selected, and these remaining predictions are assigned to $T_{unk}$ as target labels.
其中,$u$ 表示预测框与已知类别真实框之间的最大交并比(IoU)。$u$ 低于阈值 $\sigma_1$ 且分类分数 $s$ 高于阈值 $\sigma_2$ 的预测框将被选中,这些保留下来的预测框被分配给 $T_{unk}$ 作为目标标签。
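A minimal sketch of the selection rule $\Phi$ in Eq. (4) is given below (PyTorch; the threshold values match Sec. 4.3, everything else is illustrative): a prediction becomes a pseudo label for the “unknown” wildcard only if its maximum IoU with known-class ground truth stays below $\sigma_1$ and its score under the tuned wildcard $T_{obj}$ exceeds $\sigma_2$.

```python
import torch


def select_unknown_pseudo_labels(s_obj: torch.Tensor, u_obj: torch.Tensor,
                                 sigma1: float = 0.5, sigma2: float = 0.01) -> torch.Tensor:
    """Φ(s, u) = 1 iff (u < σ1) ∧ (s > σ2)  (Eq. 4).

    s_obj: (N,) similarity scores of the predictions with the tuned wildcard T_obj
    u_obj: (N,) maximum IoU of each prediction with known-class ground-truth boxes
    Returns a boolean mask; the selected predictions become targets for T_unk.
    """
    return (u_obj < sigma1) & (s_obj > sigma2)


mask = select_unknown_pseudo_labels(torch.tensor([0.30, 0.005, 0.60]),
                                    torch.tensor([0.10, 0.200, 0.80]))
print(mask)  # tensor([ True, False, False])
```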
For known classes, only their corresponding text embeddings $T_k$ from $\mathcal{T}_k \in \mathcal{V}$ are fine-tuned on downstream tasks by multimodal dual-head matching to enhance their similarity scores $s_k$ aligned with the target scores $o_k$. These embeddings are subsequently frozen to preserve performance and avoid degradation when new classes are introduced. Unlike traditional open-world methods [15, 21] relying on exemplar replay to do incremental learning, our method can avoid catastrophic forgetting without extra exemplars, as each text embedding is fine-tuned independently.
对于已知类别,仅通过多模态双头匹配在下游任务中微调其对应的文本嵌入 $T_k$($\mathcal{T}_k \in \mathcal{V}$),以增强与目标分数 $o_k$ 对齐的相似度得分 $s_k$。随后冻结这些嵌入以保持性能,并避免引入新类别时出现性能下降。与传统开放世界方法[15, 21]依赖样本回放进行增量学习不同,我们的方法无需额外样本即可避免灾难性遗忘,因为每个文本嵌入都是独立微调的。
Since $T_k$, $T_{obj}$ and $T_{unk}$ calculate similarity scores only in the frozen classification head, there is no loss from box regression, focusing exclusively on learning class-specific information. The soft target scores of $T_{unk}$ are directly derived from the similarity scores $s_{obj}$ of $T_{obj}$. Therefore, the fine-tuning loss is formulated as the combination of the current known loss and the unknown loss, ensuring the model learns effectively from both known and unknown categories during training:
由于 $T_k$、$T_{obj}$ 和 $T_{unk}$ 仅在冻结的分类头中计算相似度分数,因此没有来自边界框回归的损失,完全专注于学习类别特定信息。$T_{unk}$ 的软目标分数直接源自 $T_{obj}$ 的相似度分数 $s_{obj}$。因此,微调损失被表述为当前已知损失和未知损失的组合,确保模型在训练过程中能有效地从已知和未知类别中学习:
$$\mathcal{L} = \mathcal{L}_k(s_k, o_k) + \Phi(s_{obj}, u_{obj}) \cdot \mathcal{L}_{unk}(s_{unk}, s_{obj}) \tag{5}$$
where $s_{unk}$ is the prediction score from the “unknown” wildcard, and $\mathcal{L}$ represents the binary cross-entropy (BCE) loss. During inference, we employ a simple and efficient unknown filtering strategy $\mathcal{F}$ for unknown class predictions $P_{unk}$ that have a high IoU with confident known class predictions $P_k$, to further de-duplicate:
其中 $s_{unk}$ 表示来自“unknown”通配符的预测分数,$\mathcal{L}$ 代表二元交叉熵(BCE)损失函数。在推理阶段,我们采用简单高效的未知类别过滤策略 $\mathcal{F}$,对那些与高置信度已知类别预测 $P_k$ 具有高交并比(IoU)的未知类别预测 $P_{unk}$ 进行去重处理:
$$\mathcal{F}(P_{unk}) = \{p_u \in P_{unk} \mid \mathrm{IoU}(p_u, p_k) < \tau, \forall p_k \in P_k\} \tag{6}$$
where $\tau$ is the IoU threshold for unknown filtering.
其中 $\tau$ 是用于未知过滤的 IoU 阈值。
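Putting Eqs. (4)-(6) together, the sketch below shows one way the wildcard fine-tuning loss and the inference-time de-duplication could look in PyTorch. The BCE form of both loss terms follows the text, but the exact reduction, the clamping of the soft targets, and the helper names are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou


def wildcard_loss(s_k, o_k, s_unk, s_obj, u_obj, sigma1=0.5, sigma2=0.01):
    """L = L_k(s_k, o_k) + Φ(s_obj, u_obj) · L_unk(s_unk, s_obj)  (Eq. 5), with BCE terms.
    All scores are assumed to lie in [0, 1] (e.g. sigmoid / clamped similarities)."""
    loss_known = F.binary_cross_entropy(s_k, o_k, reduction="mean")
    phi = ((u_obj < sigma1) & (s_obj > sigma2)).float()               # Eq. 4 mask
    soft_targets = s_obj.detach().clamp(0.0, 1.0)                     # soft targets from T_obj
    loss_unknown = F.binary_cross_entropy(s_unk, soft_targets, reduction="none")
    return loss_known + (phi * loss_unknown).mean()


def filter_unknown(unk_boxes, known_boxes, tau=0.99):
    """F(P_unk) = {p_u | IoU(p_u, p_k) < τ for all confident known predictions p_k}  (Eq. 6)."""
    if known_boxes.numel() == 0 or unk_boxes.numel() == 0:
        return unk_boxes
    iou = box_iou(unk_boxes, known_boxes)        # (N_unk, N_known)
    keep = (iou < tau).all(dim=1)
    return unk_boxes[keep]


# toy usage: the first "unknown" box duplicates a confident known box and is removed
unk = torch.tensor([[0., 0., 10., 10.], [50., 50., 60., 60.]])
known = torch.tensor([[0., 0., 10., 10.]])
print(filter_unknown(unk, known))                # tensor([[50., 50., 60., 60.]])
```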
Subsequently, new categories can be discovered from the predictions of the unknown class, and their class names will be added to the vocabulary $\mathcal{V}$, where they serve as known classes for the next iteration.
随后,可以从未知类别的预测中发现新类别,其类别名称将被添加到词汇表 $\mathcal{V}$ 中,在下一次迭代中它们将作为已知类别使用。
4. Experiments 实验
4.1. Dataset 数据集
We evaluate our method on two distinct setups, targeting both OVD and OWOD. Our experiments leverage diverse datasets to comprehensively assess the model’s performance in detecting known and unknown objects.
我们在两种不同的设置下评估我们的方法,目标同时涵盖开放词汇检测(OVD)和开放世界目标检测(OWOD)。实验采用多样化数据集,全面评估模型在检测已知和未知物体方面的性能。
Open-Vocabulary Object Detection: For open-vocabulary detection, the model is trained on a combination of the Objects365 58 and GoldG 59 datasets, and evaluated on the LVIS 18 dataset. The LVIS dataset contains 1,203 categories, exhibiting a realistic long-tailed distribution of rare, common, and frequent classes. This setup focuses on evaluating the model’s capacity to align visual and language representations, detect novel and unseen categories, and generalize across a large-scale, long-tailed dataset.
开放词汇目标检测:在开放词汇检测任务中,模型使用 Objects365 58 和 GoldG 59 数据集的组合进行训练,并在 LVIS 18 数据集上进行评估。LVIS数据集包含1,203个类别,呈现出罕见类、常见类和频繁类的真实长尾分布。该实验设置重点评估模型在视觉与语言表征对齐、检测未见新类别以及在大规模长尾数据集上的泛化能力。
Open-World Object Detection: For open-world object detection, we evaluate our method on three established OWOD benchmarks:
M-OWODB: This benchmark combines the COCO 60 and PASCAL VOC 61 datasets, where known and unknown classes are mixed across tasks. It is divided into four sequential tasks. At each task, the model learns new classes while the remaining classes remain unknown.
S-OWODB: Based solely on COCO, this benchmark separates known and unknown classes by their superclass.
nu-OWODB: This benchmark is derived from 43 and based on the nuScenes dataset 21. This benchmark is specifically designed to evaluate the model’s capability in autonomous driving scenarios. The nu-OWODB captures the complexity of urban driving environments, including crowded city streets, challenging weather conditions, frequent occlusions, and dense traffic with intricate interactions between objects.
开放世界物体检测:针对开放世界物体检测任务,我们在三大标准OWOD基准测试集上评估方法性能:
M-OWODB基准:该基准融合了COCO60和PASCAL VOC61数据集,将已知类别与未知类别混合分布在各项任务中。整个基准划分为四个连续任务阶段,模型需在每个阶段学习新类别,同时其余类别保持未知状态。
S-OWODB基准:基于纯COCO数据集构建,该基准通过超类别划分已知与未知类别。
nu-OWODB基准:源自文献43并基于nuScenes数据集21构建,专为评估自动驾驶场景下的模型能力设计。该基准完整呈现城市驾驶环境的复杂性特征,包括:拥挤的街道、恶劣天气条件、频繁遮挡现象,以及物体间存在复杂交互关系的密集交通场景。
By incorporating these benchmarks, we assess the model’s ability to handle real-world OWOD challenges while maintaining robustness and scalability across diverse settings.
通过整合这些基准,我们评估模型在处理现实世界开放世界物体检测挑战时的能力,同时确保其在多样化场景中保持鲁棒性和可扩展性。
4.2. Evaluation Metrics 评估指标
Open-Vocabulary Evaluation: Similar to YOLO-World and other pre-trained models, we evaluate the zero-shot capability of our pre-trained model on the LVIS minival dataset, which contains the same images as the COCO validation set. For fair and consistent comparison, we use standard AP metrics to measure the model’s performance.
开放词汇评估:与YOLO-World及其他预训练模型类似,我们在LVIS迷你验证集上评估预训练模型的零样本能力,该数据集包含与COCO验证集相同的图像。为确保公平一致的比较,我们采用标准平均精度(AP)指标来衡量模型性能。
Open-World Evaluation: We adapt the pretrained open-vocabulary model to the open-world scenario, enabling it to recognize both known and unknown objects. For known objects, we use mAP as the evaluation metric. To further assess catastrophic forgetting during incremental tasks, the mAP is divided into previous known (PK) and current known (CK) categories. For unknown objects, since it is impractical to exhaustively annotate all remaining objects in the scene, we employ the Recall metric to evaluate the model’s ability to detect unknown categories. Additionally, WI 38 and A-OSE 38 are used to measure the extent to which unknown objects interfere with known object predictions. However, due to their instability, these metrics are provided for reference purposes only.
开放世界评估:我们将预训练的开词汇模型适配到开放世界场景,使其能够识别已知和未知物体。对于已知物体,我们采用mAP作为评估指标。为深入评估增量任务中的灾难性遗忘问题,mAP指标被细分为先前已知(PK)和当前已知(CK)类别。针对未知物体,由于对场景中所有剩余物体进行穷尽式标注不切实际,我们采用Recall指标来评估模型检测未知类别的能力。此外,WI38和A-OSE38指标用于衡量未知物体对已知物体预测的干扰程度,但这些指标存在不稳定性,仅作为参考依据提供。
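As a simple illustration of the unknown-class metric mentioned above, the sketch below computes a recall over annotated unknown ground-truth boxes; it only conveys the idea, while the exact matching protocol is defined by the benchmarks' official evaluation scripts.

```python
import torch
from torchvision.ops import box_iou


def unknown_recall(pred_unknown_boxes, gt_unknown_boxes, iou_thr=0.5):
    """Fraction of unknown ground-truth boxes covered by at least one 'unknown'
    prediction with IoU >= iou_thr (illustrative U-Recall sketch)."""
    if gt_unknown_boxes.numel() == 0:
        return 1.0
    if pred_unknown_boxes.numel() == 0:
        return 0.0
    iou = box_iou(gt_unknown_boxes, pred_unknown_boxes)   # (num_gt, num_pred)
    recalled = (iou.max(dim=1).values >= iou_thr).float()
    return recalled.mean().item()


gt = torch.tensor([[0., 0., 10., 10.], [20., 20., 30., 30.]])
pred = torch.tensor([[1., 1., 10., 10.]])
print(unknown_recall(pred, gt))  # 0.5: only the first unknown object is recovered
```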
表 1. 在LVIS minival数据集上的零样本评估。我们模型的平均精度(AP)分别展示了一对一头(左)和一对多头(右)的结果。所有速度测量均在V100 GPU上使用PyTorch(未启用TensorRT)进行。$FPS^f$ 表示不含后处理的前向传播速度。在非极大值抑制前,我们将预测框数量限制在3000以内以测量FPS。
4.3. Implementation Details 实现细节
Open-Vocabulary Detection: Our image detector follows YOLOv10 10, which provides an efficient design for dual-head training. Similar to YOLO-World 15, we utilize a pre-trained CLIP text encoder. However, we do not perform image-text fusion in the neck. Instead, we align the two modalities solely in the head using efficient adaptive decision learning. During pretraining, we incorporate low-rank matrices into all projection layers of the CLIP text encoder, with the rank of the matrices set to 16. Our pretraining is conducted on 8 GPUs with a batch size of 128. Both the YOLO model and the LoRA parameters of the text encoder are trained with an initial learning rate of $5 \times 10^{-4}$ and a weight decay of 0.025.
开放词汇检测:我们的图像检测器采用YOLOv10 10框架,该框架为双头训练提供了高效的设计方案。与YOLO-World 15类似,我们使用了预训练的CLIP文本编码器。但不同于在颈部网络进行图文融合的做法,我们仅通过头部网络采用高效自适应决策学习来实现双模态对齐。在预训练阶段,我们将低秩矩阵融入CLIP文本编码器的所有投影层,矩阵秩设置为16。预训练使用8块GPU执行,批处理大小为128。YOLO模型和文本编码器的LoRA参数均采用初始学习率 $5 \times 10^{-4}$ 和权重衰减系数 0.025 进行训练。
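One possible way to realize the stated LoRA setup (rank 16 on all attention projection layers of the CLIP text encoder) is sketched below with Hugging Face Transformers and PEFT; the checkpoint name and the use of these libraries are assumptions for illustration, not the paper's training code.

```python
from transformers import CLIPTextModel
from peft import LoraConfig, get_peft_model

# Load a pretrained CLIP text encoder (checkpoint chosen for illustration only).
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

lora_cfg = LoraConfig(
    r=16,                                                        # rank, much smaller than the feature dim
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],   # query/key/value/output projections
    lora_dropout=0.0,
)
text_encoder = get_peft_model(text_encoder, lora_cfg)
text_encoder.print_trainable_parameters()  # only the low-rank matrices are trainable
```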
Open-World Detection: All the wildcard embeddings are initialized from text features extracted from a generic text, “object”, by our calibrated text encoder. We use the same training datasets employed for open-vocabulary pretraining to fine-tune the wildcard embedding $T_{obj}$. Specifically, the wildcard embedding is trained for 3 epochs with a learning rate of $1 \times 10^{-4}$. Using the well-tuned wildcard as an anchor, the learning rate for fine-tuning the known and unknown class embeddings is set to $1 \times 10^{-3}$, with weight decay set to 0. All the other parts of the model are frozen, and mosaic augmentation is not applied during this stage.
开放世界检测:所有通配符嵌入均通过我们校准的文本编码器从通用文本“object”提取的文本特征初始化。我们采用与开放词汇预训练相同的训练数据集对通配符嵌入 $T_{obj}$ 进行微调。具体而言,通配符嵌入以 $1 \times 10^{-4}$ 的学习率训练 3 个周期。以优化后的通配符作为锚点,微调已知类和未知类嵌入的学习率设置为 $1 \times 10^{-3}$,权重衰减设为 0。模型其余部分参数冻结,此阶段不应用马赛克数据增强。
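The sketch below shows how a wildcard embedding could be initialized from the generic text “object” and turned into a learnable parameter; the Hugging Face CLIP checkpoint used here stands in for the paper's AdaDL-calibrated text encoder and is only an assumption.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# Encode the generic prompt "object" to obtain the initial wildcard feature T_obj.
tokens = tokenizer(["object"], padding=True, return_tensors="pt")
with torch.no_grad():
    t_obj = text_model(**tokens).text_embeds                 # (1, 512) projected text feature
t_obj = torch.nn.functional.normalize(t_obj, dim=-1)

# The wildcard becomes a learnable embedding fine-tuned for a few epochs,
# while the rest of the detector stays frozen.
wildcard_embedding = torch.nn.Parameter(t_obj.clone())
print(wildcard_embedding.shape)                               # torch.Size([1, 512])
```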
For training the “unknown” wildcard, pseudo-labels are selected based on an IoU threshold $\sigma_1 = 0.5$ and a score threshold $\sigma_2 = 0.01$. During inference, known class predictions with scores greater than 0.2 are treated as confident predictions, and $\tau = 0.99$. For known class detection, predictions with scores below 0.05 are filtered out by default.
为训练“unknown”通配符类别,基于IoU阈值 $\sigma_1 = 0.5$ 和分数阈值 $\sigma_2 = 0.01$ 选择伪标签。在推理阶段,分数高于 0.2 的已知类别预测被视为置信预测,阈值 $\tau = 0.99$。对于已知类别检测,默认过滤掉分数低于 0.05 的预测结果。
All fine-tuning experiments are conducted on 8 GPUs, with a batch size of 16 per GPU. Notably, all open-world experiments are evaluated using the one-to-one head, which does not require NMS operations for post-processing.
所有微调实验均在8块GPU上进行,每块GPU的批次大小为16。值得注意的是,所有开放世界实验均采用一对一头部进行评估,这种设计无需进行非极大值抑制(NMS)后处理操作。
4.4. Quantitative Results 定量结果
Tab. 1 demonstrates that the model with efficient adaptive decision learning achieves significant zero-shot performance improvements on the LVIS benchmark, outperforming recent real-time state-of-the-art open-vocabulary models 15, 25, 62. For the small model (-S), we observe that using predictions from the one-to-one head alone improves the detection performance for rare classes by 6.4% and common classes by 3.2%. Furthermore, employing the one-to-many head with NMS achieves even greater performance gains. This clearly demonstrates that the multimodal decision boundaries are fully constructed during pretraining by incorporating AdaDL. Additionally, leveraging the efficient model architecture and the nature of end-to-end detection, our approach gains faster speed and eliminates the need for NMS during inference, making it highly efficient for real-world applications.
表1表明,采用高效自适应决策学习的模型在LVIS基准测试中实现了显著的零样本性能提升,优于当前最先进的实时开放词汇模型 15, 25, 62。对于小型模型(-S)而言,仅使用一对一检测头的预测结果就使稀有类别的检测性能提升了 6.4%,常见类别提升了 3.2%。而采用带非极大值抑制的一对多检测头结构则能获得更显著的性能提升。这充分证明在预训练过程中,通过整合AdaDL已完整构建了多模态决策边界。此外,凭借高效模型架构和端到端检测特性,我们的方法在推理时具有更快的速度且无需非极大值抑制,使其在实际应用中极具效率优势。
To address open-world demands, we adapt our well-adapted open-vocabulary model to recognize unknown classes that are not present in the predefined vocabulary through wildcard learning. As shown in Tab. 2, the open-vocabulary model demonstrates outstanding performance in open-world scenarios due to its rich knowledge. Through our wildcard learning strategy, the model achieves superior performance in both unknown and known class recognition compared to traditional open-world methods. Moreover, it outperforms recent open-world detection models that leverage pre-trained models 63, 43, 44. Notably, our simpler and more efficient approach surpasses the state-of-the-art OVOW model 43, which is also based on the YOLO-World structure. Our method achieves a significant improvement in unknown recall and known mAP, demonstrating its effectiveness and robustness in open-world detection tasks. Furthermore, we evaluated the model’s capability in real-world autonomous driving scenarios. As shown in Tab. 3, our model, using a simpler approach, achieves superior unknown detection performance compared to the other methods.
为满足开放世界的需求,我们通过通配符学习技术,将适应性强的开放词汇模型拓展至识别预定义词汇表之外的未知类别。如表2所示,得益于丰富的先验知识,该开放词汇模型在开放世界场景中展现出卓越性能。与传统开放世界方法相比,我们的通配符学习策略使模型在已知和未知类别识别上均取得优势,其表现甚至超过了近期采用预训练模型的开放世界检测方案63, 43, 44。值得注意的是,我们这种更简洁高效的方法超越了同样基于YOLO-World架构的前沿OVOW模型43,在未知类别召回率和已知类别mAP指标上实现显著提升,充分证明了其在开放世界检测任务中的有效性与鲁棒性。此外,我们在真实自动驾驶场景中的评估表明(表3),采用更简单方法的模型相较其他方案具有更优异的未知目标检测性能。
表2. M-OWODB(上)和S-OWODB(下)的OWOD结果。比较未知类别召回率(U-Recall)和已知类别的平均精度均值(mAP)。我们的方法优于传统模型和利用预训练知识的模型。OVOW*代表使用YOLO-Worldv2-S复现的版本,以确保与我们的模型在相同规模下进行公平比较。
表3. nu-OWODB评估结果。我们的方法在所有未知指标上均优于其他方法,展现了其在实际应用中的强大适应性。OVOW*代表基于YOLO-Worldv2-S复现的模型。
Benefiting from AdaDL and wildcard learning strategies, our model captures a broader range of unknown objects through wildcard embeddings while maintaining accurate recognition of known categories. Notably, as the model scales up, the capability of model to detect known and unknown objects increases progressively, which shows the effectiveness of our methods at different model scale.
得益于AdaDL和通配符学习策略,我们的模型通过通配符嵌入捕获更广泛的未知对象,同时保持对已知类别的准确识别。值得注意的是,随着模型规模的扩大,模型检测已知和未知对象的能力逐步提升,这表明我们的方法在不同模型规模下均具有有效性。
4.5. Ablation Study 消融研究
Open-Vocabulary Detection: We conducted a series of ablation studies on the small-scale model to evaluate the impact of image-text fusion. Due to differences in experimental settings, we first reproduced YOLO-Worldv2-S under our setup. Interestingly, as shown in Tab. 5, our findings reveal that a smaller batch size and learning rate yield better pretraining performance, particularly improving detection for frequent classes by 2.2%. Building on this, we removed the VL-PAN structure and observed that the model’s detection accuracy remains largely unaffected. Notably, it demonstrated improved generalization for rare classes. Replacing YOLO-World’s YOLOv8 structure with YOLOv10 and using a dual-head match demonstrated that the one-to-many head benefits more from these changes, achieving improved performance over YOLO-World. However, the one-to-one head still struggled with alignment, particularly in rare class detection. To address this, we calibrate the text encoder with AdaDL, making both the image and text encoders learn decision boundaries simultaneously, which attains significant improvements.
开放词汇检测:我们在小规模模型上开展了一系列消融实验,以评估图文融合的影响。由于实验设置差异,我们首先在本研究环境下复现了YOLO-Worldv2-S模型。有趣的是,如表5所示,实验发现较小的批量大小和学习率能带来更好的预训练效果,尤其使高频类别的检测准确率提升 2.2%。基于此,我们移除了VL-PAN结构后观察到模型检测精度基本不受影响,反而显著提升了稀有类别的泛化能力。将YOLO-World的YOLOv8结构替换为YOLOv10并采用双头匹配机制表明,一对多检测头从中获益更多,性能表现超越原YOLO-World模型。但一对一检测头仍存在对齐困难,尤其在稀有类别检测方面。为此,我们采用AdaDL方法对文本编码器进行校准,使图像和文本编码器能同步学习决策边界,从而获得显著性能提升。
表5. 预训练设置消融实验。*表示我们实验设置中的复现版本。w/o VL-PAN表示在YOLO-Worldv2中去除RepVL-PAN结构。
As shown in Tab. 6, we compare the different methods for AdaDL to calibrate the text encoder. Performing full fine-tuning improves overall accuracy but reduces performance on rare classes, likely due to overfitting. We assume this is caused by the large gap between the number of image and text training parameters. Introducing parameter-efficient methods like prompt tuning 53 and deep prompt tuning 49 significantly improved alignment, enabling the one-to-one head to match the one-to-many head in performance. And as the training parameters increase, the performance also improves. Finally, using LoRA for the text encoder across all projection layers further adapts the text information to be region-aware. This approach yielded the best overall results and was adopted for our final experiments.
如表6所示,我们比较了AdaDL校准文本编码器的不同方法。执行全参数微调虽能提升整体准确率,但会降低稀有类别的性能,这可能是由于过拟合所致。我们推测这种现象源于图像与文本训练参数量级间的巨大差距。引入提示词微调53和深度提示微调49等参数高效方法后,文本-图像对齐效果显著提升,使一对一检测头的性能能够与一对多检测头相匹配。随着训练参数增加,模型表现持续改善。最终,我们在所有投影层对文本编码器采用LoRA技术,使文本信息进一步具备区域感知能力。该方案取得了最佳综合效果,并被采用为最终实验方案。
表6. AdaDL方法消融实验。我们对自适应决策学习的不同方法进行了消融分析。
Open-World Detection: We compared the performance of a closed-set YOLOv10 trained with unknown class labels (oracle) and the zero-shot performance of our open-vocabulary model on the M-OWODB dataset. The results in Tab. 4 show that, even in a zero-shot setting, our open-vocabulary model achieves higher known class accuracy than the oracle-trained YOLOv10 model. Moreover, when we simply use the vanilla “object” as text input, it achieves better unknown recall than traditional OWOD methods, which further validates the effectiveness of our open-vocabulary method. By applying our wildcard embeddings, the model’s unknown detection capability is fully unlocked, surpassing the performance of models trained with oracle supervision on unknown labels across different tasks. And as the model scales up, its ability to detect known and unknown classes increases simultaneously.
开放世界检测:我们在M-OWODB数据集上比较了使用未知类别标签(oracle)训练的闭集YOLOv10模型与开放词汇模型的零样本性能。如表4所示,即使在零样本设置下,我们的开放词汇模型也取得了比oracle训练的YOLOv10模型更高的已知类别准确率。更重要的是,当我们仅使用简单的"object"作为文本输入时,其未知类别召回率已优于传统开放世界目标检测方法,这进一步验证了我们开放词汇方法的有效性。通过应用通配符嵌入技术,模型对未知物体的检测能力得到全面释放,在不同任务中都超越了使用oracle监督训练的模型性能。随着模型规模的扩大,其对已知和未知类别的检测能力实现了同步提升。
4.6. Qualitative Result 定性结果
For the open-vocabulary model, we input the 1,203 category names from the LVIS dataset as prompts, comparing the zero-shot performance on LVIS with YOLO-Worldv2, as shown in Fig. 5. It shows that our AdaDL strategy enhances the model’s decision boundaries to detect objects of varying sizes, distances, or those partially occluded, with higher confidence scores. Moreover, the improved alignment between visual and calibrated semantic information enables the model to correctly classify detected objects, capturing more diverse categories.
对于开放词汇模型,我们将LVIS数据集中的1203个类别名称作为提示输入,与YOLO-Worldv2在LVIS上的零样本性能进行比较,如图5所示。实验表明,我们的AdaDL策略增强了模型的决策边界,使其能够以更高的置信度分数检测各类尺寸、距离或被部分遮挡的物体。此外,改进后的视觉信息与校准语义信息对齐机制,使模型能更准确地对检测物体进行分类,从而识别更多样化的类别。
图5. LVIS零样本推理可视化结果。我们展示了YOLO-Worldv2与我们的模型在小规模(-S)下使用LVIS 1203个类别名称作为文本提示的可视化结果。采用我们策略预训练的模型展现出卓越能力:既能检测复杂场景中的物体,又能识别更广泛的新颖类别。
In Fig. 6, we compare the performance of an open-vocabulary model using text embeddings for all 80 known classes in the M-OWODB dataset with our model, which uses text embeddings for only half of the known classes (similar to the Task 2 scenario) and an extra “unknown” wildcard to detect unknown objects. The results demonstrate that our model not only identifies the remaining 40 unknown classes without corresponding text inputs but also detects additional objects. This indicates that the “unknown” wildcard effectively retains the rich semantic knowledge from pretraining while learning downstream task-specific knowledge, showcasing strong generalization capabilities that align with real-world requirements.
在图6中,我们将使用M-OWODB数据集中全部80个已知类别文本嵌入的开放词汇模型性能,与我们提出的模型进行了对比。我们的模型仅使用半数已知类别的文本嵌入(类似任务2场景),并增加了一个"未知"通配符来检测未知物体。实验结果表明,该模型不仅能识别没有对应文本输入的剩余40个未知类别,还能检测到更多物体。这表明"未知"通配符在保持预训练丰富语义知识的同时,能有效学习下游任务特定知识,展现出符合实际应用需求的强大泛化能力。
图6. M-OWODB数据集上的可视化结果。与使用全部80个类别提示词的开放词汇模型相比,我们扩展至开放世界的方法仅采用40个类别嵌入向量,并额外增加了一个"unknown"通配符。
5. Conclusion 结论
In this work, we propose Universal Open-World Object Detection (Uni-OWD), a new paradigm to tackle the challenges of dynamic object categories and unknown target recognition within a unified framework. To address this, we introduce YOLO-UniOW, an efficient solution based on the YOLO detector. Our framework incorporates several innovative strategies: the Adaptive Decision Learning (AdaDL) strategy, which seamlessly adapts decision boundaries for Uni-OWD tasks, and Wildcard Learning, which uses an “unknown” wildcard embedding to enable the detection of unknown objects, supporting iterative vocabulary expansion without incremental learning. Extensive experiments across benchmarks for both open-vocabulary and open-world object detection validate the effectiveness of our approach. The results demonstrate that YOLO-UniOW significantly outperforms state-of-the-art methods, offering a versatile and superior solution for open-world object detection. This work highlights the potential of our framework for real-world applications, paving the way for further advancements in this evolving field.
在这项工作中,我们提出了通用开放世界目标检测(Uni-OWD)新范式,通过统一框架解决动态物体类别与未知目标识别的双重挑战。为此,我们基于YOLO检测器开发了高效解决方案YOLO-UniOW。该框架整合了多项创新策略:自适应决策学习(AdaDL)策略能无缝调整Uni-OWD任务的决策边界;通配符学习采用"未知"嵌入向量实现未知物体检测,支持无需增量学习的词汇表迭代扩展。我们在开放词汇和开放世界目标检测基准上的大量实验验证了该方法的有效性。结果表明YOLO-UniOW显著优于现有最优方法,为开放世界目标检测提供了通用且卓越的解决方案。本工作凸显了该框架在现实应用中的潜力,为这一演进领域的持续发展开辟了新路径。
References 参考资料
Leqi Shen, Tao He, Sicheng Zhao, Zhelun Shen, Yuchen Guo, Tianshi Xu, and Guiguang Ding. X-reid: Cross-instance transformer for identity-level person reidentification. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024. 1 ↩︎ ↩︎
Fan Yang, Xinhao Xu, Hui Chen, Yuchen Guo, Yuwei He, Kai Ni, and Guiguang Ding. Gpro3d: Deriving 3d bbox from ground plane in monocular 3d object detection. Neurocomputing, 562:126894, 2023. 1 ↩︎ ↩︎
Yuchen Guo, Yuwei He, Jinhao Lyu, Zhanping Zhou, Dong Yang, Liangdi Ma, Hao-tian Tan, Changjian Chen, Wei Zhang, Jianxing Hu, et al. Deep learning with weak annotation from diagnosis reports for detection of multiple head disorders: a prospective, multicentre study. The Lancet Digital Health, 4(8):e584–e593, 2022. 1 ↩︎ ↩︎
Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. 1 ↩︎ ↩︎
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. 1 ↩︎ ↩︎
Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Doll ́ar. Focal loss for dense object detection.CoRR, abs/1708.02002, 2017. 1 ↩︎ ↩︎
Alexey Bochkovskiy, Chien-Yao Wang, and HongYuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. CoRR, abs/2004.10934, 2020. 1 ↩︎ ↩︎
Glenn Jocher, Jing Qiu, and Ayush Chaurasia. Ultralytics Yolov8. https://github.com/ultralytics/ ultralytics, 2023. 1, 2 ↩︎ ↩︎ ↩︎ ↩︎
Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018. 1 ↩︎ ↩︎
Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-toend object detection, 2024. 1, 2, 5, 7 ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12655–12663, 2020. 2 ↩︎ ↩︎
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021. 2 ↩︎ ↩︎
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 2, 3, 4 ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389, 2023. 2 ↩︎ ↩︎
Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 2, 3, 4, 7 ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13733–13742, 2021. 2 ↩︎ ↩︎
Ao Wang, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Repvit: Revisiting mobile cnn from vit perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15909–15920, 2024. 2 ↩︎ ↩︎
Agrim Gupta, Piotr Doll ́ar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation, 2019. 2, 6 ↩︎ ↩︎ ↩︎ ↩︎
Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7):1757–1772, 2012. 2 ↩︎ ↩︎
Akshita Gupta, Sanath Narayan, K J Joseph, Salman Khan, Fahad Shahbaz Khan, and Mubarak Shah. Ow-detr: Openworld detection transformer, 2022. 2 ↩︎ ↩︎
Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving, 2020. 2, 6 ↩︎ ↩︎ ↩︎ ↩︎
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022. 3 ↩︎ ↩︎
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean Conference on Computer Vision, pages 38–55. Springer, 2025. 3, 4
[^29] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024. 3 ↩︎ ↩︎ ↩︎ ↩︎Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, and Xiaojuan Qi. Codet: Co-occurrence guided region-word alignment for open-vocabulary object detection. Advances in neural information processing systems, 36, 2024. 3 ↩︎ ↩︎
Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al. Grounding dino 1.5: Advance the" edge" of open-set object detection. arXiv preprint arXiv:2405.10300, 2024. 3, 4, 7 ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pretraining for open-world detection. Advances in Neural Information Processing Systems, 35:9125–9138, 2022. 3 [^60]: Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. Detclipv2: Scalable open-vocabulary object detection pre-training via wordregion alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23497–23506, 2023. 3 ↩︎ ↩︎
Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022. 3, 4 ↩︎ ↩︎ ↩︎
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Regionclip: Region-based language-image pretraining, 2021. 3 ↩︎
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation, 2022. 3 ↩︎
Zongyang Ma, Guan Luo, Jin Gao, Liang Li, Yuxin Chen, Shaoru Wang, Congxuan Zhang, and Weiming Hu. Open-vocabulary one-stage detection with hierarchical visual-language knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14074–14083, 2022. 3 ↩︎
Chau Pham, Truong Vu, and Khoi Nguyen. Lp-ovod: Open-vocabulary object detection by linear probing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 779–788, 2024. 3 ↩︎
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403, 2023. 3
Xiaoshi Wu, Feng Zhu, Rui Zhao, and Hongsheng Li. Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7031–7040, 2023. 3 ↩︎
Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022. 3 ↩︎ ↩︎ ↩︎
Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. Promptdet: Towards open-vocabulary detection using uncurated images. In European Conference on Computer Vision, pages 701–717. Springer, 2022. 3 ↩︎
Mengyao Lyu, Jundong Zhou, Hui Chen, Yijie Huang, Dongdong Yu, Yaqian Li, Yandong Guo, Yuchen Guo, Liuyu Xiang, and Guiguang Ding. Box-level active detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23766–23775, 2023. 3 ↩︎ ↩︎
Soumya Roy, Asim Unmesh, and Vinay P Namboodiri. Deep active learning for object detection. In BMVC, page 91, 2018. 3 ↩︎ ↩︎
Tianning Yuan, Fang Wan, Mengying Fu, Jianzhuang Liu, Songcen Xu, Xiangyang Ji, and Qixiang Ye. Multiple instance active learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5330–5339, 2021. 3 ↩︎ ↩︎
K J Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection, 2021. 3, 6, 7, 8 ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Ruohuan Fang, Guansong Pang, Lei Zhou, Xiao Bai, and Jin Zheng. Unsupervised recognition of unknown objects for open-world object detection, 2023. 3 ↩︎ ↩︎
Zhicheng Sun, Jinghan Li, and Yadong Mu. Exploring orthogonality in open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17302–17312, 2024. 3 ↩︎ ↩︎
Zhiheng Wu, Yue Lu, Xingyu Chen, Zhengxing Wu, Liwen Kang, and Junzhi Yu. Uc-owod: Unknown-classified open world object detection. In European Conference on Computer Vision, pages 193–210. Springer, 2022. 3 ↩︎ ↩︎
Jinan Yu, Liyan Ma, Zhenglin Li, Yan Peng, and Shaorong Xie. Open-world object detection via discriminative class prototype learning. In 2022 IEEE International Conference on Image Processing (ICIP), pages 626–630. IEEE, 2022. 3 ↩︎ ↩︎
Zizhao Li, Zhengkang Xiang, Joseph West, and Kourosh Khoshelham. From open vocabulary to open world: Teaching vision language models to detect novel objects, 2024. 3, 6, 8 ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Shuailei Ma, Yuefeng Wang, Ying Wei, Jiaqi Fan, Xinyu Sun, Peihao Chen, and Enming Zhang. A simple knowledge distillation framework for open-world object detection. arXiv preprint arXiv:2312.08653, 2023. 3, 8 ↩︎ ↩︎ ↩︎ ↩︎
Orr Zohar, Alejandro Lozano, Shelly Goel, Serena Yeung, and Kuan-Chieh Wang. Open world object detection in the era of foundation models. arXiv preprint arXiv:2312.05745, 2023. 3 ↩︎ ↩︎
Yuqing Ma, Hainan Li, Zhange Zhang, Jinyang Guo, Shanghang Zhang, Ruihao Gong, and Xianglong Liu. Annealing-based label-transfer learning for open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11454–11463, 2023. 3 ↩︎
Xiaowei Zhao, Xianglong Liu, Yifan Shen, Yixuan Qiao, Yuqing Ma, and Duorui Wang. Revisiting open world object detection, 2022. 3 ↩︎
Tianxiang Hao, Hui Chen, Yuchen Guo, and Guiguang Ding. Consolidator: Mergeable adapter with grouped connections for visual adaptation, 2023. 3 ↩︎ ↩︎
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023. 3, 9 ↩︎ ↩︎ ↩︎ ↩︎
Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jungong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, and Guiguang Ding. Pyra: Parallel yielding re-activation for training-inference efficient task adaptation. In European Conference on Computer Vision, pages 455–473. Springer, 2025. 3 ↩︎ ↩︎
Hui-Yue Yang, Hui Chen, Ao Wang, Kai Chen, Zijia Lin, Yongliang Tang, Pengcheng Gao, Yuming Quan, Jungong Han, and Guiguang Ding. Promptable anomaly segmentation with sam through self-perception tuning. arXiv preprint arXiv:2411.17217, 2024. 3 ↩︎ ↩︎
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825, 2022. 3 ↩︎ ↩︎
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022. 3, 9 ↩︎ ↩︎ ↩︎ ↩︎
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. 3 ↩︎ ↩︎
Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024. 3 ↩︎ ↩︎ ↩︎ ↩︎
Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation. arXiv preprint arXiv:2310.17513, 2023. 3 ↩︎ ↩︎
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. 4 ↩︎ ↩︎
Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8429–8438, 2019. 6 ↩︎ ↩︎
Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr – modulated detection for end-to-end multi-modal understanding, 2021. 6 ↩︎ ↩︎
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015. 6 ↩︎ ↩︎
M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015. 6 ↩︎ ↩︎
Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, and Kyusong Lee. Real-time transformer-based open-vocabulary detection with efficient fusion head, 2024. 7 ↩︎ ↩︎
Ruohuan Fang, Guansong Pang, Lei Zhou, Xiao Bai, and Jin Zheng. Unsupervised recognition of unknown objects for open-world object detection. arXiv preprint arXiv:2308.16527, 2023. 8 ↩︎ ↩︎