Preface
It has been a while since I last wrote a paper interpretation, and this paper happens to unlock the detection capability of MLLMs. I wanted to get to the bottom of it, so here is my walkthrough. The abstract, translated: We propose DetToolChain, a new prompting paradigm that unleashes the zero-shot object detection ability of multimodal large language models (MLLMs) such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new chain-of-thought for applying these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), to read coordinates against measurement standards (e.g., overlaid rulers and compasses), and to reason from contextual information (e.g., overlaid scene graphs). Building on these tools, the new detection chain-of-thought automatically decomposes the task into simple subtasks, diagnoses the predictions, and plans progressive refinements of the bounding boxes. The effectiveness of our framework is demonstrated across a range of detection tasks, especially on hard cases. Compared with existing state-of-the-art methods, GPT-4V with our DetToolChain improves open-vocabulary detection on MS COCO novel classes by +21.5% AP50, raises zero-shot referring expression comprehension accuracy on the RefCOCO val set by 24.23%, and improves described object detection on D-cube (FULL setting) by +14.5% AP.
Paper: https://arxiv.org/pdf/2403.12488
1. Introduction
Large language models (LLMs), e.g., GPT-3 [6], Gemini [42], InternLM [43], and Qwen [2], have shown unprecedented capabilities in understanding human languages and solving practical problems such as scientific question answering and code generation. When integrated with visual encoders, large language models can be upgraded to multimodal large language models (MLLMs), which can achieve abilities similar to human visual intelligence and tackle visual understanding tasks such as image captioning. Despite these advances, the potential of MLLMs in detection tasks is still underestimated among common vision tasks [45,58,63,67]. When asked to provide precise coordinates in complicated object detection tasks, e.g., detecting highly occluded, rotated, or small objects in scene images, MLLMs often miss the target objects or answer with inaccurate bounding boxes [58]. The poor performance on object detection significantly limits the applications of MLLMs in the real world, e.g., defect detection [8, 16, 70] and sports analysis [7, 44, 46].
To enhance the detection capabilities of MLLMs, prior efforts fall into two classes. (1) Finetuning MLLMs on high-quality question-answer instructions whose answers contain abundant location information [3, 12, 27, 34]. Despite the considerable improvements achieved, preparing high-quality question-answer pairs requires great manual effort, and finetuning multimodal large language models incurs large computational costs. Furthermore, since current state-of-the-art MLLMs [1, 3, 42] are closed-source and their performance is significantly superior to open-source models, the "finetuning" approach cannot be applied to the most powerful MLLMs at the moment (and most likely in the future), which significantly limits its potential to continuously improve the emerging state-of-the-art MLLMs. (2) Designing textual or visual prompts with location information to advance the localization ability of MLLMs. While intuition-based prompting methods have greatly advanced regional comprehension tasks such as compositional reasoning [33] and spatial understanding [10], their effectiveness on detection tasks remains underexplored.
This work explores how the detection ability of multimodal large language models can be unlocked by a new chain of thought over a detection prompting toolkit (dubbed DetToolChain). The new DetToolChain is motivated by three ideas. First, visual prompts are identified as a crucial component of the detection prompting toolkit. They offer a more direct and intuitive approach to enhancing the spatial comprehension of MLLMs than language prompts, because current MLLMs still struggle to accurately translate textual coordinates and descriptions into precise regions and visual information. Visual prompts drawn directly in the image can significantly narrow the gap between visual and textual information and ultimately contribute to the improved detection ability of MLLMs. Second, detecting challenging instances, such as occluded and small objects, can be tackled more efficiently by breaking them down into smaller, simpler subtasks. Third, the detection results should be refined step by step using a chain of thought, similar to the progressive refinement of bounding boxes in current state-of-the-art object detection algorithms such as DETR [9], Sparse R-CNN [41], and DiffusionDet [14]. Based on these ideas, the proposed DetToolChain consists of a detection prompting toolkit, including visual processing prompts and detection reasoning prompts, and a multimodal chain of thought that applies these prompts properly to unleash the detection ability of MLLMs; a minimal sketch of this loop is given right below, after which the toolkit's critical designs and insights are detailed:
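Before going through the toolkit, here is a minimal Python sketch of the loop just described: plan subtasks, draw a visual prompt on the image, predict boxes, diagnose, and iterate. Everything below is an illustrative assumption rather than the authors' released code: the `mllm` callable stands in for any vision-capable chat API (e.g., GPT-4V or Gemini), and the prompt strings, JSON formats, and function names are hypothetical.

```python
import json
from typing import Callable, Dict, List

# Stand-in for any vision-capable chat model (e.g., GPT-4V or Gemini):
# takes image bytes plus a text prompt, returns the model's text reply.
# This signature is an assumption for illustration, not the paper's API.
MLLM = Callable[[bytes, str], str]

def dettoolchain_detect(image: bytes, target: str, mllm: MLLM,
                        apply_visual_prompt: Callable[[bytes, str], bytes],
                        max_rounds: int = 3) -> List[Dict]:
    """Sketch of DetToolChain's plan -> prompt -> predict -> diagnose loop."""
    # Step 1: let the MLLM decompose the detection task into subtasks,
    # e.g., "zoom into the top-left quadrant" or "overlay a ruler".
    plan = mllm(image, f"Plan simple subtasks to detect every '{target}'. "
                       "Answer as a JSON list of visual prompt names.")
    subtasks = json.loads(plan)  # real code would parse defensively

    boxes: List[Dict] = []
    for _ in range(max_rounds):
        for tool in subtasks:
            # Step 2: apply a visual processing prompt drawn on the image.
            prompted = apply_visual_prompt(image, tool)
            # Step 3: predict boxes, conditioned on the current predictions.
            reply = mllm(prompted,
                         f"Current boxes: {json.dumps(boxes)}. Detect every "
                         f"'{target}' and return refined boxes as JSON "
                         '[{"bbox": [x0, y0, x1, y1]}].')
            boxes = json.loads(reply)
        # Step 4: detection reasoning prompt -- diagnose the predictions.
        verdict = mllm(image, f"Boxes: {json.dumps(boxes)}. Are these "
                              f"accurate and complete for '{target}'? "
                              "Answer DONE or REFINE.")
        if "DONE" in verdict:
            break  # otherwise run another refinement round
    return boxes
```

In practice the replies would need robust parsing; the point is only the control flow, which mirrors the progressive box refinement of DETR-style detectors.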
(1) A comprehensive set of visual processing prompts that support a wide range of detection tasks. The visual processing prompts process the given images to facilitate the detection performance of MLLMs, drawing on well-accepted prior knowledge and techniques from effective detectors. Specifically, the visual processing prompts fall into three categories, i.e., the regional amplifier, the spatial measurement standard, and the scene image parser, each addressing a different key factor of good detectors. First, the regional amplifier comprises image splitting and zooming in, which highlight the region of interest in detection tasks. Second, the spatial measurement standard includes rulers and compasses with linear graduations, which provide translational and rotational references for object detection (particularly rotated object detection), respectively. As with human perception, these spatial measurement standards can help MLLMs locate objects and read their coordinates off the image. Third, the scene parser marks the predicted positions or spatial relations of objects in the images with convex hulls, object bounding boxes, and scene graphs. These markers facilitate the detection capability of MLLMs by encouraging them to reason from contextual information in the scene images. (A minimal code sketch of these prompts follows.)
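To make these prompt families concrete, here is a minimal sketch, assuming Pillow is available, of two of them: a regional amplifier that crops and zooms into a region of interest, and a spatial measurement standard that overlays pixel rulers with linear graduations. The tick spacing, colors, and file names are illustrative choices, not the paper's.

```python
from PIL import Image, ImageDraw

def regional_amplifier(img: Image.Image, region: tuple,
                       scale: int = 2) -> Image.Image:
    """Crop the region of interest and zoom in, so small or occluded
    objects occupy more of the MLLM's visual input."""
    x0, y0, x1, y1 = region
    crop = img.crop((x0, y0, x1, y1))
    return crop.resize(((x1 - x0) * scale, (y1 - y0) * scale))

def overlay_ruler(img: Image.Image, tick: int = 50) -> Image.Image:
    """Draw rulers with linear graduations along the top and left edges,
    giving the MLLM a translational reference for reading coordinates."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    for x in range(0, out.width, tick):   # top ruler
        draw.line([(x, 0), (x, 10)], fill="red", width=2)
        draw.text((x + 2, 12), str(x), fill="red")
    for y in range(0, out.height, tick):  # left ruler
        draw.line([(0, y), (10, y)], fill="red", width=2)
        draw.text((12, y + 2), str(y), fill="red")
    return out

# Example: zoom into the top-left quadrant, then add measurement marks.
# "scene.jpg" is a placeholder input image.
img = Image.open("scene.jpg")
prompted = overlay_ruler(
    regional_amplifier(img, (0, 0, img.width // 2, img.height // 2)))
prompted.save("scene_prompted.jpg")
```

A scene parser prompt would analogously use `ImageDraw.rectangle` or `ImageDraw.polygon` to mark predicted boxes or convex hulls on the image before re-querying the model.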
(2) A comprehensive set of detection reasoning prompts that help MLLMs diagnose the detection results and reason about the next visual processing prompts to apply. Unlike the visual processing prompts, the detection reasoning prompts do not process the images; they are dedicated to evaluating the predicted bounding boxes and diagnosing predictions that remain inaccurate even with the visual processing prompts. By analyzing the relationships and reasoning about the co-occurrence of detected objects in the scene image with the commonsense knowledge in the MLLM, these