1. Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces
Authors: Amirreza Payandeh, Daeun Song, Mohammad Nazeri, Jing Liang, Praneel Mukherjee, Amir Hossain Raj, Yangzhe Kong, Dinesh Manocha, Xuesu Xiao
Affiliations: George Mason University; University of Maryland, College Park
https://arxiv.org/abs/2501.09024
Paper Abstract
Most existing social robot navigation techniques either leverage hand-crafted rules or human demonstrations to connect robot perception to socially compliant actions. However, there remains a significant gap in effectively translating perception into socially compliant actions, much like how human reasoning naturally occurs in dynamic environments. Considering the recent success of Vision-Language Models (VLMs), we propose using language to bridge the gap in human-like reasoning between perception and socially aware robot actions. We create a vision-language dataset, Social robot Navigation via Explainable Interactions (SNEI), featuring 40K human-annotated Visual Question Answers (VQAs) based on 2K human-robot social interactions in unstructured, crowded public spaces, spanning perception, prediction, chain-of-thought reasoning, action, and explanation. We fine-tune a VLM, Social-LLaVA, using SNEI to demonstrate the practical application of our dataset. Social-LLaVA outperforms state-of-the-art models like GPT-4V and Gemini, based on the average of fifteen different human-judge scores across 50 VQA. Deployed onboard a mobile robot, Social-LLaVA enables human-like reasoning, marking a promising step toward socially compliant robot navigation in dynamic public spaces through language reasoning.
Paper Review: This is a compelling study of social robot navigation. It proposes Social-LLaVA, a vision-language model, and releases SNEI, a human-annotated dataset of 40,000 VQAs built on social human-robot interactions, designed to strengthen robot navigation in social environments by imitating human reasoning. The main contributions are threefold: first, a high-quality social-navigation dataset that directly targets the reasoning gap in current social robots; second, an exploration of how vision-language models can yield more effective understanding of social robot behavior, offering guidance for future work; and third, preliminary results showing that Social-LLaVA excels at generating navigation decisions, positioning it as a potential milestone for social robot navigation. Overall, the work carries clear theoretical and practical value for the field and merits further study and deployment.
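For readers curious what such a training sample might look like, below is a hypothetical SNEI-style VQA record in a LLaVA-like conversation format. The schema, file name, and wording here are illustrative assumptions rather than the released dataset format; only the perception-prediction-reasoning-action-explanation span comes from the abstract.

```python
# A hypothetical SNEI-style VQA record in a LLaVA-like conversation format.
# Field names, file name, and wording are illustrative assumptions, not the
# released schema; the answer spans perception, prediction, chain-of-thought
# reasoning, action, and explanation as described in the abstract.
snei_example = {
    "image": "crowded_hallway_0042.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nHow should the robot proceed through this space?"},
        {"from": "gpt",
         "value": ("Perception: two pedestrians approach on the right side of "
                   "the hallway. Prediction: they will continue along the wall. "
                   "Reasoning: passing on the left avoids cutting between them. "
                   "Action: veer slightly left and slow down. Explanation: this "
                   "keeps a comfortable social distance without blocking them.")},
    ],
}
```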
2. Vision-Language Models Do Not Understand Negation
Authors: Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
Affiliations: MIT; Google DeepMind; University of Oxford
https://arxiv.org/abs/2501.09425
Paper Abstract
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
Paper Review: This paper introduces NegBench, a benchmark targeting negation understanding in vision-language models (VLMs). It shows that current VLMs handle negation poorly, often performing at chance level. The authors propose a data-centric approach, fine-tuning CLIP models on large synthetic datasets of negated captions, which markedly improves performance on negation tasks. The experiments both expose the limits of VLMs in understanding and handling negation and quantify the improvement the method delivers; results across different datasets and task variants further underline the shortcomings of current models. Overall, the paper provides a valuable dataset and experimental framework for studying negation understanding in VLMs and is of real significance for improving VLM performance in this area.
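The failure mode NegBench quantifies is easy to probe with an off-the-shelf CLIP checkpoint. The sketch below uses the standard Hugging Face transformers API; the blank image and the caption pair are placeholders of our own, not NegBench items.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))       # stand-in for a real street photo
captions = [
    "a photo of a street with cars",       # affirmative caption
    "a photo of a street with no cars",    # negated caption
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # shape (1, 2)
print(logits.softmax(dim=-1))
# A negation-blind encoder scores both captions almost identically,
# because "no cars" still embeds close to "cars".
```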
3. LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
Authors: Kyeongha Rho, Hyeongkeun Lee, Valentio Iverson, Joon Son Chung
Affiliations: KAIST; University of Waterloo
https://arxiv.org/abs/2501.09291
Paper Abstract
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at https://github.com/NAVER-INTEL-Co-Lab/gaudi-lavcap.
Paper Review: LAVCap is a notable contribution to multimodal captioning, offering a new perspective by using optimal transport to bridge the gap between audio and visual information. The authors' framework combines LLM-based captioning with optimal-transport principles; the approach not only achieves efficient cross-modal alignment but also clearly improves the quality and accuracy of the generated captions.
The experimental evaluation shows that LAVCap outperforms its competitors on the AudioCaps dataset, surpassing existing state-of-the-art methods. These findings underscore the importance of incorporating visual information into training to obtain more coherent and accurate audio captions, and the included human evaluation highlights LAVCap's practical potential in neighboring areas such as text-to-speech synthesis and speech recognition systems.
Overall, LAVCap represents a significant step forward in multimodal language processing, demonstrating the power of optimal transport for multimedia content understanding. The paper both advances our understanding of multimodal learning and offers valuable insight into future research directions in this area.
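For orientation, here is a minimal sketch of entropic optimal transport between audio and visual token features, in the spirit of LAVCap's alignment loss and OT attention; the dimensions, hyperparameters, and exact loss form are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, iters=50):
    """Entropic-OT transport plan for a cost matrix of shape (n_audio, n_visual)."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)           # uniform mass over audio tokens
    b = torch.full((m,), 1.0 / m)           # uniform mass over visual patches
    K = torch.exp(-cost / eps)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):                  # alternating Sinkhorn projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan P = diag(u) K diag(v)

audio = F.normalize(torch.randn(16, 256), dim=-1)    # 16 audio tokens (placeholder)
visual = F.normalize(torch.randn(49, 256), dim=-1)   # 49 visual patches (placeholder)
cost = 1.0 - audio @ visual.T                        # cosine-distance cost matrix
P = sinkhorn_plan(cost)
ot_align_loss = (P * cost).sum()                     # OT alignment loss
attn = P / P.sum(dim=-1, keepdim=True)               # OT-derived attention map
fused_audio = attn @ visual                          # fuse visual context into audio tokens
```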
4. Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition
Authors: Takaaki Hori, Martin Kocour, Adnan Haider, Erik McDermott, Xiaodan Zhuang
Affiliations: Apple; Brno University of Technology
https://arxiv.org/abs/2501.09258
Paper Abstract
This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). Although shallow fusion is the most common approach to incorporate language models into E2E-ASR decoding, we face two practical problems with LLMs. (1) LLM inference is computationally costly. (2) There may be a vocabulary mismatch between the ASR model and the LLM. To resolve this mismatch, we need to retrain the ASR model and/or the LLM, which is at best time-consuming and in many cases not feasible. We propose delayed fusion, which applies LLM scores to ASR hypotheses with a delay during decoding and enables easier use of pre-trained LLMs in ASR tasks. This method can reduce not only the number of hypotheses scored by the LLM but also the number of LLM inference calls. It also allows re-tokenization of ASR hypotheses during decoding if ASR and LLM employ different tokenizations. We demonstrate that delayed fusion provides improved decoding speed and accuracy compared to shallow fusion and N-best rescoring using the LibriHeavy ASR corpus and three public LLMs, OpenLLaMA 3B & 7B and Mistral 7B.
Paper Review: This paper explores a new method called "delayed fusion" for integrating large language models (LLMs) into end-to-end automatic speech recognition (E2E-ASR) systems. The approach tackles two practical obstacles: the computational cost of LLM inference and the vocabulary mismatch between the ASR model and the LLM. Experiments on the LibriHeavy corpus show that delayed fusion improves both decoding speed and accuracy, providing strong evidence for the method. Moreover, the technique allows pre-trained LLMs to be used without retraining the ASR model, making it well suited to real-world deployment. In short, the paper presents an effective solution with demonstrated gains in decoding speed and accuracy.
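A simplified sketch of the scheduling idea follows: beam search ranks hypotheses by ASR score alone between fusion points, and only every few steps re-tokenizes the survivors for the LLM and folds in its log-probabilities. The beam layout, helper function, fusion weight, and delay below are hypothetical stand-ins assuming a Hugging Face-style causal LM, not the paper's implementation.

```python
import torch

def llm_logprob(llm, ids):
    """Sum of next-token log-probs an HF-style causal LM assigns to `ids`."""
    with torch.no_grad():
        logits = llm(ids).logits[:, :-1]            # predictions for tokens 1..T
        logp = logits.log_softmax(dim=-1)
        return logp.gather(-1, ids[:, 1:, None]).sum().item()

def delayed_fusion_rank(beams, llm, llm_tokenizer, step, delay=4, lam=0.3):
    """Rank beam hypotheses, folding in LLM scores only at fusion points.

    beams: list of (hypothesis_text, asr_logprob) pairs. Between fusion
    points, hypotheses are ranked by ASR score alone, so no LLM calls are
    made; every `delay`-th step, survivors are re-tokenized with the LLM's
    own vocabulary (sidestepping the ASR/LLM vocab mismatch) and rescored.
    """
    if step % delay != 0:
        return sorted(beams, key=lambda b: b[1], reverse=True)
    fused = []
    for text, asr_lp in beams:
        ids = llm_tokenizer(text, return_tensors="pt").input_ids
        fused.append((text, asr_lp + lam * llm_logprob(llm, ids)))
    return sorted(fused, key=lambda b: b[1], reverse=True)
```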
5. Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Authors: Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, Dongsheng Li
Affiliations: The Chinese University of Hong Kong; Microsoft Research Asia; The Chinese University of Hong Kong, Shenzhen Research Institute (SZRI)
https://arxiv.org/abs/2501.09695
Paper Abstract
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
Paper Review: This paper examines how expert feedback can improve Direct Preference Optimization (DPO) for mitigating hallucinations in Large Vision-Language Models. Analyzing prior work, the authors find that DPO-based methods generally need large amounts of training data to perform well, and they propose a new framework, OPA-DPO, to address this.
The core of OPA-DPO is to supervise the learning process with expert feedback, correcting hallucinated responses and aligning both the original and revised responses on-policy, which improves the model's generalization. The theoretical analysis lays out why this works, with supporting mathematical arguments that offer useful guidance for practical application.
Experiments show that OPA-DPO markedly improves on standard DPO while using substantially less data, achieving higher accuracy and better robustness. This matters for large-scale preference-tuning pipelines where curated data is expensive.
In summary, the proposed OPA-DPO framework resolves common weaknesses of DPO-based hallucination mitigation, and both its theoretical analysis and experiments demonstrate its advantages. Future work could explore how to use limited data even more effectively and how to further optimize OPA-DPO for overall performance.
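For reference, the vanilla pairwise DPO objective that OPA-DPO builds on fits in a few lines; the sketch below uses toy tensors and an arbitrary beta, and deliberately omits OPA-DPO's expert-feedback and on-policy data-construction pipeline.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla pairwise DPO loss.

    pi_logp_*  : summed response log-probs under the policy being trained
    ref_logp_* : the same under the frozen reference policy
    _w / _l    : preferred (less hallucinated) / dispreferred response
    """
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy batch of summed log-probabilities (placeholders, not real model outputs).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-12.5]), torch.tensor([-14.0]))
print(loss)
```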