1. Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces
Authors: Amirreza Payandeh, Daeun Song, Mohammad Nazeri, Jing Liang, Praneel Mukherjee, Amir Hossain Raj, Yangzhe Kong, Dinesh Manocha, Xuesu Xiao
Affiliations: George Mason University; University of Maryland, College Park
https://arxiv.org/abs/2501.09024
Paper Abstract
Most existing social robot navigation techniques either leverage hand-crafted rules or human demonstrations to connect robot perception to socially compliant actions. However, there remains a significant gap in effectively translating perception into socially compliant actions, much like how human reasoning naturally occurs in dynamic environments. Considering the recent success of Vision-Language Models (VLMs), we propose using language to bridge the gap in human-like reasoning between perception and socially aware robot actions. We create a vision-language dataset, Social robot Navigation via Explainable Interactions (SNEI), featuring 40K human-annotated Visual Question Answers (VQAs) based on 2K human-robot social interactions in unstructured, crowded public spaces, spanning perception, prediction, chain-of-thought reasoning, action, and explanation. We fine-tune a VLM, Social-LLaVA, using SNEI to demonstrate the practical application of our dataset. Social-LLaVA outperforms state-of-the-art models like GPT-4V and Gemini, based on the average of fifteen different human-judge scores across 50 VQA. Deployed onboard a mobile robot, Social-LLaVA enables human-like reasoning, marking a promising step toward socially compliant robot navigation in dynamic public spaces through language reasoning.
Paper Review: This is a compelling study of social robot navigation. It proposes Social-LLaVA, a vision-language model, and releases SNEI, a human-annotated dataset of 40,000 VQAs built on social human-robot interactions, designed to strengthen robot navigation in social environments by imitating human reasoning. The main contributions are threefold: first, a high-quality social-navigation dataset that directly targets the reasoning gap in current social robots; second, an exploration of how vision-language models can yield more effective understanding of social robot behavior, offering guidance for future work; and third, preliminary results showing that Social-LLaVA excels at generating navigation decisions, positioning it as a potential milestone for social robot navigation. Overall, the work carries clear theoretical and practical value for the field and merits further study and deployment.
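For readers curious what such a training sample might look like, below is a hypothetical SNEI-style VQA record in a LLaVA-like conversation format. The schema, file name, and wording here are illustrative assumptions rather than the released dataset format; only the perception-prediction-reasoning-action-explanation span comes from the abstract.

```python
# A hypothetical SNEI-style VQA record in a LLaVA-like conversation format.
# Field names, file name, and wording are illustrative assumptions, not the
# released schema; the answer spans perception, prediction, chain-of-thought
# reasoning, action, and explanation as described in the abstract.
snei_example = {
    "image": "crowded_hallway_0042.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nHow should the robot proceed through this space?"},
        {"from": "gpt",
         "value": ("Perception: two pedestrians approach on the right side of "
                   "the hallway. Prediction: they will continue along the wall. "
                   "Reasoning: passing on the left avoids cutting between them. "
                   "Action: veer slightly left and slow down. Explanation: this "
                   "keeps a comfortable social distance without blocking them.")},
    ],
}
```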
2. Vision-Language Models Do Not Understand Negation
Authors: Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
Affiliations: MIT; Google DeepMind; University of Oxford
https://arxiv.org/abs/2501.09425
Paper Abstract
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
Paper Review: This paper introduces NegBench, a benchmark targeting negation understanding in vision-language models (VLMs). It shows that current VLMs handle negation poorly, often performing at chance level. The authors propose a data-centric approach, fine-tuning CLIP models on large synthetic datasets of negated captions, which markedly improves performance on negation tasks. The experiments both expose the limits of VLMs in understanding and handling negation and quantify the improvement the method delivers; results across different datasets and task variants further underline the shortcomings of current models. Overall, the paper provides a valuable dataset and experimental framework for studying negation understanding in VLMs and is of real significance for improving VLM performance in this area.
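The failure mode NegBench quantifies is easy to probe with an off-the-shelf CLIP checkpoint. The sketch below uses the standard Hugging Face transformers API; the blank image and the caption pair are placeholders of our own, not NegBench items.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))       # stand-in for a real street photo
captions = [
    "a photo of a street with cars",       # affirmative caption
    "a photo of a street with no cars",    # negated caption
]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image   # shape (1, 2)
print(logits.softmax(dim=-1))
# A negation-blind encoder scores both captions almost identically,
# because "no cars" still embeds close to "cars".
```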
3. LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
Authors: Kyeongha Rho, Hyeongkeun Lee, Valentio Iverson, Joon Son Chung
Affiliations: KAIST; University of Waterloo
https://arxiv.org/abs/2501.09291
Paper Abstract
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at https://github.com/NAVER-INTEL-Co-Lab/gaudi-lavcap.
Paper Review: LAVCap is a notable contribution to multimodal captioning, offering a new perspective by using optimal transport to bridge the gap between audio and visual information. The authors' framework combines LLM-based captioning with optimal-transport principles; the approach not only achieves efficient cross-modal alignment but also clearly improves the quality and accuracy of the generated captions.
The experimental evaluation shows that LAVCap outperforms its competitors on the AudioCaps dataset, surpassing existing state-of-the-art methods. These findings underscore the importance of incorporating visual information into training to obtain more coherent and accurate audio captions, and the included human evaluation highlights LAVCap's practical potential in neighboring areas such as text-to-speech synthesis and speech recognition systems.
Overall, LAVCap represents a significant step forward in multimodal language processing, demonstrating the power of optimal transport for multimedia content understanding. The paper both advances our understanding of multimodal learning and offers valuable insight into future research directions in this area.
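For orientation, here is a minimal sketch of entropic optimal transport between audio and visual token features, in the spirit of LAVCap's alignment loss and OT attention; the dimensions, hyperparameters, and exact loss form are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, iters=50):
    """Entropic-OT transport plan for a cost matrix of shape (n_audio, n_visual)."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)           # uniform mass over audio tokens
    b = torch.full((m,), 1.0 / m)           # uniform mass over visual patches
    K = torch.exp(-cost / eps)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):                  # alternating Sinkhorn projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan P = diag(u) K diag(v)

audio = F.normalize(torch.randn(16, 256), dim=-1)    # 16 audio tokens (placeholder)
visual = F.normalize(torch.randn(49, 256), dim=-1)   # 49 visual patches (placeholder)
cost = 1.0 - audio @ visual.T                        # cosine-distance cost matrix
P = sinkhorn_plan(cost)
ot_align_loss = (P * cost).sum()                     # OT alignment loss
attn = P / P.sum(dim=-1, keepdim=True)               # OT-derived attention map
fused_audio = attn @ visual                          # fuse visual context into audio tokens
```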
4. Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition
Authors: Takaaki Hori, Martin Kocour, Adnan Haider, Erik McDermott, Xiaodan Zhuang
Affiliations: Apple; Brno University of Technology
https://arxiv.org/abs/2501.09258
Paper Abstract
This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). Although shallow fusion is the most common approach to incorporate language models into E2E-ASR decoding, we face two practical problems with LLMs. (1) LLM inference is computationally costly. (2) There may be a vocabulary mismatch between the ASR model and the LLM. To resolve this mismatch, we need to retrain the ASR model and/or the LLM, which is at best time-consuming and in many cases not feasible. We propose delayed fusion, which applies LLM scores to ASR hypotheses with a delay during decoding and enables easier use of pre-trained LLMs in ASR tasks. This method can reduce not only the number of hypotheses scored by the LLM but also the number of LLM inference calls. It also allows re-tokenization of ASR hypotheses during decoding if ASR and LLM employ different tokenizations. We demonstrate that delayed fusion provides improved decoding speed and accuracy compared to shallow fusion and N-best rescoring using the LibriHeavy ASR corpus and three public LLMs, OpenLLaMA 3B & 7B and Mistral 7B.
Paper Review: This paper explores a new method called "delayed fusion" for integrating large language models (LLMs) into end-to-end automatic speech recognition (E2E-ASR) systems. The approach tackles two practical obstacles: the computational cost of LLM inference and the vocabulary mismatch between the ASR model and the LLM. Experiments on the LibriHeavy corpus show that delayed fusion improves both decoding speed and accuracy, providing strong evidence for the method. Moreover, the technique allows pre-trained LLMs to be used without retraining the ASR model, making it well suited to real-world deployment. In short, the paper presents an effective solution with demonstrated gains in decoding speed and accuracy.
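A simplified sketch of the scheduling idea follows: beam search ranks hypotheses by ASR score alone between fusion points, and only every few steps re-tokenizes the survivors for the LLM and folds in its log-probabilities. The beam layout, helper function, fusion weight, and delay below are hypothetical stand-ins assuming a Hugging Face-style causal LM, not the paper's implementation.

```python
import torch

def llm_logprob(llm, ids):
    """Sum of next-token log-probs an HF-style causal LM assigns to `ids`."""
    with torch.no_grad():
        logits = llm(ids).logits[:, :-1]            # predictions for tokens 1..T
        logp = logits.log_softmax(dim=-1)
        return logp.gather(-1, ids[:, 1:, None]).sum().item()

def delayed_fusion_rank(beams, llm, llm_tokenizer, step, delay=4, lam=0.3):
    """Rank beam hypotheses, folding in LLM scores only at fusion points.

    beams: list of (hypothesis_text, asr_logprob) pairs. Between fusion
    points, hypotheses are ranked by ASR score alone, so no LLM calls are
    made; every `delay`-th step, survivors are re-tokenized with the LLM's
    own vocabulary (sidestepping the ASR/LLM vocab mismatch) and rescored.
    """
    if step % delay != 0:
        return sorted(beams, key=lambda b: b[1], reverse=True)
    fused = []
    for text, asr_lp in beams:
        ids = llm_tokenizer(text, return_tensors="pt").input_ids
        fused.append((text, asr_lp + lam * llm_logprob(llm, ids)))
    return sorted(fused, key=lambda b: b[1], reverse=True)
```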
5. Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Authors: Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, Dongsheng Li
Affiliations: The Chinese University of Hong Kong; Microsoft Research Asia; The Chinese University of Hong Kong, Shenzhen Research Institute (SZRI)
https://arxiv.org/abs/2501.09695
Paper Abstract
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
Paper Review: This paper examines how expert feedback can improve Direct Preference Optimization (DPO) for mitigating hallucinations in Large Vision-Language Models. Analyzing prior work, the authors find that DPO-based methods generally need large amounts of training data to perform well, and they propose a new framework, OPA-DPO, to address this.
The core of OPA-DPO is to supervise the learning process with expert feedback, correcting hallucinated responses and aligning both the original and revised responses on-policy, which improves the model's generalization. The theoretical analysis lays out why this works, with supporting mathematical arguments that offer useful guidance for practical application.
Experiments show that OPA-DPO markedly improves on standard DPO while using substantially less data, achieving higher accuracy and better robustness. This matters for large-scale preference-tuning pipelines where curated data is expensive.
In summary, the proposed OPA-DPO framework resolves common weaknesses of DPO-based hallucination mitigation, and both its theoretical analysis and experiments demonstrate its advantages. Future work could explore how to use limited data even more effectively and how to further optimize OPA-DPO for overall performance.
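For reference, the vanilla pairwise DPO objective that OPA-DPO builds on fits in a few lines; the sketch below uses toy tensors and an arbitrary beta, and deliberately omits OPA-DPO's expert-feedback and on-policy data-construction pipeline.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla pairwise DPO loss.

    pi_logp_*  : summed response log-probs under the policy being trained
    ref_logp_* : the same under the frozen reference policy
    _w / _l    : preferred (less hallucinated) / dispreferred response
    """
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy batch of summed log-probabilities (placeholders, not real model outputs).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-12.5]), torch.tensor([-14.0]))
print(loss)
```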