NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models


This post is part of the LLM article series; it is a translation of "NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models".

NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models

Abstract

Understanding the reasoning capabilities of multimodal large language models (MLLMs) is an important area of research. In this study, we introduce a dynamic benchmark, NPHardEval4V, aimed at addressing the existing gap in evaluating the pure reasoning abilities of MLLMs. Our benchmark is designed to provide a venue for disentangling the effects of various factors, such as image recognition and instruction following, from models' overall performance, allowing us to focus on evaluating their reasoning abilities. It is constructed by converting the textual descriptions of problems from NPHardEval into image representations. Our findings reveal significant differences in reasoning ability across models and highlight the relatively weak reasoning performance of MLLMs compared with LLMs. We also investigate the effect of different prompting styles, including visual, textual, and combined visual-and-textual prompts, on the reasoning abilities of MLLMs, demonstrating the differing impacts of multimodal inputs on model performance. Unlike traditional benchmarks, which focus mainly on static evaluation, our benchmark will be updated monthly to prevent overfitting and ensure a more authentic and fine-grained evaluation of the models. We believe this benchmark can help in understanding and guiding the further development of MLLMs' reasoning capabilities. The benchmark dataset and code are available at https://github.com/lizhouf/NPHardEval4V.
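The abstract compares three prompting styles: visual only, textual only, and combined visual-and-textual. A minimal sketch of how such prompt variants might be assembled for a pre-rendered problem image (all function and field names here are illustrative assumptions, not the benchmark's actual API):

```python
def build_prompt(style, question_text, image_path):
    """Assemble an MLLM query in one of the three prompt styles
    discussed in the paper (names are hypothetical).

    style: "visual"   -> image only, generic instruction
           "textual"  -> text description only, no image
           "combined" -> both the image and the text description
    """
    if style == "visual":
        return {"images": [image_path],
                "text": "Solve the problem shown in the image."}
    if style == "textual":
        return {"images": [], "text": question_text}
    if style == "combined":
        return {"images": [image_path], "text": question_text}
    raise ValueError(f"unknown prompt style: {style}")


# Example: a graph-coloring instance rendered to PNG beforehand
prompt = build_prompt("combined",
                      "Color this graph with 3 colors so no edge is monochromatic.",
                      "gcp_001.png")
```

Holding the underlying problem instance fixed while varying only `style` is what lets the effect of input modality be isolated from the problem's difficulty.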

1 Introduction

Chain-of-Thought Prompting Mechanism in Large Language Models

In large language models, chain-of-thought prompting serves as a method to enhance reasoning capabilities by guiding the model through structured thought processes. This approach breaks complex problems into simpler components and provides step-by-step guidance that mirrors human cognitive processing. The creation of these prompts typically involves selecting examples from training datasets where each example represents part of an overall problem-solving process[^2]. By decomposing tasks into multiple steps, this technique encourages deeper understanding and more accurate predictions than traditional methods. For instance, when faced with multi-hop question answering or logical deduction challenges, such chains allow models not only to generate correct answers but also to articulate the intermediate thoughts leading to those conclusions. This transparency facilitates better interpretability while improving performance on various NLP benchmarks.

```python
def create_chain_of_thought_prompt(task_description, examples):
    """
    Creates a chain-of-thought prompt from a task description and examples.

    Args:
        task_description (str): Description of the task at hand.
        examples (list): List of (input, output) tuples used for demonstration.

    Returns:
        str: Formatted prompt including both instructions and sample cases.
    """
    formatted_examples = "\n".join(
        f"Input: {ex[0]}, Output: {ex[1]}" for ex in examples
    )
    return f"""
Task: {task_description}

Examples:
{formatted_examples}

Now try solving similar questions following the above pattern.
"""


# Example usage
examples = [
    ("What color do you get mixing red and blue?", "Purple"),
    ("If it rains tomorrow, will we have our picnic?", "No"),
]
print(create_chain_of_thought_prompt("Solve logic puzzles", examples))
```
UnknownBody

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值