AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION
https://download.youkuaiyun.com/download/ajian005/90168014

ABSTRACT
Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
A more detailed explanation:
Imagine a future in which artificial intelligence no longer lives only in your phone or computer, but exists in the environment around us as an “agent.” It can “see” us (through cameras), “hear” us (through microphones), and even understand the environment we are in. This is the vision behind combining “multimodal AI” with “embodied agents.”
What is multimodal AI?
Simply put, it is an AI system that can process many types of information, such as images, sound, and text, perceiving the world through multiple senses just as humans do.
What is an embodied agent?
It means placing the AI inside a “body,” which can be a robot or an avatar in a virtual world. Through this body, the AI can interact with its environment rather than merely process data.
Agent AI
The “Agent AI” proposed in this paper refers to systems that:
● can perceive their environment, including visual information (such as objects and people), language information (such as what we say), and other environmental signals;
● can act on what they perceive, for example by moving, speaking, or manipulating objects.
Why study Agent AI?
● Stronger interactivity: embodied agents can interact with humans more naturally and provide more human-centered services.
● Stronger environmental awareness: by perceiving their surroundings, agents can better understand context and make wiser decisions.
● Fewer “hallucinations”: large language models sometimes produce outputs that are factually or logically wrong, known as “hallucinations.” Letting agents “live” in real or simulated environments helps them better understand the real world and thus hallucinate less.
Looking ahead
The paper envisions a future in which we can easily create all kinds of virtual scenes and interact with the agents embodied in them, opening up endless possibilities for gaming, education, training, and more.
In short, Agent AI is a promising research direction that will push AI beyond pure data processing toward greater intelligence and interactivity, and it may profoundly change our lives.
1 Introduction
1.1 Motivation
Historically, AI systems were defined at the 1956 Dartmouth Conference as artificial life forms that could collect information from the environment and interact with it in useful ways. Motivated by this definition, in 1970 Minsky’s MIT group built a robotics system called the “Copy Demo,” which observed “blocks world” scenes and successfully reconstructed the observed polyhedral block structures. The system, which comprised observation, planning, and manipulation modules, revealed that each of these subproblems is highly challenging and that further research was necessary. The AI field fragmented into specialized subfields that have largely independently made great progress in tackling these and other problems, but over-reductionism has blurred the overarching goals of AI research.
To advance beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian Holism. Fortunately, the recent revolution in Large Language Models (LLMs) and Visual Language Models (VLMs) has made it possible to create novel AI agents consistent with the holistic ideal. Seizing upon this opportunity, this article explores models that integrate language proficiency, visual cognition, context memory, intuitive reasoning, and adaptability. It explores the potential completion of this holistic synthesis using LLMs and VLMs. In our exploration, we also revisit system design based on Aristotle’s Final Cause, the teleological “why the system exists”, which may have been overlooked in previous rounds of AI development.
With the advent of powerful pretrained LLMs and VLMs, a renaissance in natural language processing and computer vision has been catalyzed. LLMs now demonstrate an impressive ability to decipher the nuances of real-world linguistic data, often achieving abilities that parallel or even surpass human expertise (OpenAI, 2023). Recently, researchers have shown that LLMs may be extended to act as agents within various environments, performing intricate actions and tasks when paired with domain-specific knowledge and modules (Xi et al., 2023). These scenarios, characterized by complex reasoning, understanding of the agent’s role and its environment, along with multi-step planning, test the agent’s ability to make highly nuanced and intricate decisions within its environmental constraints (Wu et al., 2023; Meta Fundamental AI Research (FAIR) Diplomacy Team et al., 2022).
Building upon these initial efforts, the AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments. In this context, this article investigates the immense potential of using LLMs and VLMs as agents, emphasizing models that have a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability. Leveraging LLMs and VLMs as agents, especially within domains like gaming, robotics, and healthcare, promises not just a rigorous evaluation platform for state-of-the-art AI systems, but also foreshadows the transformative impacts that Agent-centric AI will have across society and industries. When fully harnessed, agentic models can redefine human experiences and elevate operational standards. The potential for sweeping automation ushered in by these models portends monumental shifts in industries and socio-economic dynamics. Such advancements will be intertwined with multifaceted challenges, not only technical but also ethical, as we will elaborate upon in Section 11. We delve into the overlapping areas of these sub-fields of Agent AI and illustrate their interconnectedness in Fig. 1.
In short:
This passage explains the motivation for studying “Agent AI.”
● Historical review: early definitions of AI emphasized agents that interact with their environment, but later research focused too narrowly on subproblems and lost sight of the overall goal.
● A new opportunity: the emergence of large language models (LLMs) and visual language models (VLMs) makes it possible to build “agents” that combine language, vision, memory, reasoning, and adaptability.
● Back to fundamentals: research needs to return to the fundamental goals of AI and rethink the “final cause” of system design, that is, why a system exists and what purpose it serves.
● The power of LLMs and VLMs: LLMs and VLMs excel at understanding language and visual information and can serve as the core components of agents.
● Paradigm shift: AI is moving from models that handle passive tasks toward “agents” that can play active roles in complex environments.
● Broad applications and impact: “Agent AI” will have a major impact on gaming, robotics, healthcare, and other fields and bring socio-economic change, while also demanding attention to ethical issues.
This passage emphasizes the importance of using LLMs and VLMs to build “agents” and looks ahead to the future development and potential impact of “Agent AI.”
1.2 Background
We will now introduce relevant research papers that support the concepts, theoretical background, and modern implementations of Agent AI.
Large Foundation Models: LLMs and VLMs have been driving the effort to develop general intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained using large text corpora, their superior problem-solving capacity is not limited to canonical language processing domains. LLMs can potentially tackle complex tasks that were previously presumed to be exclusive to human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering questions of professional law (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown the possibility of using LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a,b; Yao et al., 2023a; Huang et al., 2023a), marking an important milestone for LLMs as general-purpose intelligent agents.
In short:
This section describes the role of large foundation models (LLMs and VLMs) in driving Agent AI forward.
● A driving force for general intelligence: LLMs and VLMs are not limited to processing text; they show strong capabilities on many kinds of complex problems, such as mathematical reasoning and legal question answering.
● Beyond language: although they are trained on text data, LLMs’ abilities extend far past language processing and can be applied to much broader domains.
● A prototype of an agent: research shows that LLMs can devise complex action plans for robots and game AI, laying the groundwork for treating LLMs as general-purpose agents.
This passage stresses the generality and versatility of LLMs and VLMs and their potential as key building blocks of agents. By citing a series of research papers, the authors support the proposed concept of “Agent AI” with existing results.
Embodied AI: A number of works leverage LLMs to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), specifically exploiting the LLMs’ WWW-scale domain knowledge and emergent zero-shot embodied abilities to perform complex task planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instructions into a sequence of subtasks, either in natural language form or in Python code, then using a low-level controller to execute these subtasks. Additionally, they incorporate environmental feedback to improve task performance (Huang et al., 2022b; Liang et al., 2022; Wang et al., 2023a; Ikeuchi et al., 2023).
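To make this decompose-then-execute pattern concrete, here is a minimal sketch in Python. The `query_llm` function and `RobotController` class are hypothetical placeholders for a real LLM API and a real low-level controller; only the overall loop structure reflects the pattern described in the works cited above.

```python
# A minimal sketch of LLM-based task decomposition for an embodied agent.
# `query_llm` and `RobotController` are hypothetical stand-ins for a real
# LLM API and a real low-level controller.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a pretrained LLM; returns a canned plan."""
    return "1. locate the cup\n2. grasp the cup\n3. place the cup on the shelf"

class RobotController:
    """Placeholder low-level controller that executes one subtask at a time."""
    def execute(self, subtask: str) -> bool:
        print(f"executing: {subtask}")
        return True  # a real controller would report actual success/failure

def plan_and_execute(instruction: str, objects: list[str]) -> None:
    prompt = ("Decompose the instruction into numbered subtasks, one per line.\n"
              f"Instruction: {instruction}\nKnown objects: {', '.join(objects)}")
    plan = query_llm(prompt)
    # Parse "N. subtask" lines into a list of subtask strings.
    subtasks = [line.split(".", 1)[1].strip()
                for line in plan.splitlines() if line and line[0].isdigit()]
    controller = RobotController()
    for subtask in subtasks:
        if not controller.execute(subtask):
            # Environmental feedback: report the failure so the LLM can replan.
            query_llm(f"Subtask '{subtask}' failed; revise the remaining plan.")
            break

plan_and_execute("put the cup on the shelf", ["cup", "shelf"])
```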
In short:
This section discusses applying LLMs to “embodied AI,” that is, having AI carry out tasks in physical or virtual environments.
● LLMs for task planning: researchers exploit the strong language understanding and reasoning abilities of LLMs to break down complex tasks and generate plans for executing them.
● Task decomposition: an LLM can decompose a natural-language instruction from a human into a series of smaller, easier-to-execute subtasks. These subtasks can be described in natural language or converted into code a computer can run (e.g., Python).
● Integration with robotics: in robotics, the subtask plans generated by an LLM can be handed to the robot’s low-level control system for execution, enabling autonomous action.
● Environmental feedback: some works also stress the importance of environmental feedback. By sensing changes in the environment and adjusting its actions accordingly, the AI can complete tasks more reliably.
The core point is the role of LLMs in embodied AI, especially their ability to plan and decompose tasks, and the way they combine with robotics so that AI can carry out real actions in an environment. The passage also notes the importance of environmental feedback for improving performance.
Interactive Learning: AI agents designed for interactive learning operate using a combination of machine learning techniques and user interactions. Initially, the AI agent is trained on a large dataset. This dataset includes various types of information, depending on the intended function of the agent. For instance, an AI designed for language tasks would be trained on a massive corpus of text data. The training involves using machine learning algorithms, which could include deep learning models like neural networks. The trained models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. The AI agent can also learn from real-time interactions with users. This interactive learning can occur in various ways: 1) Feedback-based learning: The AI adapts its responses based on direct user feedback (Li et al., 2023b; Yu et al., 2023a; Parakh et al., 2023; Zha et al., 2023; Wake et al., 2023a,b,c). For example, if a user corrects the AI’s response, the AI can use this information to improve future responses (Zha et al., 2023; Liu et al., 2023a). 2) Observational learning: The AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in a particular way, the AI might adjust its responses to better suit these patterns. This allows the AI agent to understand and process human language in multimodal settings, interpret cross-reality contexts, and generate responses for human users. Over time, with more user interactions and feedback, the AI agent’s performance generally improves continuously. This process is often supervised by human operators or developers who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
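As a toy illustration of these two learning modes (not the implementation of any cited system), the sketch below keeps a correction store for feedback-based learning and a query counter for observational learning; the base model is faked with a formatted string.

```python
# A toy sketch of feedback-based and observational learning. The base
# model is faked with a formatted string; real systems would wrap an LLM.
from collections import Counter

class InteractiveAgent:
    def __init__(self):
        self.corrections = {}          # feedback-based: user-supplied fixes
        self.query_counts = Counter()  # observational: usage patterns

    def respond(self, query: str) -> str:
        self.query_counts[query] += 1
        if query in self.corrections:
            return self.corrections[query]  # prefer a corrected answer
        return f"(base model answer for: {query})"

    def give_feedback(self, query: str, corrected_answer: str) -> None:
        # Direct user feedback overrides the base response next time.
        self.corrections[query] = corrected_answer

    def frequent_queries(self, n: int = 3):
        # Patterns the agent could adapt its responses to.
        return self.query_counts.most_common(n)

agent = InteractiveAgent()
print(agent.respond("capital of France?"))     # base answer
agent.give_feedback("capital of France?", "Paris")
print(agent.respond("capital of France?"))     # now uses the correction
```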
In short:
This section describes how AI agents learn through interaction with users, mainly in two ways:
● Feedback-based learning: users directly tell the AI what it did well or poorly, and the AI improves from that feedback. For example, if a user corrects a wrong answer, the AI remembers the mistake and avoids repeating it.
● Observational learning: the AI learns by observing user behavior and interaction patterns. For example, if users keep asking the same question, the AI may adjust how it answers to better fit their needs.
This passage stresses the importance of interactive learning for improving an agent’s performance. Through continuous interaction and feedback, the AI can keep learning and improving, better understand human language and behavior, and respond more appropriately in multimodal settings. Human oversight is equally important, to prevent the AI from developing biases or learning incorrect patterns.
1.3 Overview
Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on the understanding of multimodal sensory input. With the advent of Large Language Models (LLMs) and Vision-Language Models (VLMs), numerous MAA systems have been proposed in fields ranging from basic research to applications. While these research areas are growing rapidly by integrating with the traditional technologies of each domain (e.g., visual question answering and vision-language navigation), they share common interests such as data collection, benchmarking, and ethical perspectives. In this paper, we focus on some representative research areas of MAA, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge on the common concerns discussed in these fields. As a result, we expect to learn the fundamentals of MAA and gain insights to further advance its research. Specific learning outcomes include:
● MAA Overview: A deep dive into its principles and roles in contemporary applications, providing researchers with a thorough grasp of its importance and uses.
● Methodologies: Detailed examples of how LLMs and VLMs enhance MAAs, illustrated through case studies in gaming, robotics, and healthcare.
● Performance Evaluation: Guidance on the assessment of MAAs with relevant datasets, focusing on their effectiveness and generalization.
● Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices.
● Emerging Trends and Future Challenges: A categorization of the latest developments in each domain and a discussion of future directions.
Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to become truly valuable to its users, it must be natural to interact with and able to generalize to a broad range of contexts and modalities. We aim to cultivate a vibrant research ecosystem and create a shared sense of identity and purpose among the Agent AI community. MAA has the potential to be widely applicable across various contexts and modalities, including input from humans. Therefore, we believe this Agent AI area can engage a diverse range of researchers, fostering a dynamic Agent AI community with shared goals. Led by esteemed experts from academia and industry, we expect that this paper will be an interactive and enriching experience, complete with agent instruction, case studies, task sessions, and experiment discussions, ensuring a comprehensive and engaging learning experience for all researchers.
This paper aims to provide general and comprehensive knowledge about the current research in the field of Agent AI. To this end, the rest of the paper is organized as follows. Section 2 outlines how Agent AI benefits from integrating with related emerging technologies, particularly large foundation models. Section 3 describes a new paradigm and framework that we propose for training Agent AI. Section 4 provides an overview of the methodologies that are widely used in the training of Agent AI. Section 5 categorizes and discusses various types of agents. Section 6 introduces Agent AI applications in gaming, robotics, and healthcare. Section 7 explores the research community’s efforts to develop a versatile Agent AI, capable of being applied across various modalities and domains, and of bridging the sim-to-real gap. Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models, but also continuously learns and self-improves by leveraging interactions with the environment and users. Section 9 introduces our new datasets that are designed for the training of multimodal Agent AI. Section 11 discusses the hot topics of ethical considerations for AI agents, the limitations, and the societal impact of our paper.
In short:
This passage introduces the concept of “Multimodal Agent AI (MAA)” and outlines the paper’s main content and goals.
● Definition of MAA: MAA is a system that takes effective actions in an environment based on multimodal inputs (e.g., vision, hearing, language).
● The push from LLMs/VLMs: progress in LLMs and VLMs has driven rapid development in the MAA field.
● Research focus: the paper concentrates on representative MAA areas, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and examines concerns shared across them, such as data collection, benchmarking, and ethics.
● Learning goals: the paper aims to give readers a comprehensive understanding of MAA’s fundamentals, methods, evaluation, ethical considerations, and future trends.
● Generalist agents (GAs): the paper stresses the importance of generalist agents, which need to interact naturally with humans and adapt to a wide variety of environments and modalities.
● Community building: the authors hope the paper will foster the development of the Agent AI research community and build shared goals and a shared identity.
● Paper structure: the remaining sections cover MAA’s integration with emerging technologies, a new training paradigm and framework, common methods, agent types, application domains, versatile multimodal agents, continual learning, new datasets, and ethical considerations.
This passage clearly lays out the concept, focus, and goals of MAA and gives readers a roadmap for the paper. It highlights the importance of multimodal input, the role of LLMs/VLMs, generality, ethics, and community building in the MAA field.
2 Agent AI Integration
Foundation models based on LLMs and VLMs, as proposed in previous research, still exhibit limited performance in the area of embodied AI, particularly in terms of understanding, generating, editing, and interacting within unseen environments or scenarios (Huang et al., 2023a; Zeng et al., 2023). Consequently, these limitations lead to sub-optimal outputs from AI agents. Current agent-centric AI modeling approaches focus on directly accessible and clearly defined data (e.g. text or string representations of the world state) and generally use domain- and environment-independent patterns learned from their large-scale pretraining to predict action outputs for each environment (Xi et al., 2023; Wang et al., 2023c; Gong et al., 2023a; Wu et al., 2023). In (Huang et al., 2023a), we investigate the task of knowledge-guided collaborative and interactive scene generation by combining large foundation models, and show promising results indicating that knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, alongside other human-agent interactions (Huang et al., 2023a). By integrating an Agent AI framework, large foundation models are able to more deeply understand user input and form a complex and adaptive HCI system. The emergent abilities of LLMs and VLMs operate implicitly across generative AI, embodied AI, knowledge augmentation for multimodal learning, mixed-reality generation, text-to-vision editing, and human interaction in 2D/3D simulation for gaming or robotics tasks. Recent progress in foundation models for Agent AI presents an imminent catalyst for unlocking general intelligence in embodied agents. Large action models, or agent-vision-language models, open new possibilities for general-purpose embodied systems, such as planning, problem-solving, and learning in complex environments. Agent AI takes a further step toward the metaverse and charts a route toward an early version of AGI.
2.1 Infinite AI agent
AI agents have the capacity to interpret, predict, and respond based on their training and input data. While these capabilities are advanced and continually improving, it’s important to recognize their limitations and the influence of the underlying data they are trained on. AI agent systems generally possess the following abilities: 1) Predictive Modeling: AI agents can predict likely outcomes or suggest next steps based on historical data and trends. For instance, they might predict the continuation of a text, the answer to a question, the next action for a robot, or the resolution of a scenario. 2) Decision Making: In some applications, AI agents can make decisions based on their inferences. Generally, the agent will base their decision on what is most likely to achieve a specified goal. For AI applications like recommendation systems, an agent can decide what products or content to recommend based on its inferences about user preferences. 3) Handling Ambiguity: AI agents can often handle ambiguous input by inferring the most likely interpretation based on context and training. However, their ability to do so is limited by the scope of their training data and algorithms. 4) Continuous Improvement: While some AI agents have the ability to learn from new data and interactions, many large language models do not continuously update their knowledge-base or internal representation after training. Their inferences are usually based solely on the data that was available up to the point of their last training update.
We show augmented interactive agents for multimodal and cross-reality-agnostic integration with an emergence mechanism in Fig. 2. An AI agent requires collecting extensive training data for every new task, which can be costly or impossible for many domains. In this study, we develop an infinite agent that learns to transfer memory information from general foundation models (e.g., GPT-X, DALL-E) to novel domains or scenarios for scene understanding, generation, and interactive editing in physical or virtual worlds.

Figure 2: The multimodal Agent AI for 2D/3D embodied generation and editing interaction in cross-reality.
An application of such an infinite agent in robotics is RoboGen (Wang et al., 2023d). In this study, the authors propose a pipeline that autonomously runs cycles of task proposition, environment generation, and skill learning. RoboGen is an effort to transfer the knowledge embedded in large models to robotics.
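The cycle RoboGen automates can be summarized schematically as follows. This is our own schematic reading of the propose-generate-learn loop, with all three stage functions as hypothetical placeholders, not the authors' implementation.

```python
# A schematic reading of the propose-generate-learn cycle; every function
# here is a hypothetical placeholder, not the RoboGen implementation.

def propose_task() -> str:
    # An LLM would propose a task grounded in its world knowledge.
    return "place the mug on the shelf"

def generate_environment(task: str) -> dict:
    # A generative model would synthesize a matching simulation scene.
    return {"task": task, "scene": "kitchen tabletop"}

def learn_skill(env: dict) -> float:
    # An RL/IL learner would train a policy and report a success rate.
    return 0.9

skill_library = {}
for cycle in range(3):
    task = propose_task()
    env = generate_environment(task)
    skill_library[task] = learn_skill(env)
    print(f"cycle {cycle}: learned {task!r} (success {skill_library[task]:.2f})")
```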
2.2 Agent AI with Large Foundation Models
Recent studies have indicated that large foundation models play a crucial role in creating data that act as benchmarks for determining the actions of agents within environment-imposed constraints. Examples include using foundation models for robotic manipulation (Black et al., 2023; Ko et al., 2023) and navigation (Shah et al., 2023a; Zhou et al., 2023a). To illustrate, Black et al. employed an image-editing model as a high-level planner to generate images of future sub-goals, thereby guiding low-level policies (Black et al., 2023). For robot navigation, Shah et al. proposed a system that employs an LLM to identify landmarks from text and a VLM to associate these landmarks with visual inputs, enhancing navigation through natural language instructions (Shah et al., 2023a).
There is also growing interest in the generation of conditioned human motions in response to language and environmental factors. Several AI systems have been proposed to generate motions and actions that are tailored to specific linguistic instructions (Kim et al., 2023; Zhang et al., 2022; Tevet et al., 2022) and to adapt to various 3D scenes (Wang et al., 2022a). This body of research emphasizes the growing capabilities of generative models in enhancing the adaptability and responsiveness of AI agents across diverse scenarios.
2.2.1 Hallucinations
Agents that generate text are often prone to hallucinations, which are instances where the generated text is nonsensical or unfaithful to the provided source content (Raunak et al., 2021; Maynez et al., 2020). Hallucinations can be split into two categories, intrinsic and extrinsic (Ji et al., 2023). Intrinsic hallucinations are hallucinations that are contradictory to the source material, whereas extrinsic hallucinations are when the generated text contains additional information that was not originally included in the source material.
Some promising routes for reducing the rate of hallucination in language generation involve using retrieval-augmented generation (Lewis et al., 2020; Shuster et al., 2021) or other methods for grounding natural language outputs via external knowledge retrieval (Dziri et al., 2021; Peng et al., 2023). Generally, these methods seek to augment language generation by retrieving additional source material and by providing mechanisms to check for contradictions between the generated response and the source material.
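As a deliberately naive sketch of this retrieve-then-check pattern, the snippet below retrieves the best-overlapping source passage and flags generated sentences with insufficient lexical support; real systems would use dense retrievers and trained verifiers instead of keyword overlap.

```python
# A naive retrieve-then-check sketch. Keyword overlap stands in for both
# the retriever and the contradiction/support verifier of real systems.

SOURCES = [
    "The Dartmouth Conference took place in 1956.",
    "Minsky's group built the Copy Demo robot in 1970.",
]

def overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(query: str, k: int = 1) -> list[str]:
    return sorted(SOURCES, key=lambda s: overlap(query, s), reverse=True)[:k]

def is_grounded(sentence: str, evidence: list[str], threshold: int = 3) -> bool:
    # Flag sentences with too little lexical support in the evidence.
    return any(overlap(sentence, e) >= threshold for e in evidence)

evidence = retrieve("When did the Dartmouth Conference take place?")
candidate = "The Dartmouth Conference took place in 1956."
print(evidence)
print("grounded" if is_grounded(candidate, evidence) else "possible hallucination")
```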
Within the context of multi-modal agent systems, VLMs have been shown to hallucinate as well (Zhou et al., 2023b). One common cause of hallucination for vision-based language-generation is due to the over-reliance on co-occurrence of objects and visual cues in the training data (Rohrbach et al., 2018). AI agents that exclusively rely upon pretrained LLMs or VLMs and use limited environment-specific finetuning can be particularly vulnerable to hallucinations since they rely upon the internal knowledge-base of the pretrained models for generating actions and may not accurately understand the dynamics of the world state in which they are deployed.
2.2.2 Biases and Inclusivity
AI agents based on LLMs or LMMs (large multimodal models) have biases due to several factors inherent in their design and training process. When designing these AI agents, we must be mindful of being inclusive and aware of the needs of all end users and stakeholders. In the context of AI agents, inclusivity refers to the measures and principles employed to ensure that the agent’s responses and interactions are inclusive, respectful, and sensitive to a wide range of users from diverse backgrounds. We list key aspects of agent biases and inclusivity below.
● Training Data: Foundation models are trained on vast amounts of text data collected from the internet, including books, articles, websites, and other text sources. This data often reflects the biases present in human society, and the model can inadvertently learn and reproduce these biases. This includes stereotypes, prejudices, and slanted viewpoints related to race, gender, ethnicity, religion, and other personal attributes. In particular, by training on internet data and often only English text, models implicitly learn the cultural norms of Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies (Henrich et al., 2010) who have a disproportionately large internet presence. However, it is essential to recognize that datasets created by humans cannot be entirely devoid of bias, since they frequently mirror the societal biases and the predispositions of the individuals who generated and/or compiled the data initially.
●Historical and Cultural Biases: AI models are trained on large datasets sourced from diverse content. Thus, the training data often includes historical texts or materials from various cultures. In particular, training data from historical sources may contain offensive or derogatory language representing a particular society’s cultural norms, attitudes, and prejudices. This can lead to the model perpetuating outdated stereotypes or not fully understanding contemporary cultural shifts and nuances.
●Language and Context Limitations: Language models might struggle with understanding and accurately representing nuances in language, such as sarcasm, humor, or cultural references. This can lead to misinterpre- tations or biased responses in certain contexts. Furthermore, there are many aspects of spoken language that are not captured by pure text data, leading to a potential disconnect between human understanding of language and how models understand language.
●Policies and Guidelines: AI agents operate under strict policies and guidelines to ensure fairness and inclusivity. For instance, in generating images, there are rules to diversify depictions of people, avoiding stereotypes related to race, gender, and other attributes.
●Overgeneralization: These models tend to generate responses based on patterns seen in the training data. This can lead to overgeneralizations, where the model might produce responses that seem to stereotype or make broad assumptions about certain groups.
● Constant Monitoring and Updating: AI systems are continuously monitored and updated to address any emerging biases or inclusivity issues. Feedback from users and ongoing research in AI ethics play a crucial role in this process.
● Amplification of Dominant Views: Since the training data often includes more content from dominant cultures or groups, the model may be more biased towards these perspectives, potentially underrepresenting or misrepresenting minority viewpoints.
● Ethical and Inclusive Design: AI tools should be designed with ethical considerations and inclusivity as core principles. This includes respecting cultural differences, promoting diversity, and ensuring that the AI does not perpetuate harmful stereotypes.
● User Guidelines: Users are also guided on how to interact with AI in a manner that promotes inclusivity and respect. This includes refraining from requests that could lead to biased or inappropriate outputs. Furthermore, it can help mitigate models learning harmful material from user interactions.
Despite these measures, AI agents still exhibit biases. Ongoing efforts in agent AI research and development are focused on further reducing these biases and enhancing the inclusivity and fairness of agent AI systems. Efforts to Mitigate Biases:
●Diverse and Inclusive Training Data: Efforts are made to include a more diverse and inclusive range of sources in the training data.
●Bias Detection and Correction: Ongoing research focuses on detecting and correcting biases in model responses.
●Ethical Guidelines and Policies: Models are often governed by ethical guidelines and policies designed to mitigate biases and ensure respectful and inclusive interactions.
●Diverse Representation: Ensuring that the content generated or the responses provided by the AI agent represent a wide range of human experiences, cultures, ethnicities, and identities. This is particularly relevant in scenarios like image generation or narrative construction.
●Bias Mitigation: Actively working to reduce biases in the AI’s responses. This includes biases related to race, gender, age, disability, sexual orientation, and other personal characteristics. The goal is to provide fair and balanced responses that do not perpetuate stereotypes or prejudices.
● Cultural Sensitivity: The AI is designed to be culturally sensitive, acknowledging and respecting the diversity of cultural norms, practices, and values. This includes understanding and appropriately responding to cultural references and nuances.
● Accessibility: Ensuring that the AI agent is accessible to users with different abilities, including those with disabilities. This can involve incorporating features that make interactions easier for people with visual, auditory, motor, or cognitive impairments.
●Language-based Inclusivity: Providing support for multiple languages and dialects to cater to a global user base, and being sensitive to the nuances and variations within a language (Liu et al., 2023b).
●Ethical and Respectful Interactions: The Agent is programmed to interact ethically and respectfully with all users, avoiding responses that could be deemed offensive, harmful, or disrespectful.
●User Feedback and Adaptation: Incorporating user feedback to continually improve the inclusivity and effectiveness of the AI agent. This includes learning from interactions to better understand and serve a diverse user base.
●Compliance with Inclusivity Guidelines: Adhering to established guidelines and standards for inclusivity in AI agents, which are often set by industry groups, ethical boards, or regulatory bodies.
Despite these efforts, it’s important to be aware of the potential for biases in responses and to interpret them with critical thinking. Continuous improvements in AI agent technology and ethical practices aim to reduce these biases over time. One of the overarching goals for inclusivity in agent AI is to create an agent that is respectful and accessible to all users, regardless of their background or identity.
2.2.3 Data Privacy and Usage
One key ethical consideration of AI agents involves comprehending how these systems handle, store, and potentially retrieve user data. We discuss key aspects below:
Data Collection, Usage and Purpose. When using user data to improve model performance, model developers access the data the AI agent has collected while in production and interacting with users. Some systems allow users to view their data through user accounts or by making a request to the service provider. It is important to recognize what data the AI agent collects during these interactions. This could include text inputs, user usage patterns, personal preferences, and sometimes more sensitive personal information. Users should also understand how the data collected from their interactions is used. If, for some reason, the AI holds incorrect information about a particular person or group, there should be a mechanism for users to help correct this once identified. This is important for both accuracy and to be respectful of all users and groups. Common uses for retrieving and analyzing user data include improving user interaction, personalizing responses, and system optimization. It is extremely important for developers to ensure the data is not used for purposes that users have not consented to, such as unsolicited marketing.
Storage and Security. Developers should know where the user interaction data is stored and what security measures are in place to protect it from unauthorized access or breaches. This includes encryption, secure servers, and data protection protocols. It is extremely important to determine if agent data is shared with third parties and under what conditions. This should be transparent and typically requires user consent.
Data Deletion and Retention. It is also important for users to understand how long user data is stored and how users can request its deletion. Many data protection laws give users the right to be forgotten, meaning they can request their data be erased. AI agents must adhere to data protection laws like GDPR in the EU or CCPA in California. These laws govern data handling practices and user rights regarding their personal data.
Data Portability and Privacy Policy. Furthermore, developers must create the AI agent’s privacy policy to document and explain to users how their data is handled. This should detail data collection, usage, storage, and user rights. Developers should ensure that they obtain user consent for data collection, especially for sensitive information. Users typically have the option to opt-out or limit the data they provide. In some jurisdictions, users may even have the right to request a copy of their data in a format that can be transferred to another service provider.
Anonymization. For data used in broader analysis or AI training, it should ideally be anonymized to protect individual identities. Developers must understand how their AI agent retrieves and uses historical user data during interactions. This could be for personalization or improving response relevance.
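As one concrete illustration of such an anonymization step (a toy sketch, not a production technique), the snippet below redacts two common identifier patterns from interaction logs before they are reused; real pipelines rely on far more thorough PII-detection tooling.

```python
# A toy anonymization pass over interaction logs before reuse in training.
# Pattern-based redaction is illustrative only; production pipelines use
# dedicated PII-detection tooling.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def anonymize(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(anonymize("Contact me at alice@example.com or +1 415 555 0100."))
# -> Contact me at <EMAIL> or <PHONE>.
```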
In summary, understanding data privacy for AI agents involves being aware of how user data is collected, used, stored, and protected, and ensuring that users understand their rights regarding accessing, correcting, and deleting their data. Awareness of the mechanisms for data retrieval, both by users and the AI agent, is also crucial for a comprehensive understanding of data privacy.
2.2.4 Interpretability and Explainability
Imitation Learning → Decoupling. Agents are typically trained using a continuous feedback loop in Reinforcement Learning (RL) or Imitation Learning (IL), starting with a randomly initialized policy. However, this approach faces challenges in obtaining initial rewards in unfamiliar environments, particularly when rewards are sparse or only available at the end of a long-step interaction. Thus, a superior solution is to use an infinite-memory agent trained through IL, which can learn policies from expert data, improving exploration and utilization of unseen environmental space with emergent infrastructure as shown in Fig. 3. Expert characteristics help the agent explore better and utilize the unseen environmental space, and Agent AI can learn policies and new paradigm flows directly from expert data.
Traditional IL has an agent mimicking an expert demonstrator’s behavior to learn a policy. However, learning the expert policy directly may not always be the best approach, as the agent may not generalize well to unseen situations. To tackle this, we propose learning an agent with an in-context prompt or an implicit reward function that captures key aspects of the expert’s behavior, as shown in Fig. 3. This equips the infinite-memory agent with physical-world behavior data for task execution, learned from expert demonstrations. It helps overcome existing imitation learning drawbacks like the need for extensive expert data and potential errors in complex tasks. The key idea behind Agent AI has two parts: 1) the infinite agent that collects physical-world expert demonstrations as state-action pairs and 2) the virtual environment that imitates the agent generator. The imitating agent produces actions that mimic the expert’s behavior, while the agent learns a policy mapping from states to actions by reducing a loss function of the disparity between the expert’s actions and the actions generated by the learned policy.
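At its core, this policy learning step is supervised learning on expert state-action pairs. The sketch below is a minimal behavioral-cloning example, assuming continuous actions and a linear policy; it illustrates the disparity loss described above, not the infinite-memory architecture itself.

```python
# Minimal behavioral cloning: fit a linear policy to expert (state, action)
# pairs by gradient descent on the mean-squared disparity between expert
# actions and the learned policy's actions. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))       # states from expert demonstrations
expert_W = rng.normal(size=(4, 2))
actions = states @ expert_W              # expert actions to imitate

W = np.zeros((4, 2))                     # learned policy parameters
lr = 0.05
for step in range(200):
    pred = states @ W                    # actions from the learned policy
    grad = 2 * states.T @ (pred - actions) / len(states)
    W -= lr * grad

loss = np.mean((states @ W - actions) ** 2)  # disparity loss from the text
print(f"final imitation loss: {loss:.6f}")
```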

Decoupling → Generalization. Rather than relying on a task-specific reward function, the agent learns from expert demonstrations, which provide a diverse set of state-action pairs covering various task aspects. The agent then learns a policy that maps states to actions by imitating the expert’s behavior. Decoupling in imitation learning refers to separating the learning process from the task-specific reward function, allowing the policy to generalize across different tasks without explicit reliance on the task-specific reward function. By decoupling, the agent can learn from expert demonstrations and learn a policy that is adaptable to a variety of situations. Decoupling enables transfer learning, where a policy learned in one domain can adapt to others with minimal fine-tuning. By learning a general policy that is not tied to a specific reward function, the agent can leverage the knowledge it acquired in one task to perform well in other related tasks. Since the agent does not rely on a specific reward function, it can adapt to changes in the reward function or environment without the need for significant retraining. This makes the learned policy more robust and generalizable across different environments. Decoupling in this context refers to the separation of two tasks in the learning process: learning the reward function and learning the optimal policy.
Generalization → Emergent Behavior. Generalization explains how emergent properties or behaviors can arise from simpler components or rules. The key idea lies in identifying the basic elements or rules that govern the behavior of the system, such as individual neurons or basic algorithms, and then observing how these simple components or rules interact with one another. These interactions often lead to the emergence of complex behaviors, which are not predictable by examining individual components alone. Generalization across different levels of complexity allows a system to learn general principles applicable across these levels, leading to emergent properties. This enables the system to adapt to new situations, demonstrating the emergence of more complex behaviors from simpler rules. Furthermore, the ability to generalize across different complexity levels facilitates knowledge transfer from one domain to another, which contributes to the emergence of complex behaviors in new contexts as the system adapts.
2.2.5 Inference Augmentation
The inference ability of an AI agent lies in its capacity to interpret, predict, and respond based on its training and input data. While these capabilities are advanced and continually improving, it’s important to recognize their limitations and the influence of the underlying data they are trained on. Particularly, in the context of large language models, inference refers to the capacity to draw conclusions, make predictions, and generate responses based on the data the model has been trained on and the input it receives. Inference augmentation in AI agents refers to enhancing the AI’s natural inference abilities with additional tools, techniques, or data to improve its performance, accuracy, and utility. This can be particularly important in complex decision-making scenarios or when dealing with nuanced or specialized content. We describe particularly important sources for inference augmentation below:
Data Enrichment. Incorporating additional, often external, data sources to provide more context or background can help the AI agent make more informed inferences, especially in areas where its training data may be limited. For example, AI agents can infer meaning from the context of a conversation or text. They analyze the given information and use it to understand the intent and relevant details of user queries. These models are proficient at recognizing patterns in data. They use this ability to make inferences about language, user behavior, or other relevant phenomena based on the patterns they’ve learned during training.
Algorithm Enhancement. Improving the AI’s underlying algorithms to make better inferences. This could involve using more advanced machine learning models, integrating different types of AI (like combining NLP with image recognition), or updating algorithms to better handle complex tasks. Inference in language models involves understanding and generating human language. This includes grasping nuances like tone, intent, and the subtleties of different linguistic constructions.
Human-in-the-Loop (HITL). Involving human input to augment the AI’s inferences can be particularly useful in areas where human judgment is crucial, such as ethical considerations, creative tasks, or ambiguous scenarios. Humans can provide guidance, correct errors, or offer insights that the agent would not be able to infer on its own.
Real-Time Feedback Integration. Using real-time feedback from users or the environment to enhance inferences is another promising method for improving performance during inference. For example, an AI might adjust its recommendations based on live user responses or changing conditions in a dynamic system. Or, if the agent is taking actions in a simulated environment that break certain rules, the agent can be dynamically given feedback to help correct itself.
Cross-Domain Knowledge Transfer. Leveraging knowledge or models from one domain to improve inferences in another can be particularly helpful when producing outputs within a specialized discipline. For instance, techniques developed for language translation might be applied to code generation, or insights from medical diagnostics could enhance predictive maintenance in machinery.
Customization for Specific Use Cases. Tailoring the AI’s inference capabilities for particular applications or industries can involve training the AI on specialized datasets or fine-tuning its models to better suit specific tasks, such as legal analysis, medical diagnosis, or financial forecasting. Since the particular language or information within one domain can greatly contrast with the language from other domains, it can be beneficial to finetune the agent on domain-specific information.
Ethical and Bias Considerations. It is important to ensure that the augmentation process does not introduce new biases or ethical issues. This involves careful consideration of the sources of additional data or the impact of the new inference augmentation algorithms on fairness and transparency. When making inferences, especially about sensitive topics, AI agents must sometimes navigate ethical considerations. This involves avoiding harmful stereotypes, respecting privacy, and ensuring fairness.
Continuous Learning and Adaptation. Regularly updating and refining the AI’s capabilities to keep up with new developments, changing data landscapes, and evolving user needs.
In summary, inference augmentation in AI agents involves methods in which their natural inference abilities can be enhanced through additional data, improved algorithms, human input, and other techniques. Depending on the use case, this augmentation is often essential for dealing with complex tasks and ensuring accuracy in the agent’s outputs.
2.2.6 Regulation
Recently, Agent AI has made significant advancements, and its integration into embodied systems has opened new possibilities for interacting with agents via more immersive, dynamic, and engaging experiences. To expedite this process and ease the cumbersome work of Agent AI development, we propose developing a next-generation AI-empowered pipeline for agent interaction: a human-machine collaboration system where humans and machines can communicate and interact meaningfully. The system can leverage an LLM’s or VLM’s dialog capabilities and vast action knowledge to talk with human players and identify their needs, and then perform appropriate actions to help human players upon request.
When employing LLM/VLMs for a human-machine collaboration system, it is essential to note that these operate as black boxes, generating unpredictable output. This uncertainty can become crucial in a physical setup, such as operating actual robotics. An approach to address this challenge is constraining the focus of the LLM/VLM through prompt engineering. For instance, in robotic task planning from instructions, providing environmental information within the prompt has been reported to yield more stable outputs than relying solely on text (Gramopadhye and Szafir, 2022). This

Figure 4: A robot teaching system developed in (Wake et al., 2023c). (Left) The system workflow. The process involves three steps: Task planning, where ChatGPT plans robotic tasks from instructions and environmental information; Demonstration, where the user visually demonstrates the action sequence. All the steps are reviewed by the user, and if any step fails or shows deficiencies, the previous steps can be revisited as necessary. (Right) A web application that enables uploading of demonstration data and the interaction between the user and ChatGPT.
report is supported by Minsky’s frame theory of AI (Minsky, 1975), suggesting that the problem space to be solved by LLM/VLMs is defined by the given prompts. Another approach is designing prompts that make LLM/VLMs include explanatory text, allowing users to understand what the model has focused on or recognized. Additionally, implementing a higher layer that allows for pre-execution verification and modification under human guidance can facilitate the operation of systems working under such guidance (Fig. 4).
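To illustrate the prompt-engineering approach described above, here is a minimal sketch in Python of embedding environmental information in a task-planning prompt. The skill names, object list, and model choice are illustrative assumptions, not the cited authors' implementation.

```python
# A minimal sketch of constraining an LLM task planner with environmental
# information in the prompt, in the spirit of the approach described above.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a robot task planner. Decompose the instruction into a numbered "
    "sequence of steps. Use ONLY the listed skills and ONLY the listed objects."
)

def plan(instruction: str, objects: list[str], skills: list[str]) -> str:
    # Embedding the observable environment in the prompt narrows the problem
    # space the model must reason over, which tends to stabilize outputs.
    env_context = (
        f"Objects visible in the scene: {', '.join(objects)}\n"
        f"Available skills: {', '.join(skills)}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; any capable chat model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{env_context}\n\nInstruction: {instruction}"},
        ],
    )
    return response.choices[0].message.content

# Example usage:
# plan("put the apple in the fridge", ["apple", "fridge", "table"],
#      ["move_to(obj)", "grasp(obj)", "open(obj)", "place(obj, dest)"])
```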
### 2.2.6 规范
最近,代理AI已经取得了显著的进步,其与具身系统的集成开启了通过更加沉浸式、动态和吸引人的体验与代理互动的新可能性。为了加快这一过程并减轻代理AI开发中的繁琐工作,我们提议开发下一代由AI赋能的代理交互管道。开发一个人机协作系统,在这个系统中人类和机器可以进行有意义的沟通和互动。该系统可以利用大型语言模型(LLM)或视觉语言模型(VLM)的对话能力以及广泛的动作来与人类玩家交流,并识别他们的需求。然后根据请求执行适当的行动以帮助人类玩家。
在为一个人机协作系统部署LLM/VLM时,需要注意的是这些模型作为黑箱操作,会产生不可预测的输出。这种不确定性在物理设置中尤为重要,例如实际操作机器人时。解决这一挑战的一个方法是通过提示工程来约束LLM/VLM的关注点。例如,在从指令进行机器人任务规划时,有报道指出在提示中提供环境信息比单纯依赖文本能产生更稳定的输出(Gramopadhye 和 Szafir, 2022)。这一报告得到了明斯基的人工智能框架理论的支持(Minsky, 1975),表明LLM/VLM要解决的问题空间是由给定的提示定义的。
图4展示了一个由(Wake等人,2023c)开发的机器人教学系统。(左图)显示了系统的流程,涉及三个步骤:任务规划,其中ChatGPT根据指令和环境信息规划机器人任务;演示,用户以可视化的方式演示动作序列。所有步骤都由用户审查,如果任何步骤失败或显示出不足,可以根据需要重新审视之前的步骤。(右图)是一个允许上传演示数据以及用户与ChatGPT之间互动的网络应用程序。
另一种方法是设计提示,使LLM/VLM包含解释性文本,让用户了解模型关注的内容或识别的信息。此外,实施一个更高层次的结构,允许在人类指导下进行预执行验证和修改,可以促进此类指导下的系统操作(如图4所示)。这种方法不仅提高了系统的可靠性和安全性,还增强了用户体验,使得人机协作更加流畅和高效。
2.3 Agent AI for Emergent Abilities(代理AI与涌现能力)

Despite the growing adoption of interactive agent AI systems, the majority of proposed methods still face a challenge in terms of their generalization performance in unseen environments or scenarios. Current modeling practices require developers to prepare large datasets for each domain to finetune/pretrain models; however, this process is costly and even impossible if the domain is new. To address this issue, we build interactive agents that leverage the knowledge-memory of general-purpose foundation models (ChatGPT, Dall-E, GPT-4, etc.) for a novel scenario, specifically for generating a collaboration space between humans and agents. We discover an emergent mechanism—which we name Mixed Reality with Knowledge Inference Interaction—that facilitates collaboration with humans to solve challenging tasks in complex real-world environments and enables the exploration of unseen environments for adaptation to virtual reality. For this mechanism, the agent learns i) micro-reactions in cross-modality: collecting relevant individual knowledge for each interaction task (e.g., understanding unseen scenes) from explicit web sources and by implicitly inferring from the outputs of pretrained models; ii) macro-behaviors that are reality-agnostic: improving interactive dimensions and patterns in the language and multi-modality domains, and making changes based on characterized roles, certain target variables, and the diversification of collaborative information in mixed reality and LLMs. We investigate the task of knowledge-guided, interactively synergistic collaborative scene generation by combining various OpenAI models, and show promising results on how the interactive agent system can further boost large foundation models in our setting. It integrates and improves the depth of generalization, awareness, and interpretability of complex adaptive AI systems.
### 2.3 代理AI与涌现能力
尽管互动式代理AI系统越来越受到欢迎,但大多数提出的方法在面对未见过的环境或场景时,在泛化性能方面仍面临挑战。当前的建模实践要求开发者为每个领域准备大型数据集以微调或预训练模型;然而,这个过程既昂贵,对于新领域来说甚至不可能实现。为了解决这个问题,我们构建了能够利用通用基础模型(如ChatGPT、Dall-E、GPT-4等)知识记忆的互动代理,特别针对创建人类与代理之间的协作空间这一新颖情境。
我们发现了一种名为“结合知识推断交互的混合现实”的涌现机制——这种机制促进了人类和代理之间解决复杂现实世界环境中具有挑战性任务的合作,并允许探索未知环境以适应虚拟现实。在这种机制下,代理学习了两种主要的行为:一是跨模态中的微观反应——从明确的网络资源收集与每项互动任务相关的个体知识(例如理解未曾见过的场景),并通过预训练模型的输出隐式地推断;二是现实无关中的宏观行为——在语言和多模态领域中改进互动维度和模式,并根据特定角色、目标变量的变化以及混合现实中合作信息的多样化影响做出调整。
我们研究了知识引导的互动协同效应任务,旨在通过结合各种OpenAI模型来生成协作场景,并展示了互动代理系统如何进一步提升大型基础模型在我们设定中的表现。它整合并深化了复杂自适应AI系统的泛化能力、意识性和可解释性。这种方法不仅提高了系统的灵活性和响应速度,还为开发更加智能和用户友好的人工智能应用铺平了道路。
这种新兴的代理AI方法强调了在没有先验经验的情况下对新环境的学习和适应能力,这对于推动人工智能向更广泛的应用场景扩展至关重要。通过将不同的信息源和预训练模型结合起来,代理AI能够在理解和处理复杂的实际问题上达到新的高度,同时保持对用户需求的高度敏感和响应性。这样,代理AI不仅成为了解决问题的强大工具,也为未来的人机协作提供了无限可能。
3 Agent AI Paradigm(代理人工智能范式)
In this section, we discuss a new paradigm and framework for training Agent AI. We seek to accomplish several goals with our proposed framework:
●Make use of existing pre-trained models and pre-training strategies to effectively bootstrap our agents with a robust understanding of important modalities, such as text or visual inputs.
● Support sufficient long-term task-planning capabilities.
● Incorporate a framework for memory that allows for learned knowledge to be encoded and retrieved later.
●Allow for environmental feedback to be used to effectively train the agent to learn which actions to take.
● We show a high-level new agent diagram outlining the important submodules of such a system in Fig. 5.
3 代理人工智能范式
在本节中,我们讨论了一种用于训练代理人工智能的新范式和框架。我们希望通过所提出的框架实现以下几个目标:
● 利用现有的预训练模型和预训练策略,有效地为我们的代理奠定基础,使其能够对重要的模态(例如文本或视觉输入)形成深刻理解。
● 支持足够的长期任务规划能力。
● 融入一个内存框架,使得已学到的知识能够被编码并在之后检索。
● 允许利用环境反馈来有效训练代理,从而使其学会采取适当的行动。
● 我们在图 5 中展示了一个高级代理示意图,概述了该系统各个重要子模块的结构。

3.1 LLMs and VLMs
We can use the LLM or VLM model to bootstrap the components of the Agent as shown in Fig. 5. In particular, LLMs have been shown to perform well for task-planning (Gong et al., 2023a), contain significant world knowledge (Yu et al., 2023b), and display impressive logical reasoning capabilities (Creswell et al., 2022). Additionally, VLMs such as CLIP (Radford et al., 2021) provide a general visual encoder that is language-aligned, as well as providing zero-shot visual recognition capabilities. For example, state-of-the-art open-source multi-modal models such as LLaVA (Liu et al., 2023c) and InstructBLIP (Dai et al., 2023) rely upon frozen CLIP models as visual encoders.
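As a concrete illustration of the zero-shot visual recognition that makes CLIP useful as a language-aligned encoder, the following sketch scores an image against candidate text labels with the Hugging Face `transformers` CLIP interface; the image path and labels are placeholders.

```python
# A minimal sketch of CLIP's zero-shot visual recognition: no task-specific
# training, just image-text similarity in a shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # illustrative input image
labels = ["a photo of a kitchen", "a photo of an office", "a photo of a street"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; a softmax turns them
# into zero-shot class probabilities an agent can act on.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```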
3.1 LLMs and VLMs
我们可以利用大型语言模型 (LLM) 或视觉语言模型 (VLM) 来引导构建人工智能主体的各个组件,如图 5 所示。 特别是,大型语言模型已被证明在任务规划方面表现出色 (Gong et al., 2023a),蕴含着丰富的世界知识 (Yu et al., 2023b),并展现出令人印象深刻的逻辑推理能力 (Creswell et al., 2022)。 此外,像 CLIP (Radford et al., 2021) 这样的视觉语言模型提供了一个通用的、与语言对齐的视觉编码器,以及零样本视觉识别能力。 例如,最先进的开源多模态模型,如 LLaVA (Liu et al., 2023c) 和 InstructBLIP (Dai et al., 2023),都依赖于冻结的 CLIP 模型作为视觉编码器。
3.2 Agent Transformer Definition
Instead of using frozen LLMs and VLMs for the AI agent, it is also possible to use a single agent transformer model that takes visual tokens and language tokens as input, similar to Gato (Reed et al., 2022). In addition to vision and language, we add a third general type of input, which we denote as agent tokens. Conceptually, agent tokens are used to reserve a specific subspace of the input and output space of the model for agentic behaviors. For robotics or game playing, this may be represented as the input action space of the controller. Agent tokens can also be used when training agents to use specific tools, such as image-generation or image-editing models, or for other API calls. As shown in Fig. 7, we can combine the agent tokens with visual and language tokens to generate a unified interface for training multi-modal agent AI. Compared to using large, proprietary LLMs as agents, there are several advantages to using an agent transformer. Firstly, the model can be easily customized to very specific agentic tasks that may be difficult to represent in natural language (e.g., controller inputs or other specific actions). Thus, the agent can learn from environmental interactions and domain-specific data to improve performance. Secondly, it can be easier to understand why the model does or does not take specific actions by having access to the probabilities of the agent tokens. Thirdly, there are certain domains, such as healthcare and law, that have strict data privacy requirements, which a smaller agent transformer trained and hosted locally can satisfy more readily than a proprietary model accessed through external APIs. Finally, a relatively smaller agent transformer can potentially be significantly cheaper than a larger proprietary language model.
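The following is a minimal, hedged sketch of how agent tokens could reserve a subspace of a transformer's input/output vocabulary in practice. The token names and base model are hypothetical; the paper does not prescribe this particular tooling.

```python
# A minimal sketch of reserving "agent tokens" in a tokenizer's vocabulary so
# that a single transformer can emit discrete actions alongside text, using
# standard Hugging Face special-token handling.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Reserve a subspace of the vocabulary for agentic behaviors, e.g. a robot
# controller's action space or tool-invocation commands (names are invented).
agent_tokens = ["<MOVE_LEFT>", "<MOVE_RIGHT>", "<GRASP>", "<RELEASE>", "<CALL_TOOL>"]
tokenizer.add_special_tokens({"additional_special_tokens": agent_tokens})
model.resize_token_embeddings(len(tokenizer))  # grow input/output embeddings

# After fine-tuning on interaction data, decoded agent tokens can be routed
# to a controller, and their probabilities inspected to explain action choice.
ids = tokenizer("pick up the cup <GRASP>", return_tensors="pt").input_ids
print(tokenizer.convert_ids_to_tokens(ids[0]))
```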

Figure 6: We show the current paradigm for creating multi-modal AI agents by incorporating a Large Language Model (LLM) with a Large Vision Model (LVM). Generally, these models take visual or language inputs and use pre-trained and frozen visual and language models, learning smaller sub-networks that connect and bridge modalities. Examples include Flamingo (Alayrac et al., 2022), BLIP-2 (Li et al., 2023c), InstructBLIP (Dai et al., 2023), and LLaVA (Liu et al., 2023c).

Figure 7: The unified agent multi-modal transformer model. Instead of connecting frozen submodules and using existing foundation models as building blocks, we propose a unified and end-to-end training paradigm for agent systems. We can still initialize the submodules with LLMs and LVMs as in Figure 6, but also make use of agent tokens, specialized tokens for training the model to perform agentic behaviors in a specific domain (e.g., robotics). For more details about agent tokens, see Section 3.2.
与其使用固定的大型语言模型(LLMs)和视觉语言模型(VLMs)来构建AI代理,我们还可以采用一种单一代理的Transformer模型,这种模型可以接受视觉标记和语言标记作为输入,类似于Gato (Reed et al., 2022) 的工作方式。除了视觉和语言输入外,我们还引入了第三种通用类型的输入,我们称之为代理标记(agent tokens)。从概念上讲,代理标记用于为模型的输入和输出空间中保留一个特定的子空间,专门用于代理行为。对于机器人技术或游戏玩法而言,这可能表现为控制器的输入动作空间。当训练代理使用特定工具时,比如图像生成或图像编辑模型,或者进行其他API调用,也可以使用代理标记。如图7所示,我们可以将代理标记与视觉和语言标记结合起来,创建一个多模态代理AI的统一接口 。
与使用大型专有的LLM作为代理相比,使用代理Transformer有几个优势:
首先,该模型可以根据非常具体的代理任务进行定制,这些任务可能难以用自然语言表示(例如控制器输入或其他特定操作)。因此,代理可以从环境互动和领域特定的数据中学习,以提高性能 。
其次,通过访问代理标记的概率,可以更容易地理解模型为何执行或不执行某些特定操作,这有助于调试和改进模型的行为 。
再次,在一些有严格数据隐私要求的领域,比如医疗保健和法律,使用较小规模的代理Transformer可以更好地满足这些要求。因为较小的模型意味着更少的数据暴露风险,同时也更容易实施严格的访问控制措施。
最后,相对较小的代理Transformer可能会比更大的专有语言模型便宜得多。这对于预算有限的研究团队或企业来说尤其重要,因为他们可以在不过度投资的情况下开发出高效的代理系统 。
综上所述,通过采用单一代理Transformer模型并结合多种类型的标记,不仅可以实现对特定任务的高度定制化,还能在确保数据隐私的同时降低成本,从而为多模态代理AI的发展提供了新的可能性。
3.3 Agent Transformer Creation
As shown above in Fig. 5, we can use the new agent paradigm with LLM- and VLM-bootstrapped agents, as well as leverage data generated from large foundation models, to train the agent transformer model to learn to execute specific goals. Within this process, the agent model is trained to be specialized and tailored for specific tasks and domains. This approach allows developers to leverage a pre-existing foundation model’s learned features and knowledge. We show a simplified overview of the process in two steps below:
Define Objectives within the Domain. In order to train the agent transformer, the objectives and the action-space of the agent within the context of each specific environment need to be clearly defined. This includes determining which specific tasks or actions the agent needs to perform and assigning unique agent tokens for each. Furthermore, any automatic rules or procedures that can be used to identify successful completion of tasks can significantly improve the amount of data available for training. Otherwise, foundation-model generated or human-annotated data will be required for training the model. After the data is collected and it is possible to evaluate the performance of the agent, the process of continuous improvement can begin.
Continuous Improvement. Continuous monitoring of the model’s performance and collection of feedback are essential steps in the process. Feedback should be used for further fine-tuning and updates. It is also crucial to ensure that the model does not perpetuate biases or unethical outcomes. This necessitates a careful examination of the training data, regular checks for biases in outputs, and, if needed, training the model to recognize and avoid biases. Once the model achieves satisfactory performance, it can be deployed for the intended application. Continuous monitoring remains vital to ensure that the model performs as expected and to facilitate necessary adjustments. More details on this process, sources of training data, and details surrounding continuous learning for agent AI can be found in Section 8.
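A small sketch of the "define objectives within the domain" step may help: it pairs an action space of unique agent tokens with an automatic success rule that labels episodes for training. All names and the success condition are illustrative assumptions.

```python
# A hedged sketch of defining domain objectives: an action space with unique
# agent tokens plus an automatic success check for labeling training episodes.
from dataclasses import dataclass

ACTION_TOKENS = {
    "pick": "<PICK>",
    "place": "<PLACE>",
    "open_gripper": "<OPEN_GRIPPER>",
}

@dataclass
class Episode:
    instruction: str      # e.g. "put the block in the bin"
    actions: list[str]    # sequence of agent tokens taken
    final_state: dict     # environment state after execution

def task_succeeded(ep: Episode) -> bool:
    # An automatic rule for success detection; rules like this can expand
    # usable training data without foundation-model or human annotation.
    return ep.final_state.get("block_in_bin", False)

def label_for_training(episodes: list[Episode]) -> list[Episode]:
    # Keep successful trajectories for supervised training; failures could
    # instead be retained with negative reward for RL-style updates.
    return [ep for ep in episodes if task_succeeded(ep)]
```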
3.3 代理 Transformer 的创建
如上图 5 所示,我们可以采用新的代理范式,通过 LLM 和 VLM 启动的代理,以及利用大型基础模型生成的数据,来训练代理 Transformer 模型,使其学会执行特定目标。在这一过程中,代理模型将被训练得更为专业化、针对特定任务和领域进行定制。这种方法可以让你利用已有基础模型中学到的特征和知识。下面我们用两个步骤对这一过程做一个简化概述:
1. 在领域内定义目标 为了训练代理 Transformer,必须在每个具体环境的背景下,明确定义代理的目标和其动作空间。这包括确定代理需要执行哪些具体任务或动作,并为每个任务分配独特的 agent tokens。此外,任何可以用来识别任务成功完成的自动规则或程序,都能显著提高训练数据的数量。否则,就需要依赖基础模型生成的数据或人工标注的数据来训练模型。一旦数据收集完毕,并且能够评估代理的表现,持续改进的过程便可启动。
2. 持续改进 持续监控模型的表现和收集反馈是整个过程中的关键步骤。反馈应被用于进一步微调和更新模型。同时,还必须确保模型不会固化偏见或产生不道德的结果,这就需要对训练数据进行仔细审查,定期检查输出中的偏见,并在必要时训练模型识别和规避这些偏见。一旦模型达到令人满意的表现,就可以部署到预期应用中。然而,持续监控仍然至关重要,以确保模型始终按预期工作,并便于进行必要的调整。有关这一过程、训练数据来源以及代理 AI 持续学习的详细信息,请参见第 8 节。
4 Agent AI Learning(智能体人工智能学习体系)
4.1 Strategy and Mechanism(4.1 核心策略与运行机制 )
The strategy of interactive AI across different domains extends the paradigm of calling large foundation models with a trained agent that actively seeks to collect user feedback, action information, and useful knowledge for generation and interaction. Sometimes, the LLM/VLM models do not need to be trained again, and we improve their performance by providing improved contextual prompts at test time for an agent. On the other hand, it always involves knowledge/reasoning/commonsense/inference interactive modeling through a combination of three systems: one performing knowledge retrieval from multi-modal queries, a second performing interactive generation with the relevant agent, and a third training a new, informative policy via self-supervised training or pre-training with reinforcement learning or imitation learning in an improved way.
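As a rough illustration of the three-system combination described above, the sketch below wires together stub versions of the retrieval, generation, and training-data-collection components; every interface here is an assumption, standing in for real retrievers, LLM calls, and training pipelines.

```python
# A minimal sketch of the triple-system loop: retrieval, generation, and
# logging for later self-supervised, RL-, or IL-based training.
def retrieve_knowledge(query: str, scene_description: str) -> list[str]:
    # System 1: knowledge retrieval from a multimodal query. A real system
    # would hit a vector store keyed on text and visual context.
    return [f"retrieved fact relevant to: {query} | {scene_description}"]

def generate_response(query: str, knowledge: list[str]) -> str:
    # System 2: interactive generation by the agent; a stub standing in for
    # an LLM call with retrieved knowledge prepended as context.
    return f"answer to '{query}' grounded in {len(knowledge)} retrieved facts"

training_buffer: list[tuple] = []

def interact(query: str, scene_description: str, feedback_fn) -> str:
    knowledge = retrieve_knowledge(query, scene_description)
    response = generate_response(query, knowledge)
    reward = feedback_fn(response)  # user or environment feedback signal
    # System 3: log the exchange for later training with RL or IL.
    training_buffer.append((query, knowledge, response, reward))
    return response

print(interact("where is the mug?", "kitchen counter with mug", lambda r: 1.0))
```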
4.1.1 Reinforcement Learning (RL) (4.1.1 强化学习(RL)技术演进 )
There is a rich history of leveraging reinforcement learning (RL) to train interactive agents that exhibit intelligent behaviors. RL is a methodology for learning the optimal relationship between states and actions based on rewards (or penalties) received as a result of actions. RL is a highly scalable framework that has been applied to numerous applications, including robotics; however, it generally faces several challenges, and LLM/VLMs have shown their potential to mitigate or overcome some of those difficulties:
● Reward design. The efficiency of policy learning greatly depends on the design of the reward function. Designing the reward function requires not only knowledge of RL algorithms but also a deep understanding of the nature of the task, and thus often necessitates crafting the function based on expert experience. Several studies have explored the use of LLM/VLMs for designing reward functions (Yu et al., 2023a; Katara et al., 2023; Ma et al., 2023); a minimal sketch of this idea appears after this list.
● Data collection and efficiency. Given its exploratory nature, RL-based policy learning requires a significant amount of data (Padalkar et al., 2023). The necessity for extensive data becomes particularly evident when the policy involves managing long sequences or integrating complex actions. This is because these scenarios demand more nuanced decision-making and learning from a wider range of situations. In recent studies, efforts have been directed towards enhancing data generation to support policy learning (Kumar et al., 2023; Du et al., 2023). Additionally, in some studies, these models have been integrated into the reward function to improve policy learning (Sontakke et al., 2023). Parallel to these developments, another strand of research has focused on achieving parameter efficiency in learning processes using VLMs (Tang et al., 2023; Li et al., 2023d) and LLMs (Shi et al., 2023).
● Long-horizon steps. In relation to the issue of data efficiency, RL becomes more challenging as the length of action sequences increases. This is due to the ambiguity in the relationship between actions and rewards, known as the credit assignment problem, and the increase in the number of states to be explored, necessitating a significant amount of time and data. One typical approach for long and complex tasks is to break them down into a sequence of subgoals and apply pretrained policies to solve each subgoal (e.g., (Takamatsu et al., 2022)). This idea falls within the framework called task and motion planning (TAMP) (Garrett et al., 2021). TAMP is composed of two primary components: task planning, which entails identifying sequences of high-level
actions, and motion planning, which involves finding physically consistent, collision-free trajectories to achieve the objectives of the task plan.
LLMs are well-suited to TAMP, and recent research has often adopted an approach where LLMs are used to execute high-level task planning, while low-level controls are addressed with RL-based policies (Xu et al., 2023; Sun et al., 2023a; Li et al., 2023b; Parakh et al., 2023). The advanced capabilities of LLMs enable them to effectively decompose even abstract instructions into subgoals (Wake et al., 2023c), contributing to the enhancement of language understanding abilities in robotic systems.
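Returning to the reward-design bullet above, the following hedged sketch shows one way an LLM could synthesize a reward function as code that is then compiled for use in an RL loop. The prompt, model name, and the bare `exec` call are illustrative; any real system would sandbox and validate the generated code.

```python
# A hedged sketch of LLM-assisted reward design: ask a language model to emit
# a Python reward function for a task description, then compile it for use in
# an RL loop. This mirrors the spirit of the cited works, not their systems.
from openai import OpenAI

client = OpenAI()

def synthesize_reward(task_description: str):
    prompt = (
        "Write a Python function `reward(state: dict) -> float` for this task. "
        "Return only code, no prose.\n\nTask: " + task_description
    )
    code = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    namespace: dict = {}
    exec(code, namespace)  # NOTE: sandbox this in any real system
    return namespace["reward"]

# Example usage (hypothetical task and state keys):
# reward_fn = synthesize_reward("move the end effector within 5 cm of the cup")
# r = reward_fn({"ee_to_cup_dist": 0.03})
```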
4 智能体人工智能学习体系
4.1 核心策略与运行机制
交互式人工智能在不同领域的发展策略,本质上是构建一种新型范式:通过训练智能体主动收集用户反馈、动作信息及知识要素,与大型基础模型形成协同增强系统。这一范式具有两大技术路径:
其一,**模型直接调用优化**。多数情况下无需对LLM/VLM进行重新训练,而是通过测试阶段为智能体设计更优质的上下文提示(contextual prompts),即可显著提升模型表现。例如通过结构化提示工程引导模型输出更符合场景需求的响应。
其二,**交互式建模增强**。通过构建三元协同系统实现知识推理能力的持续进化:
- **多模态知识检索器**:从跨模态数据中提取关联信息(如视觉-语言联合检索)
- **交互式生成器**:基于检索内容进行动态内容生成(如对话、决策建议)
- **自监督训练器**:采用强化学习或模仿学习策略,通过环境反馈持续优化模型参数
这种架构实现了常识推理与专业知识的动态融合,使得智能体在复杂场景中展现出类人的认知迭代能力。
4.1.1 强化学习(RL)技术演进
强化学习作为训练智能决策系统的经典方法,通过"状态-动作-奖励"的循环机制寻找最优策略。尽管RL在机器人控制等领域取得显著成果,但传统方法存在三大核心挑战,而LLM/VLM的出现带来了突破性解决方案:
**▌ 奖励函数设计困境**
策略学习效率高度依赖奖励函数的设计,这不仅需要RL算法知识,更需对任务本质的深刻理解。最新研究表明:
- LLM能自动生成符合人类意图的奖励函数(Yu et al., 2023a)
- VLM通过视觉语义理解辅助设计多模态奖励(Ma et al., 2023)
- 基于语言模型的奖励塑形技术显著降低专家依赖度(Katara et al., 2023)
**▌ 数据饥渴与效率瓶颈**
由于探索性学习特性,传统RL需要海量训练数据(Padalkar et al., 2023),尤其在处理长序列决策时更为明显。前沿解决方案包括:
- 使用VLM生成合成训练数据(Tang et al., 2023)
- 构建LLM驱动的数据增强管道(Shi et al., 2023)
- 将多模态模型嵌入奖励函数提升样本利用率(Sontakke et al., 2023)
**▌ 长时程决策难题**
随着决策链延长,传统RL面临两大困境:
1. **信用分配问题**:难以确定具体动作对最终奖励的贡献度
2. **状态空间爆炸**:需要探索的潜在状态呈指数级增长
突破性方案是**任务与运动规划(TAMP)框架**(Garrett et al., 2021),该框架将复杂任务分解为:
- **高层任务规划**:由LLM将抽象指令解析为子目标序列(Wake et al., 2023c)
- **底层运动控制**:通过RL策略实现物理空间精确操作(Xu et al., 2023)
这种分层架构充分发挥了语言模型的语义理解优势与RL的精细控制能力。例如在机器人操作场景中,LLM可将"准备早餐"指令分解为"打开冰箱→取出食材→操作烤箱"等子步骤,再由RL控制器完成每个步骤的具体动作轨迹规划(Sun et al., 2023a)。这种协同范式正在推动具身智能向更高层级的认知能力进化。
4.1.2 Imitation Learning (IL)(模仿学习)
While RL aims to train a policy based on exploratory behavior and maximizing rewards through interactions with the environment, imitation learning (IL) seeks to leverage expert data to mimic the actions of experienced agents or experts. For example, in robotics, one of the major frameworks based on IL is Behavioral Cloning (BC). BC is an approach where a robot is trained to mimic the actions of an expert by directly copying them. In this approach, the expert’s actions in performing specific tasks are recorded, and the robot is trained to replicate these actions in similar situations. Recent BC-based methods often incorporate technologies from LLM/VLMs, enabling more advanced end-to-end models. For example, Brohan et al. proposed RT-1 (Brohan et al., 2022) and RT-2 (Brohan et al., 2023), transformer-based models that output an action sequence for the base and arm, taking a series of images and language as input. These models are reported to show high generalization performance as the result of training on a large amount of training data.
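A minimal behavioral-cloning sketch may clarify the core idea: BC reduces to supervised regression from observations to recorded expert actions. Dimensions, network shape, and the synthetic "demonstrations" below are placeholders, not the RT-1/RT-2 setup.

```python
# A minimal behavioral-cloning (BC) sketch: supervised regression from
# observations to expert actions recorded during demonstrations.
import torch
from torch import nn

obs_dim, act_dim = 64, 7  # e.g. image features in, arm joint targets out

policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Placeholder tensors standing in for recorded expert demonstrations.
expert_obs = torch.randn(1024, obs_dim)
expert_act = torch.randn(1024, act_dim)

for step in range(1000):
    idx = torch.randint(0, len(expert_obs), (64,))
    pred = policy(expert_obs[idx])
    loss = nn.functional.mse_loss(pred, expert_act[idx])  # imitate the expert
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```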
与强化学习 (Reinforcement Learning, RL) 旨在通过探索性行为以及与环境互动来最大化奖励,从而训练策略不同,模仿学习 (Imitation Learning, IL) 则力求运用专家数据,复刻经验丰富的智能体或专家的行为。例如,在机器人技术领域,一个基于模仿学习 (IL) 的主要框架就是行为克隆 (Behavioral Cloning, BC)。行为克隆 (BC) 是一种训练机器人模仿专家行为的方法,其核心思想是直接复制专家的动作。在这种方法中,首先会记录专家在执行特定任务时的动作,然后训练机器人,使其在类似情境下能够复现这些动作。近期,基于行为克隆 (BC) 的方法常常融合来自大型语言模型 (Large Language Models, LLM) / 大型视觉模型 (Large Vision Models, VLM) 的技术,从而实现更先进的端到端模型。例如,Brohan 等人提出了 RT-1 (Brohan et al., 2022) 和 RT-2 (Brohan et al., 2023),这两种模型都是基于 Transformer 架构,能够以一系列图像和语言为输入,输出机械臂和底座的动作序列。报告显示,由于使用了大量的训练数据进行训练,这些模型展现出了很高的泛化性能。
4.1.3 Traditional RGB(传统RGB视觉技术)
Learning intelligent agent behavior leveraging image inputs has been of interest for many years (Mnih et al., 2015). The inherent challenge of using RGB input is the curse of dimensionality. To solve this problem, researchers either use more data (Jang et al., 2022; Ha et al., 2023) or introduce inductive biases into the model design to improve sample efficiency. In particular, authors incorporate 3D structures into the model architecture for manipulations (Zeng et al., 2021; Shridhar et al., 2023; Goyal et al., 2023; James and Davison, 2022). For robot navigation, authors (Chaplot et al., 2020a,b) leverage maps as a representation. Maps can either be learned from a neural network aggregating all previous RGB inputs or through 3D reconstruction methods such as Neural Radiance Fields (Rosinol et al., 2022).
To obtain more data, researchers synthesize synthetic data using graphics simulators (Mu et al., 2021; Gong et al., 2023b), and try to close the sim2real gap (Tobin et al., 2017; Sadeghi and Levine, 2016; Peng et al., 2018). Recently, there has been some collective effort to curate large-scale dataset that aims to resolve the data scarcity problem (Padalkar et al., 2023; Brohan et al., 2023). On the other hand, to improve sample complexity, data augmentation techniques have been extensively studied as well (Zeng et al., 2021; Rao et al., 2020; Haarnoja et al., 2023; Lifshitz et al., 2023).
4.1.3 传统RGB视觉技术
基于RGB图像输入训练智能体行为的研究已有多年探索(Mnih et al., 2015)。使用RGB数据的核心挑战在于**维度灾难**——高维像素空间带来的学习复杂度激增。为应对这一难题,研究者主要采取两大技术路线:
**▌ 数据规模扩展策略**
- 通过图形仿真器生成合成数据(Mu et al., 2021),并持续改进仿真到现实的迁移效果(sim2real gap)(Tobin et al., 2017)
- 构建大规模真实世界数据集(Padalkar et al., 2023),如机器人操作领域标杆数据集RT-1(Brohan et al., 2023)
**▌ 模型架构创新路径**
- **3D结构先验嵌入**:在机械臂操作任务中,将三维空间表征融入网络架构(Zeng et al., 2021),提升对物体位姿的理解能力(James and Davison, 2022)
- **动态地图构建**:在机器人导航领域,通过累积RGB帧构建环境地图(Chaplot et al., 2020a)。地图生成方式包括:
- 端到端神经网络实时建图(Shridhar et al., 2023)
- 基于神经辐射场(NeRF)的3D场景重建(Rosinol et al., 2022)
- **数据增强优化**:采用时序增强(Haarnoja et al., 2023)、多视角融合(Lifshitz et al., 2023)等技术提升样本利用率,突破传统数据增强方法的局限性(Rao et al., 2020)
值得关注的是,近期研究通过物理仿真引擎(Gong et al., 2023b)生成逼真交互数据,结合域随机化技术(Peng et al., 2018),显著提升了视觉策略在真实场景中的泛化能力。这种虚实结合的训练范式,正在突破传统纯视觉方法在复杂动态环境中的应用瓶颈。
4.1.4 In-context Learning(上下文学习)
In-context learning was shown to be an effective method for solving tasks in NLP with the advent of large language models like GPT-3 (Brown et al., 2020; Min et al., 2022). Few-shot prompts were seen to be an effective way to contextualize model outputs across a variety of NLP tasks by providing examples of the task within the context of the LLM prompt. Factors like the diversity and quality of the examples shown in the in-context demonstrations may improve the quality of model outputs (An et al., 2023; Dong et al., 2022). Within the context of multi-modal foundation models, models like Flamingo and BLIP-2 (Alayrac et al., 2022; Li et al., 2023c) have been shown to be effective at a variety of visual understanding tasks when given only a small number of examples. In-context learning can be further improved for agents within environments by incorporating environment-specific feedback when certain actions are taken (Gong et al., 2023a).
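As a small illustration of few-shot in-context learning, the sketch below assembles task demonstrations directly into a prompt string; no weights are updated, and the example task and API names are invented for illustration.

```python
# A minimal sketch of few-shot in-context learning: task examples live in the
# prompt itself, so the model adapts without any weight updates. Demonstration
# diversity and quality influence output quality, per the discussion above.
FEW_SHOT_EXAMPLES = [
    ("turn on the kitchen light", "light.on(room='kitchen')"),
    ("play some jazz in the living room", "music.play(genre='jazz', room='living_room')"),
]

def build_prompt(user_request: str) -> str:
    lines = ["Translate requests into API calls.\n"]
    for request, api_call in FEW_SHOT_EXAMPLES:
        lines.append(f"Request: {request}\nCall: {api_call}\n")
    lines.append(f"Request: {user_request}\nCall:")
    return "\n".join(lines)

# The resulting string is sent to an LLM, which completes the final "Call:".
print(build_prompt("dim the bedroom lights to 30%"))
```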
4.1.4 上下文学习(In-context Learning)
随着大语言模型(如 GPT-3)的发展(Brown et al., 2020; Min et al., 2022),上下文学习(In-context Learning) 被证明是一种有效的自然语言处理(NLP)任务解决方法。少样本提示(Few-shot Prompting) 通过在大语言模型(LLM)提示中提供任务示例,使模型能够更好地理解并生成符合上下文的输出,在各种 NLP 任务中表现出了良好的效果。
在上下文学习过程中,示例的多样性和质量 可能会影响模型输出的质量(An et al., 2023; Dong et al., 2022)。在多模态基础模型(Multi-modal Foundation Models)领域,Flamingo 和 BLIP-2 等模型(Alayrac et al., 2022; Li et al., 2023c)已被证明,在提供少量示例的情况下,依然能高效地执行各种视觉理解任务。
此外,在具体环境中的智能体(Agent)可以通过结合环境特定的反馈(Environment-specific Feedback)来进一步提升上下文学习能力,例如在执行某些动作后,根据环境反馈调整学习策略(Gong et al., 2023a)。
4.1.5 Optimization in the Agent System(智能体系统中的优化)
The optimization of agent systems can be divided into spatial and temporal aspects. Spatial optimization considers how agents operate within a physical space to execute tasks. This includes inter-robot coordination, resource allocation, and keeping an organized space.
In order to effectively optimize agent AI systems, especially systems with large numbers of agents acting in parallel, previous works have focused on using large batch reinforcement learning (Shacklett et al., 2023). Since datasets of multi-agent interactions for specific tasks are rare, self-play reinforcement learning enables a team of agents to improve over time. However, this may also lead to very brittle agents that can only work under self-play and not with humans or other independent agents since they over-fit to the self-play training paradigm. To address this issue, we can instead discover a diverse set of conventions (Cui et al., 2023; Sarkar et al., 2023), and train an agent that is aware of a wide range of conventions. Foundation models can further help to establish conventions with humans or other independent agents, enabling smooth coordination with new agents.
Temporal optimization, on the other hand, focuses on how agents execute tasks over time. This encompasses task scheduling, sequencing, and timeline efficiency. For instance, optimizing the trajectory of a robot’s arm is an example of efficiently optimizing movement between consecutive tasks (Zhou et al., 2023c). At the level of task scheduling, methods like LLM-DP (Dagan et al., 2023) and ReAct (Yao et al., 2023a) have been proposed to solve efficient task planning by incorporating environmental factors interactively.
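To make the interactive task-planning idea concrete, here is a hedged, minimal sketch of a ReAct-style think/act/observe loop; the tool names and the scripted stand-in for the LLM are assumptions, not the cited systems' code.

```python
# A hedged sketch of a ReAct-style loop: the model interleaves reasoning with
# actions, and environmental observations are folded back into its context.
def react_loop(llm_step, tools: dict, goal: str, max_steps: int = 8):
    context = f"Goal: {goal}\n"
    for _ in range(max_steps):
        thought, action, arg = llm_step(context)    # parsed from LLM output
        context += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "finish":
            return arg
        observation = tools[action](arg)            # execute in the environment
        context += f"Observation: {observation}\n"  # feedback shapes next step
    return None

# Tiny scripted stand-in for the LLM, for illustration only.
script = iter([
    ("I should check the drawer", "open", "drawer"),
    ("The keys are here; done", "finish", "keys"),
])
tools = {"open": lambda arg: f"{arg} is open; keys visible"}
print(react_loop(lambda ctx: next(script), tools, goal="find the keys"))
```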
4.1.5 智能体系统中的优化
智能体系统的优化可以分为空间和时间两个方面。空间优化 主要关注智能体如何在物理空间中执行任务,这包括机器人之间的协调、资源分配以及维持空间的整洁。
为了有效地优化智能体 AI 系统,尤其是涉及大量智能体并行行动的系统,早期的研究主要集中于使用大批量强化学习 (Shacklett et al., 2023)。由于针对特定任务的多智能体交互数据集较为稀缺,自我博弈强化学习使得一组智能体能够随着时间不断提升性能。但这种方法也可能导致智能体过于脆弱,仅适用于自我博弈环境,而难以与人类或其他独立智能体协同工作,因为它们过度适应了自我博弈训练模式。为了解决这一问题,我们可以尝试发掘多样化的协作规则 (Cui et al., 2023; Sarkar et al., 2023),并训练出能够识别多种协作规则的智能体。基础模型还能进一步帮助建立与人类或其他独立智能体之间的协作规则,从而实现与新智能体的顺畅配合。
时间优化 则侧重于智能体如何在时间维度上高效执行任务,这涵盖了任务调度、任务顺序安排和时间线效率的提升。例如,优化机器人手臂的运动轨迹就是在连续任务之间高效调动运动能力的一个典型案例 (Zhou et al., 2023c)。在任务调度层面上,像 LLM-DP (Dagan et al., 2023) 和 ReAct (Yao et al., 2023a) 等方法被提出,通过交互式地融入环境因素,来解决高效任务规划的问题。
4.2 Agent Systems (zero-shot and few-shot level) (智能体系统)
4.2.1 Agent Modules (智能体模块)
Our foray into the agent paradigm involves the development of Agent AI "Modules" for interactive multi-modal agents using LLMs or VLMs. Our initial Agent Modules facilitate training or in-context learning and adopt a minimalist design for the purposes of demonstrating the agent’s ability to schedule and coordinate effectively. We also explored initial prompt-based memory techniques that facilitate better planning and inform future actions within the domain. To illustrate, our “MindAgent" infrastructure comprises 5 main modules: 1) environment perception with task planning, 2) agent learning, 3) memory, 4) general agent action prediction and 5) cognition, as shown in Figure 5.
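Purely as a structural illustration of the five-module layout, the sketch below names one class per module; the class and method signatures are our assumptions, not the actual MindAgent infrastructure.

```python
# A structural sketch of the five-module agent layout; all signatures are
# illustrative placeholders showing how the modules could compose.
class Perception:            # 1) environment perception with task planning
    def observe_and_plan(self, env):
        return {"plan": ["step1"], "state": env}

class Learning:              # 2) agent learning (in-context or finetuning)
    def update(self, feedback):
        pass

class Memory:                # 3) prompt-based memory of past interactions
    def __init__(self):
        self.log = []
    def remember(self, item):
        self.log.append(item)

class ActionPredictor:       # 4) general agent action prediction
    def next_action(self, plan, memory):
        return plan["plan"][0]

class Cognition:             # 5) cognition: reflect on action outcomes
    def reflect(self, action, outcome):
        return f"{action} -> {outcome}"
```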
4.2.2 Agent Infrastructure (智能体基础设施)
Agent-based AI is a large and fast-growing community within the domains of entertainment, research, and industry. The development of large foundation models has significantly improved the performance of agent AI systems. However, creating agents in this vein is limited by the increasing effort necessary to create high-quality datasets and by the overall cost. At Microsoft, building high-quality agent infrastructure has significantly impacted multi-modal agent copilots by using advanced hardware, diverse data sources, and powerful software libraries. As Microsoft continues to push the boundaries of agent technology, AI agent platforms are poised to remain a dominant force in the world of multimodal intelligence for years to come. Nevertheless, agent AI interaction is currently still a complex process that requires a combination of multiple skills. The recent advancements in the space of large generative AI models have the potential to greatly reduce the current high cost and time required for interactive content, both for large studios and by empowering smaller independent content creators to design high-quality experiences beyond what they are currently capable of. The current human-machine interaction systems inside multi-modal agents are primarily rule-based. They do have intelligent behaviors in response to human/user actions and possess web knowledge to some extent. However, these interactions are often limited by the software development costs required to enable specific behaviors in the system. In addition, current models are not designed to help humans achieve a goal when users are unable to complete specific tasks. Therefore, there is a need for an agent AI system infrastructure that analyzes user behaviors and provides proper support when needed.
4.2 智能体系统 (zero-shot and few-shot level)
4.2.1 智能体模块
我们对智能体范式的探索涉及使用 LLMs 或 VLMs 开发交互式多模态智能体的 Agent AI "Modules"。我们最初设计的智能体模块旨在促进训练或 in-context learning,并采用极简设计,以展示智能体在任务调度与协调方面的高效能力。我们还探索了基于提示的初步记忆技术,这些技术有助于更好地规划和指导未来行动策略。 例如,我们的 "MindAgent" 基础设施由五个主要模块组成:1) environment perception with task planning,2) agent learning,3) memory,4) general agent action prediction 以及 5) cognition,如 Figure 5 所示。
4.2.2 智能体基础设施
基于智能体的 AI 是一个涵盖娱乐、研究和工业等多个领域的庞大且快速发展的社区。大型基础模型的发展显著提升了智能体 AI 系统的性能。然而,在这一领域创建智能体面临着高质量数据集制作投入不断增加及总体成本较高的问题。 在 Microsoft,通过采用先进硬件、多样数据源和强大软件库构建高质量的智能体基础设施,已对多模态智能体 copilots 产生了显著影响。随着 Microsoft 不断推动智能体技术的边界,AI 智能体平台有望在未来数年内继续在多模态智能领域中保持主导地位。 不过,目前智能体 AI 的交互仍然是一个复杂的过程,需要多种技能的结合。近期大型生成式 AI 模型领域的进展,有望大幅降低现阶段为制作交互式内容而需投入的高成本和长周期,无论是对大型工作室,还是帮助较小的独立内容创作者设计出超越现有能力的高质量体验。 目前,多模态智能体中的人机交互系统主要是基于规则的,虽然它们能在响应 human/user actions 时展现一定的智能行为,并在一定程度上具备 web knowledge,但这些交互往往受限于为实现特定行为而产生的软件开发成本。此外,目前的模型并非为在用户无法完成特定任务时协助人类达成目标而设计。因此,亟需构建一套智能体 AI 系统基础设施,用以分析用户行为,并在必要时提供恰当支持。
4.3 Agentic Foundation Models (pretraining and finetune level) (具备智能体能力的基础模型 (预训练与微调层面))
The use of pre-trained foundation models offers a significant advantage in their wide applicability across diverse use cases. The integration of these models enables the development of customized solutions for various applications, circumventing the need for extensive labeled datasets for each specific task.
A notable example in the field of navigation is the LM-Nav system (Shah et al., 2023a), which incorporates GPT-3 and CLIP in a novel approach. It effectively uses textual landmarks generated by the language model, anchoring them in images acquired by robots for navigation. This method demonstrates a seamless fusion of textual and visual data, significantly enhancing the capabilities of robotic navigation, while maintaining wide applicability.
In robot manipulation, several studies have proposed the use of off-the-shelf LLMs (e.g., ChatGPT) while using open vocabulary object detectors. The combination of LLM and advanced object detectors (e.g., Detic (Zhou et al., 2022)) fa- cilitates the understanding of human instruction while grounding the textual information in scenery information (Parakh
et al., 2023). Furthermore, the latest advancements showcase the potential of using prompt engineering with advanced multi-modal models such as GPT-4V(ision) (Wake et al., 2023b). This technique opens avenues for multi-modal task planning, underscoring the versatility and adaptability of pre-trained models in a variety of contexts.
4.3 具备智能体能力的基础模型 (预训练与微调层面)
预训练基础模型的使用,因其在各种应用场景中的广泛适用性而具有显著优势。 集成这些模型能够为各种应用开发定制化的解决方案,从而 避免了 (circumventing) 针对每个特定任务都需构建大量标注数据集的需求。
在导航领域,一个值得注意的例子是 LM-Nav 系统 (LM-Nav system) (Shah et al., 2023a)。 该系统以一种新颖的方式融合了 GPT-3 和 CLIP。 它有效地利用了语言模型生成的文本地标 (textual landmarks),并将这些地标锚定在机器人获取的图像中,从而实现导航。 这种方法展示了文本和视觉数据的 无缝融合 (seamless fusion),在保持广泛适用性的同时,显著增强了机器人导航的能力。
在机器人操作 (robot manipulation) 领域,一些研究已经提出了使用 现成的 (off-the-shelf) 大型语言模型 (LLM),例如 ChatGPT,并结合 开放词汇目标检测器 (open vocabulary object detectors)。 LLM 和先进的目标检测器 (例如 Detic (Zhou et al., 2022)) 的结合,有助于理解人类指令,同时将文本信息 扎根于 (grounding) 场景信息之中 (Parakh et al., 2023)。 此外,最新的进展展示了将 提示工程 (prompt engineering) 与先进的多模态模型(例如 GPT-4V(ision) (Wake et al., 2023b))结合使用的潜力。 这种技术为 多模态任务规划 (multi-modal task planning) 开辟了道路,突显了预训练模型在各种环境中的通用性和适应性。
5 Agent AI Categorization(智能体AI分类)
5.1 Generalist Agent Areas(通用智能体领域 )
Computer-based action and generalist agents (GAs) are useful for many tasks. Recent progress in the field of large foundation models and interactive AI has enabled new functionalities for GAs. However, for a GA to become truly valuable to its users, it must be natural to interact with and able to generalize to a broad range of contexts and modalities. We present extended chapters on agent foundation AI in Sec. 6, especially in areas relevant to the following general themes:
Multimodal Agent AI (MMA) is an upcoming forum2 for our research and industry communities to engage with each other and with the broader research and technology communities in Agent AI. Recent progress in the field of large foundation models and interactive AI has enabled new functionalities for generalist agents (GAs), such as predicting user actions and task planning in constrained settings (e.g., MindAgent (Gong et al., 2023a)), fine-grained multimodal video understanding (Luo et al., 2022), and robotics (Ahn et al., 2022b; Brohan et al., 2023), or providing a chat companion for users that incorporates knowledge feedback (e.g., website customer support for healthcare systems (Peng et al., 2023)). More details about representative and recent works are shown below. We hope to discuss our vision for the future of MMA and inspire future researchers to work in this space. This article and our forum cover the following main topics, but are not limited exclusively to these:
• Primary Subject Topics: Multimodal Agent AI, General Agent AI
• Secondary Subject Topics: Embodied Agents, Action Agents, Language-based Agents, Vision & Language Agents, Knowledge and Inference Agents, Agents for Gaming, Robotics, Healthcare, etc.
• Extend Subject Topics: Visual Navigation, Simulation Environments, Rearrangement, Agentic Foundation Models, VR/AR/MR, Embodied Vision & Language.
Next, we present a specific list of representative agent categories as follows:
### 5 智能体AI分类
#### 5.1 通用智能体领域
基于计算机的动作和通用智能体(Generalist Agents, GAs)在许多任务中都非常有用。近年来,大型基础模型(Large Foundation Models)和交互式AI领域的进展为GAs带来了新的功能。然而,为了让一个GA真正对用户有价值,它必须具备自然的交互能力,并能够广泛适用于多种上下文和模态(contexts and modalities)。我们在第6节中详细扩展了关于智能体基础AI的主要章节内容,特别是在与这些主题相关的领域:
**多模态智能体AI(Multimodal Agent AI, MMA)** 是我们研究和产业社区即将举办的一个论坛(forum2),旨在促进研究和产业社区之间以及更广泛的研究和技术社区在智能体AI领域的互动交流。近期,大型基础模型和交互式AI领域的进展为通用智能体(GAs)带来了新的功能,例如在受限环境中预测用户行为和任务规划(如MindAgent (Gong et al., 2023a))、细粒度多模态视频理解(Luo et al., 2022)、机器人技术(Ahn et al., 2022b; Brohan et al., 2023),或者为用户提供结合知识反馈的聊天伴侣(如医疗系统中的网站客户服务 (Peng et al., 2023))。以下是关于代表性工作和最新代表性工作的更多细节。我们希望通过本文讨论对未来多模态智能体AI(MMA)的愿景,并激励未来的研究人员投身这一领域。本文及我们的论坛涵盖以下主要主题,但不仅限于此:
- **主要主题**:多模态智能体AI(Multimodal Agent AI)、通用智能体AI(General Agent AI)
- **次要主题**:具身智能体(Embodied Agents)、动作智能体(Action Agents)、语言基础智能体(Language-based Agents)、视觉与语言智能体(Vision & Language Agents)、知识与推理智能体(Knowledge and Inference Agents)、游戏智能体(Agents for Gaming)、机器人智能体(Robotics)、医疗智能体(Healthcare)等
- **扩展主题**:视觉导航(Visual Navigation)、仿真环境(Simulation Environments)、重排任务(Rearrangement)、智能体基础模型(Agentic Foundation Models)、虚拟现实/增强现实/混合现实(VR/AR/MR)、具身视觉与语言(Embodied Vision & Language)
接下来,我们将列出一些具体的代表性智能体类别如下:
5.2 Embodied Agents(具身智能体)
Our biological minds live in bodies, and our bodies move through a changing world. The goal of embodied artificial intelligence is to create agents, such as robots, which learn to creatively solve challenging tasks requiring interaction with the environment. While this is a significant challenge, important advances in deep learning and the increasing availability of large datasets like ImageNet have enabled superhuman performance on a variety of AI tasks previously thought intractable. Computer vision, speech recognition and natural language processing have experienced transformative revolutions at passive input-output tasks like language translation and image classification, and reinforcement learning has similarly achieved world-class performance at interactive tasks like game playing. These advances have supercharged embodied AI, enabling a growing collection of users to make rapid progress towards intelligent agents that can interact with machines.
5.2.1 Action Agents(行动智能体)
Action agents refer to the agents that need to execute physical actions in the simulated physical environment or real world. In particular, they need to be actively engaging in activities with the environment. We broadly classify action agents into two different categories based on their application domains: gaming AI and robotics.
In gaming AI, the agents will interact with the game environment and other independent entities. In these settings, natural language can enable smooth communication between agents and humans. Depending on the game, there may be a specific task to accomplish, providing a true reward signal. For instance, in the competitive Diplomacy game, training a language model using human conversation data along with an action policy with RL enables human-level play (Meta Fundamental AI Research (FAIR) Diplomacy Team et al., 2022).
注: 2 Current URL: https://multimodalagentai.github.io/
There are also settings where agents act as normal residents in a town (Park et al., 2023a), without trying to optimize a specific goal. Foundation models are useful in these settings because they can model interactions that appear more natural by mimicking human behavior. When augmented with external memory, they produce convincing agents that can hold conversations, keep daily schedules, form relationships, and live a virtual life.
5.2.2 Interactive Agents (交互智能体)
Interactive agents simply refer to agents that can interact with the world, a broader class of agents than action agents. Their forms of interaction do not necessarily require physical actions, but may involve communicating information to users or modifying the environment. For instance, an embodied interactive agent may answer a user’s questions about a topic through dialogue or help users parse through existing information similar to a chatbot. By extending an agent’s capabilities to include information sharing, the core designs and algorithms of Agent AI can be effectively adapted for a range of applications, such as diagnostic (Lee et al., 2023) and knowledge-retrieval (Peng et al., 2023) agents.
5.2 具身智能体(Embodied Agents)
人类生物大脑依托躯体存在,躯体则在动态世界中穿行。具身人工智能(Embodied AI)的目标是创造能通过环境交互学习解决复杂任务的智能体(如机器人)。尽管面临巨大挑战,但深度学习的重要突破与ImageNet等大型数据集的普及,已让AI在诸多曾被视作无解的领域展现超人类水平。计算机视觉、语音识别和自然语言处理已在翻译、图像分类等被动输入输出任务中实现变革,强化学习(Reinforcement Learning)同样在游戏竞技等交互任务中达到世界级水平。这些突破极大推动了具身AI发展,使越来越多的开发者能快速构建可与机器交互的智能体。
5.2.1 行动智能体(Action Agents)
行动智能体指需在仿真物理环境或现实世界中执行实体动作的智能体,其核心特征是与环境保持主动互动。根据应用领域,我们将其分为两大方向:游戏AI与机器人技术(Robotics)。
在游戏AI中,智能体需与游戏环境及其他独立实体交互。此类场景中,自然语言能实现人机流畅沟通。部分游戏设有需完成的特定任务,可提供明确的奖励信号。例如在策略游戏《Diplomacy》中,结合人类对话数据训练语言模型,并通过强化学习制定行动策略,可使AI达到人类玩家水平(Meta Fundamental AI Research (FAIR) Diplomacy Team et al., 2022)。
另有些场景则让智能体扮演虚拟城镇的普通居民(Park et al., 2023a),无需追求特定目标。大模型(Foundation Models)在此类场景中优势显著——通过模仿人类行为,其交互表现更趋自然。当结合外部记忆增强时,这些智能体可展现出令人信服的虚拟生活:日常对话、行程安排、社交关系构建等。
5.2.2 交互智能体(Interactive Agents)
交互智能体泛指可与世界互动的智能体,其范畴广于行动智能体。它们的交互形式不限于实体动作,可能涉及信息传递或环境修改。例如,具身交互智能体可通过对话回答用户问题(类似聊天机器人),或协助用户解析既有信息。通过赋予智能体信息共享能力,Agent AI的核心设计与算法可有效适配诊断型(Lee et al., 2023)、知识检索型(Peng et al., 2023)等多种应用场景。
注2:当前论坛地址:https://multimodalagentai.github.io/
5.3 Simulation and Environments Agents (仿真与环境智能体)
An effective approach for AI agents to learn how to act in an environment is to go through trial-and-error experiences via interactions with the environment. A representative method is RL, which requires extensive experience of failures to train an agent. Although there exist approaches that use physical agents (Kalashnikov et al., 2018), using physical agents is time-consuming and costly. Furthermore, training in the physical environment is often infeasible when failures in actual environments can be dangerous (e.g., autonomous driving, underwater vehicles). Hence, using simulators to learn policies is a common approach.
Many simulation platforms have been proposed for research in embodied AI, ranging from navigation (Tsoi et al., 2022; Deitke et al., 2020; Kolve et al., 2017) to object manipulation (Wang et al., 2023d; Mees et al., 2022; Yang et al., 2023a; Ehsani et al., 2021). One example is Habitat (Savva et al., 2019; Szot et al., 2021), which provides a 3D indoor environment where human- and robotic-agents can perform various tasks such as navigation, instruction following, and question answering. Another representative simulation platform is VirtualHome (Puig et al., 2018), supporting human avatars for object manipulation in 3D indoor environments. In the field of gaming, Carroll et al. have introduced "Overcooked-AI," a benchmark environment designed to study cooperative tasks between humans and AI (Carroll et al., 2019). Along similar lines, several works aim to incorporate real human intervention beyond the focus of interaction between agents and the environment (Puig et al., 2023; Li et al., 2021a; Srivastava et al., 2022). These simulators contribute to the learning of policies in practical settings involving agent and robot interactions, and IL-based policy learning utilizing human demonstrative actions.
In certain scenarios, the process of learning a policy may necessitate the integration of specialized features within simulators. For example, in the case of learning image-based policies, realistic rendering is often required to facilitate adaptability to real environments (Mittal et al., 2023; Zhong et al., 2023). Utilizing a realistic rendering engine is effective for generating images that reflect various conditions, such as lighting environments. Moreover, simulators employing physics engines are required to simulate physical interactions with objects (Liu and Negrut, 2021). The integration of physics engines in simulation has been shown to facilitate the acquisition of skills that are applicable in real-world scenarios (Saito et al., 2023).
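The trial-and-error loop that simulators enable can be summarized in a few lines. The sketch below uses the generic Gymnasium API with a toy task as a stand-in for an embodied-AI simulator such as Habitat, and random actions as a stand-in for a learned policy.

```python
# A minimal sketch of simulator-based trial-and-error learning with the
# Gymnasium API; a real setup would swap in an embodied-AI simulator and a
# learned policy instead of random actions.
import gymnasium as gym

env = gym.make("CartPole-v1")  # illustrative stand-in for a robot simulator
obs, info = env.reset(seed=0)

for episode in range(5):
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()  # replace with policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated      # failures are cheap in simulation
    print(f"episode {episode}: return={total_reward}")
    obs, info = env.reset()
env.close()
```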
5.3 仿真与环境智能体(Simulation and Environments Agents)
AI智能体学习环境交互的最佳路径之一,是通过试错机制积累经验。典型方法如强化学习(RL)就需通过大量失败训练来培养智能体。虽然存在使用实体机器人(physical agents)进行训练的方法(Kalashnikov等,2018),但其耗时耗力成本高昂。更关键的是,在自动驾驶、水下机器人等可能引发真实危险的应用中,物理环境训练往往不可行。因此,借助仿真平台(simulators)学习行为策略成为主流选择。
当前具身AI领域已涌现众多仿真平台:从视觉导航(navigation)(Tsoi等,2022;Deitke等,2020;Kolve等,2017)到物体操控(object manipulation)(Wang等,2023d;Mees等,2022;Yang等,2023a;Ehsani等,2021)均有覆盖。例如Habitat(Savva等,2019;Szot等,2021)构建了3D室内环境,支持人形/机器人智能体执行导航、指令跟随、问答等任务;VirtualHome(Puig等,2018)则专注于3D室内场景中的物体操控训练。游戏领域,Carroll等开发的《Overcooked-AI》(Carroll等,2019)成为研究人机协作的标杆环境。近年研究更突破智能体-环境交互的局限,尝试融入真实人类干预(Puig等,2023;Li等,2021a;Srivastava等,2022)。这些仿真系统不仅助力智能体/机器人交互策略学习,还可通过模仿学习(IL)获取人类示范动作。
特定场景的策略学习需集成仿真器专项功能:例如基于视觉的策略训练常需真实感渲染(realistic rendering)技术,以提升现实环境适应性(Mittal等,2023;Zhong等,2023)。高级渲染引擎能生成不同光照条件下的逼真图像;而搭载物理引擎(physics engines)的仿真器则能模拟物体间的力学交互(Liu和Negrut,2021)。研究表明,物理仿真训练获得的技能可有效迁移至现实场景(Saito等,2023)。
5.4 Generative Agents (生成式智能体)
The recent advancements in the space of large generative AI models have the potential to greatly reduce the current high cost and time required for interactive content, both for large gaming studios, as well as empower smaller independent studios to create high quality experiences beyond what they are currently capable of. Additionally, embedding large AI models within a sandbox environment will allow users to author their own experiences and express their creativity in ways that are currently out of reach.
The goals of this agent go beyond simply adding interactive 3d content to scenes, but also include:
• Adding arbitrary behavior and rules of interactions to the objects, allowing the user to create their own VR rules with minimal prompting.
• Generating whole level geometry from a sketch on a piece of paper, by using the multimodal GPT4-v model, as well as other chains of models involving vision AI models.
• Retexturing content in scenes using diffusion models (a sketch of this step appears below).
• Creating custom shaders and visual special effects from simple user prompts
One potential application in the short term is the VR creation of a storyboarding/prototype tool allowing a single user to create a rough (but functional) sketch of an experience/game an order of magnitude faster than currently feasible. Such a prototype then could be expanded and made more polished using these tools as well.
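Of the goals above, retexturing with diffusion models maps most directly onto existing tooling. The sketch below runs an image-to-image diffusion pass over an exported texture using the `diffusers` library; the checkpoint, file names, and strength value are illustrative assumptions.

```python
# A hedged sketch of the "retexture content with diffusion models" goal: an
# image-to-image diffusion pass over a texture exported from the scene.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("wall_texture.png").convert("RGB")  # exported texture
result = pipe(
    prompt="weathered medieval stone wall, tileable texture",
    image=source,
    strength=0.6,        # how far to move away from the original texture
    guidance_scale=7.5,
).images[0]
result.save("wall_texture_retextured.png")  # re-imported into the engine
```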
### 5.4 生成式智能体(Generative Agents)
近年来,大型生成式AI模型领域的技术进步展现出巨大潜力,不仅能大幅降低制作交互内容所需的高成本和时间,还能够让大型游戏工作室和小型独立开发团队创造出超越当前能力的高质量体验。此外,将这些大型AI模型嵌入沙盒环境(sandbox environment),用户可以创作属于自己的内容,并以目前无法实现的方式表达创造力。
---
#### **生成式智能体的目标**
这类智能体的作用远不止于简单地为场景添加交互式的3D内容,还包括以下几个方面:
- **为对象添加任意行为和交互规则**
用户可以通过最少的提示,轻松创建自己的虚拟现实(VR)规则。
- **从纸上的草图生成完整关卡几何结构**
借助多模态GPT4-v模型以及其他涉及视觉AI模型的链条模型,可以从手绘草图快速生成整个关卡的几何结构。
- **使用扩散模型重新纹理化场景内容**
利用扩散模型对场景中的物体进行重新贴图,让画面更加逼真。
- **根据简单用户提示生成自定义着色器和视觉特效**
用户只需提供简单的指令,就能生成独特的着色器和视觉特效,为内容增添艺术感。
---
#### **短期潜在应用**
短期内,生成式智能体的一个重要应用场景是开发一种故事板/原型设计工具(storyboarding/prototype tool)。通过这种工具,单个用户能够以比现有方法快十倍的速度,创建出一个粗糙但功能完整的体验或游戏草图。随后,开发者还可以利用这些工具进一步扩展和完善原型,使其变得更加精致和完善。
---
### 科普总结
生成式智能体正在改变内容创作的规则。通过结合先进的AI技术和创意工具,它不仅能够大幅缩短开发周期,还能让用户以更自由、更高效的方式表达自己的创意。无论是从一张草图开始构建复杂的虚拟世界,还是根据简单提示生成精美的视觉效果,生成式智能体都为创作者提供了强大的支持,推动了虚拟现实和游戏开发领域迈向新的高度!
5.4.1 AR/VR/mixed-reality Agents (增强现实/虚拟现实/混合现实智能体)
AR/VR/mixed-reality (jointly referred to as XR) settings currently require skilled artists and animators to create characters, environments, and objects to be used to model interactions in virtual worlds. This is a costly process that involves concept art, 3D modeling, texturing, rigging, and animation. XR agents can assist in this process by facilitating interactions between creators and building tools to help build the final virtual environment.
Our early experiments have already demonstrated that GPT models can be used in the few-shot regime inside of the Unity engine (without any additional fine-tuning) to call engine-specific methods, use API calls to download 3D models from the internet and place them into the scene, and assign state trees of behavior and animations to them (Huang et al., 2023a). This behavior likely emerges due to the presence of similar code in open-source game repositories that use Unity. Therefore, GPT models are capable of building rich visual scenes by loading many objects into the scene from a simple user prompt.
The aim of this category of agents is to build a platform and a set of tools that provide an efficient interface between large AI models (both GPT-family ones as well as diffusion image models) and a rendering engine. We explore two primary avenues here:
• Integration of large models into the various editor tools in the agent infrastructure, allowing for significant speedups in development.
• Controlling the rendering engine from within a user experience, by generating code that follows user instructions and then compiling it at runtime, allowing users to potentially edit the VR/simulation they are interacting with in arbitrary ways, even by introducing new agent mechanics (a sketch of this runtime code-generation loop follows below).
Introducing an AI copilot focused on XR settings would be useful for XR creators, who can use the copilot to complete tedious tasks, like providing simple assets or writing code boilerplate, freeing creators to focus on their creative vision and quickly iterate on ideas.
Furthermore, agents can help users interactively modify the environment by adding new assets, changing the dynamics of the environment, or building new settings. This form of dynamic generation during runtime can also be specified by a creator, enabling the user’s experience to feel fresh and continue evolving over time.
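Here is a minimal, hedged sketch of the runtime code-generation loop referenced above: an instruction is turned into engine-facing code and executed against a small, whitelisted API surface. The `scene_api` shape and the canned "generated" code are assumptions; a real system would have an LLM produce the code and would validate it before execution.

```python
# A hedged sketch of runtime code generation: instruction -> code -> execution
# against a whitelisted scene API. All names here are illustrative.
def generate_code(instruction: str) -> str:
    # In practice: an LLM call with the scene API documented in the prompt.
    return "scene_api['spawn']('tree', x=1.0, y=0.0)"

def run_instruction(instruction: str, scene_api: dict):
    code = generate_code(instruction)
    # Expose only whitelisted engine functions to the generated code; a real
    # system would also sandbox and verify the code before running it.
    exec(code, {"scene_api": scene_api})

demo_api = {"spawn": lambda name, x, y: print(f"spawned {name} at ({x}, {y})")}
run_instruction("add a tree to the scene", demo_api)
```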
### 5.4.1 增强现实/虚拟现实/混合现实智能体(AR/VR/Mixed-Reality Agents)
当前,增强现实(AR)、虚拟现实(VR)和混合现实(MR,统称为XR)环境的构建需要熟练的艺术家和动画师来创建角色、环境和物体,以用于模拟虚拟世界中的交互。这一过程成本高昂,涉及概念设计、3D建模、纹理处理、骨骼绑定以及动画制作等多个步骤。XR智能体可以通过促进创作者之间的协作,并提供工具帮助构建最终的虚拟环境,从而协助这一复杂过程。
---
#### **早期实验成果**
我们的早期实验已经证明,GPT模型可以在Unity引擎中以少量样本(few-shot regime)的形式运行(无需额外微调),完成以下任务:调用引擎特定的方法、通过API从互联网下载3D模型并将其放置到场景中,以及为这些模型分配行为状态树和动画(Huang等人, 2023a)。这种能力可能源于Unity开源游戏仓库中存在类似的代码片段。因此,GPT模型能够根据简单的用户提示加载大量对象到场景中,从而构建丰富的视觉场景。
---
#### **XR智能体的目标**
这类智能体的主要目标是搭建一个平台,并开发一套工具,为大型AI模型(包括GPT系列模型以及扩散图像模型)与渲染引擎之间提供高效的接口。我们在此领域探索了两条主要路径:
1. **将大型模型集成到智能体基础设施的各种编辑工具中**
这种集成可以显著加速开发过程,使开发者能够更高效地完成任务。
2. **在用户体验内控制渲染引擎**
通过生成遵循用户指令的代码并在运行时编译,用户可以灵活地编辑他们正在交互的VR或模拟环境,甚至引入新的智能体机制。
---
#### **XR创作助手的价值**
引入专注于XR设置的AI辅助工具(copilot)对于XR创作者来说非常有用。创作者可以利用该工具完成繁琐的任务,例如生成简单的资产或编写代码模板,从而解放创作者的时间和精力,让他们专注于创意愿景,并快速迭代想法。
---
#### **动态环境生成**
此外,智能体还可以帮助用户通过添加新资产、改变环境动态或构建新场景来实时交互式修改环境。这种运行时的动态生成方式可以根据创作者的需求进行定制,使用户的体验始终保持新鲜感,并随着时间推移不断演进。
---
### 科普总结
AR/VR/MR领域的智能体研究正在推动虚拟内容创作的边界。通过结合大型AI模型和渲染引擎,这些智能体不仅能够自动化繁琐的任务,还能为创作者提供强大的工具支持,使他们能够更专注于创意本身。无论是简化资产生成还是实现实时环境修改,XR智能体都有望为虚拟现实技术的发展注入更多活力,让虚拟世界的构建变得更加高效和富有创意!
5.5 Knowledge and Logical Inference Agents (知识与逻辑推理智能体)
The capacity to infer and apply knowledge is a defining feature of human cognition, particularly evident in complex tasks such as logical deduction and understanding theory of mind3. Making inferences on knowledge ensures that the AI’s responses and actions are consistent with known facts and logical principles. This coherence is a crucial mechanism for maintaining trust and reliability in AI systems, especially in critical applications like medical diagnosis or legal analysis. Here, we introduce agents that incorporate the interplay between knowledge and inference to address specific facets of intelligence and reasoning.
注: 3 https://plato.stanford.edu/entries/cognitive-science
5.5.1 Knowledge Agent (知识智能体)
Knowledge Agents reason over their acquired knowledge systems in two directions: implicit and explicit. Implicit knowledge is typically what large-scale language models like the GPT series (Brown et al., 2020; OpenAI, 2023) encapsulate after being trained on vast amounts of text data. These models can generate responses that give the impression of understanding, as they draw on patterns and information implicitly learned during training. Explicit knowledge, conversely, is structured and can be directly queried, such as the information found in knowledge bases or databases, which was traditionally used to enhance AI reasoning capabilities by referencing verifiable external resources.
Despite the advancements in language models, their implicit knowledge is static and becomes outdated as the world evolves (Lewis et al., 2020; Peng et al., 2023). This limitation necessitates the integration of explicit knowledge sources that are updated continuously, ensuring that AI systems can provide accurate and current responses. The fusion of implicit and explicit knowledge equips AI agents with a more nuanced understanding and the ability to apply knowledge contextually, akin to human intelligence (Gao et al., 2022). Such integration is crucial for crafting knowledge-centric AI agents that not only possess information but can also understand, explain, and employ it, thereby narrowing the chasm between extensive learning and profound knowledge (Marcus and Davis, 2019; Gao et al., 2020). These agents are designed to reason with flexibility and dynamic information about the world, enhancing their robustness and adaptability (Marcus, 2020).
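A tiny sketch of fusing explicit and implicit knowledge may help: retrieved facts from a queryable store are injected into the prompt, and the language model supplies fluent synthesis on top. The store contents and `llm` callable are placeholders.

```python
# A minimal sketch of fusing explicit knowledge (a queryable, updatable store)
# with a language model's implicit knowledge via prompt injection.
KNOWLEDGE_BASE = {
    "aspirin": "Aspirin is contraindicated with warfarin due to bleeding risk.",
}

def lookup(query: str) -> list[str]:
    # Explicit knowledge: structured, verifiable, and easy to keep current.
    return [fact for key, fact in KNOWLEDGE_BASE.items() if key in query.lower()]

def answer(query: str, llm) -> str:
    facts = lookup(query)
    prompt = "Known facts:\n" + "\n".join(facts) + f"\n\nQuestion: {query}\nAnswer:"
    # The LLM contributes implicit knowledge and fluent synthesis on top of
    # the grounded, up-to-date facts placed in the prompt above.
    return llm(prompt)

# Example usage: answer("Can I take aspirin with warfarin?", llm=my_model)
```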
### 5.5 知识与逻辑推理智能体(Knowledge and Logical Inference Agents)
推导和应用知识的能力是人类认知的标志性特征,尤其在复杂的任务中表现得尤为明显,例如逻辑演绎和理解“心智理论”(Theory of Mind)。通过对知识进行推理,确保AI的响应和行为与已知事实和逻辑原则保持一致。这种一致性是维持AI系统信任和可靠性的关键机制,特别是在医疗诊断或法律分析等关键应用中尤为重要。在这里,我们介绍一种将知识与推理相结合的智能体,它们针对智能和推理的特定方面进行了优化。
---
#### **5.5.1 知识智能体(Knowledge Agent)**
知识智能体在其获取的知识体系上以两种方向进行推理:隐性知识(implicit knowledge)和显性知识(explicit knowledge)。
- **隐性知识**
隐性知识通常是像GPT系列这样的大规模语言模型(Brown等人, 2020;OpenAI, 2023)在训练大量文本数据后所封装的内容。这些模型能够生成看似具备理解能力的响应,因为它们依赖于训练过程中隐式学习到的模式和信息。
- **显性知识**
相比之下,显性知识是结构化的,可以直接查询,例如存储在知识库或数据库中的信息。传统上,显性知识通过引用可验证的外部资源来增强AI的推理能力。
尽管语言模型取得了显著进展,但其隐性知识是静态的,并随着世界的演变而逐渐过时(Lewis等人, 2020;Peng等人, 2023)。这一局限性要求我们将不断更新的显性知识源整合到AI系统中,以确保其能够提供准确且与时偕行的响应。隐性知识与显性知识的融合赋予了AI智能体更细致的理解能力,并使其能够像人类智能一样,在具体语境中应用知识(Gao等人, 2022)。这种整合对于构建以知识为核心的AI智能体至关重要,这些智能体不仅具备信息,还能理解、解释并运用这些信息,从而缩小广泛学习与深刻知识之间的鸿沟(Marcus和Davis, 2019;Gao等人, 2020)。这些智能体被设计为能够灵活地、动态地对世界信息进行推理,从而提升其鲁棒性和适应性(Marcus, 2020)。
---
### 科普总结
知识与逻辑推理智能体的目标是让AI具备像人类一样的推理能力。通过结合隐性知识(从大量数据中学习到的模式)和显性知识(结构化、可查询的信息),这些智能体能够在不同场景下灵活应用知识,提供准确且可靠的响应。无论是医疗诊断还是法律分析,这种能力都为AI系统的可信度和实用性奠定了基础。未来,随着技术的进步,知识与逻辑推理智能体将进一步推动AI向更加智能化和人性化的方向发展!
5.5.2 Logic Agents (逻辑智能体)
Generally, a logic agent is a component of a system designed to apply logical reasoning to process data or solve tasks that require logical inference or logical reasoning. Logic agents within the context of large foundation models like GPT-4 refer to specialized components or submodules designed to handle logical reasoning tasks. These tasks often involve understanding and manipulating abstract concepts, deducing conclusions from given premises, or solving problems that require a structured, logical approach. Broadly, foundation models like GPT-4 are trained on a vast corpus of text data and learn to perform a wide range of tasks, including those that require some form of logical reasoning. Thus, their capability for logical reasoning is integrated into the overall architecture, and they generally do not possess a distinct, isolated "logic agent". While GPT-4 and similar models can perform tasks that involve logic, their approach is fundamentally different from how humans or traditional logic-based systems operate. They do not follow formal logical rules or have an explicit understanding of logic; rather, they generate responses based on patterns learned from the training data. As a result, their performance in logical tasks can be impressive, but it can also be inconsistent or limited by the nature of the training data and the inherent limitations of the model’s design. One example of embedding a separate logical submodule into the architecture is (Wang et al., 2023e), which modifies the token embedding process used by LLMs during pre-training by parsing text into logical segments and explicitly modeling logical hierarchies in the token embeddings.
### 5.5.2 逻辑智能体(Logic Agents)
一般来说,逻辑智能体是系统中的一个组件,旨在通过逻辑推理来处理数据或解决与逻辑推导或逻辑推理相关的特定任务。在像GPT-4这样的大型基础模型的背景下,逻辑智能体指的是专门设计用于处理逻辑推理任务的组件或子模块。这些任务通常涉及理解和操作抽象概念、从给定的前提中推导结论,或者解决需要结构化逻辑方法的问题。
总体而言,像GPT-4这样的基础模型是在海量文本数据上训练的,并能够执行广泛的任务,包括那些需要某种形式逻辑推理的任务。因此,其逻辑推理能力被集成到整体架构中,通常并没有一个独立的、专门的“逻辑智能体”。尽管GPT-4和类似的模型可以完成涉及逻辑的任务,但它们的方法与人类或传统的基于逻辑的系统运作方式有根本区别。这些模型并不遵循正式的逻辑规则,也没有对逻辑的显式理解;相反,它们根据从训练数据中学到的模式生成响应。因此,虽然它们在逻辑任务中的表现可能令人印象深刻,但也可能因训练数据的性质和模型设计的固有限制而不一致或受到限制。
一个将独立逻辑子模块嵌入架构的例子是(Wang等人, 2023e),该研究通过将文本解析为逻辑片段,并在预训练过程中显式建模逻辑层次结构,修改了大语言模型(LLM)使用的标记嵌入过程。
---
### 科普总结
逻辑智能体是AI系统中负责逻辑推理的部分,它帮助AI处理复杂的逻辑问题,例如推导结论或解决需要结构化思维的问题。然而,在像GPT-4这样的大型基础模型中,逻辑推理能力并非由一个独立的模块实现,而是融合在整个模型架构中。这些模型依赖于从训练数据中学习到的模式来生成响应,而不是严格遵循正式的逻辑规则。这种机制使得它们在某些逻辑任务中表现出色,但也可能导致结果不一致或受限于训练数据的质量。为了改进这一点,研究人员正在探索将显式的逻辑推理模块嵌入到模型架构中的方法,以进一步提升AI的逻辑推理能力。
5.5.3 Agents for Emotional Reasoning (情感推理智能体)
Emotional understanding and empathy are important skills for agents in many human-machine interactions. To illustrate, one important goal for creating engaging dialogue agents is to have the agents act with increased emotion and empathy while minimizing socially inappropriate or offensive outputs. To advance towards this goal for dialogue agents, we released the Neural Image Commenting with Empathy (NICE) dataset (Chen et al., 2021) consisting of almost two million images and the corresponding human-generated comments and a set of human emotion annotations. We also provided a novel pre-training model - Modeling Affect Generation for Image Comments (MAGIC) (Chen et al., 2021) - which aims to generate comments for images, conditioned on linguistic representations that capture style and affect, and to help generate more empathetic, emotional, engaging and socially appropriate comments. Our experiments show that the approach is effective in training a more human-like and engaging image comment agent. Developing empathy-aware agents is a promising direction for interactive agents, and it is important to create agents with emotional understanding capabilities across a wide range of groups and populations, especially considering that many current language models exhibit bias in their emotional understanding and empathetic reasoning capabilities (Mao et al., 2022; Wake et al., 2023d).
### 5.5.3 情感推理智能体(Agents for Emotional Reasoning)
情感理解和共情是许多人机交互中智能体的重要技能。例如,创建引人入胜的对话型智能体的一个重要目标是让这些智能体在增加情感和共情的同时,尽量减少社会不适当或冒犯性的输出。为了朝着这一目标推进对话型智能体的发展,我们发布了 **Neural Image Commenting with Empathy (NICE)** 数据集(Chen等人, 2021),该数据集包含近两百万张图像及其对应的人工生成评论和一组人类标注的情感信息。我们还提出了一种新的预训练模型——**Modeling Affect Generation for Image Comments (MAGIC)**(Chen等人, 2021),该模型旨在根据捕捉风格和情感的语义表示生成图像评论,从而帮助生成更具共情、情感丰富、引人入胜且社会适当的评论。我们的实验表明,这种方法在训练更像人类且更具吸引力的图像评论智能体方面非常有效。
开发具备共情意识的智能体是一个很有前景的方向。对于互动型智能体而言,尤为重要的是创建具有情感理解能力的智能体,以覆盖广泛的群体和人口。特别是考虑到许多当前的语言模型在其情感理解和共情推理能力上表现出偏见(Mao等人, 2022;Wake等人, 2023d),这一点显得尤为重要。
---
### 科普总结
情感推理智能体是AI技术中的一个重要方向,旨在让机器更好地理解和表达情感,增强与人类的互动体验。例如,在对话系统中,通过引入情感标注的数据集和新型预训练模型,可以让智能体生成更加共情、富有情感且社会适当的回复。这种技术不仅提升了用户体验,还为开发更人性化的AI奠定了基础。然而,需要注意的是,当前许多语言模型在情感理解方面仍然存在偏见,因此开发公平且包容的情感推理智能体仍然是一个重要的研究方向。
5.5.4 Neuro-Symbolic Agents (神经-符号智能体)
Neuro-symbolic agents operate on a hybrid system of neurons and symbols (d'Avila Garcez and Lamb, 2020). Solving problems stated in natural language is a challenging task because it requires explicitly capturing the discrete symbolic structural information implicit in the input. However, most general neural sequence models do not explicitly capture such structural information, limiting their performance on these tasks. Chen et al. (2020) propose a new encoder-decoder model, TP-N2F, built on structured neural representations. The encoder of TP-N2F employs Tensor Product Representation (TPR) "binding" to encode natural-language symbolic structure in vector space, and the decoder uses TPR "unbinding" to generate, in symbolic space, a sequential program represented by relational tuples, each consisting of a relation (or operation) and a number of arguments.
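To make the binding/unbinding idea concrete, here is a minimal numerical sketch of TPR-style binding and unbinding; the dimensions and the random fillers and roles are illustrative assumptions for this example, not the TP-N2F setup itself:

```python
import numpy as np

# Binding: a structure is encoded as a sum of outer products filler_i ⊗ role_i.
# Unbinding: contracting with a role vector recovers the corresponding filler.
d_filler, d_role = 8, 4
rng = np.random.default_rng(0)

# Orthonormal role vectors (rows of an orthogonal matrix) allow exact unbinding.
roles = np.linalg.qr(rng.normal(size=(d_role, d_role)))[0]
fillers = rng.normal(size=(3, d_filler))  # e.g. symbols "max", "a", "b"

T = sum(np.outer(f, r) for f, r in zip(fillers, roles))  # TPR "binding"

recovered = T @ roles[1]                                  # TPR "unbinding"
assert np.allclose(recovered, fillers[1])                 # exact recovery
```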
5.5.5 Instruction following vision-language (VL) models (视觉-语言指令跟随模型)
Instruction-following vision-language (VL) models like GPT-4 offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only for supporting reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. In (Park et al., 2023b), we build a Localized Visual Commonsense model which allows users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL reasoning models compared to a baseline that passes a generated referring expression.
5.6 LLM and VLM Agents (大型语言模型和视觉语言模型智能体)
A number of works leverage LLMs as agents to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), drawing on the LLMs' internet-scale domain knowledge and zero-shot planning abilities to perform agentic tasks like planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instructions into a sequence of subtasks, either in natural language form or in Python code, and then using a low-level controller to execute these subtasks; a minimal sketch of this pattern follows below. Additionally, (Huang et al., 2022b), (Liang et al., 2022), and (Wang et al., 2023a) incorporate environmental feedback to improve task performance. There have also been a number of works demonstrating that general-purpose, visually-aligned large language models trained on large-scale text, image, and video data can serve as a foundation for creating multimodal agents that are embodied and can act in various environments (Baker et al., 2022; Driess et al., 2023; Brohan et al., 2023).
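As a hedged illustration of this decompose-then-execute pattern (the `llm` callable, the skill names, and the controller interface below are placeholders for the sketch, not any specific system's API):

```python
PLANNER_PROMPT = """Decompose the instruction into an ordered list of primitive
skills, one per line, chosen from: pick(obj), place(obj, loc), open(obj), close(obj).

Instruction: {instruction}
Plan:"""

def plan(instruction: str, llm) -> list[str]:
    """Ask the LLM for a subtask sequence, one skill call per line."""
    completion = llm(PLANNER_PROMPT.format(instruction=instruction))
    return [line.strip() for line in completion.splitlines() if line.strip()]

def execute(steps: list[str], controller) -> None:
    """Hand each subtask to a low-level controller for execution."""
    for step in steps:
        controller.run(step)  # e.g. maps "pick(red_apple)" to motor commands
```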
### Plain-Language Summary
Large language models (LLMs) and vision-language models (VLMs) play an important role in agent development. They can draw on rich knowledge for task planning and reasoning, and they can decompose natural-language instructions into concrete steps that guide robots through complex tasks. Incorporating environmental feedback makes these agents more efficient and precise. In addition, models trained on multimodal data provide an important foundation for building multimodal agents that can act in diverse environments.
6 Agent AI Application Tasks (人工智能代理的应用任务)
6.1 Agents for Gaming (游戏中的智能代理)
Games provide a unique sandbox to test the agentic behavior of LLMs and VLMs, pushing the boundaries of their collaborative and decision-making abilities. We describe three areas in particular that highlight agents' abilities to interact with human players and other agents, as well as their ability to take meaningful actions within an environment.
6.1.1 NPC Behavior (非玩家角色NPC 行为)

In modern gaming systems, the behavior of Non-Player Characters (NPCs) is predominantly dictated by predefined scripts crafted by developers. These scripts encompass a range of reactions and interactions based on various triggers or player actions within the gaming environment. However, this scripted nature often results in predictable or repetitive NPC behavior that fails to evolve in response to players' actions or the dynamic environment of the game. This rigidity hampers the immersive experience intended in a dynamic gaming environment. There is therefore a burgeoning interest in leveraging LLMs to induce autonomy and adaptability in NPC behavior, making interactions more nuanced and engaging. AI-driven NPCs can learn from player behavior, adapt to varying strategies, and provide a more challenging and less predictable gameplay experience. Large Language Models (LLMs) can significantly contribute to evolving NPC behavior in games. By processing vast amounts of text, LLMs can learn patterns and generate responses that are more varied and human-like. They can be utilized to create dynamic dialogue systems, making interactions with NPCs more engaging and less predictable. Furthermore, LLMs can be trained on player feedback and in-game data to continually refine NPC behaviors, making them more attuned to player expectations and game dynamics.
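A minimal sketch of such an LLM-backed NPC dialogue loop might look as follows; the `llm` callable and the persona/memory format are assumptions for illustration, not a particular game engine's API:

```python
class NPC:
    """An NPC whose replies come from an LLM conditioned on a persona
    and a rolling window of recent dialogue."""

    def __init__(self, llm, persona: str):
        self.llm = llm
        self.persona = persona
        self.memory: list[str] = []  # running dialogue history

    def reply(self, player_utterance: str) -> str:
        self.memory.append(f"Player: {player_utterance}")
        prompt = (f"You are an NPC in a game. Persona: {self.persona}\n"
                  + "\n".join(self.memory[-10:])  # recent context only
                  + "\nNPC:")
        response = self.llm(prompt)
        self.memory.append(f"NPC: {response}")
        return response
```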
6.1.2 Human-NPC Interaction (人类与NPC的交互)
The interaction between human players and NPCs is a crucial aspect of the gaming experience. The conventional interaction paradigm is primarily one-dimensional, with NPCs reacting in a preset manner to player inputs. This limitation stifles the potential for a more organic and enriching interaction, akin to human-human interaction within the virtual realm. The advent of LLM and VLM technologies holds the promise of transforming this paradigm. By employing these technologies, gaming systems can analyze and learn from human behavior to provide more human-like interactions. This not only enhances the realism and engagement of the game but also provides a platform for exploring and understanding human-machine interaction in a controlled yet complex setting.
6.1.3 Agent-based Analysis of Gaming (基于智能体的游戏分析)
Gaming is an integral part of daily life, estimated to engage half of the world's population⁴. Additionally, it exhibits a positive impact on mental health⁵. However, contemporary game systems exhibit a deficiency in interactions with human players since their behaviors are primarily hand-crafted by game developers. These pre-programmed behaviors frequently fail to adapt to players' needs. Consequently, there exists a need for new AI systems in games that can analyze player behaviors and furnish appropriate support when necessary. Intelligent interactive systems bear the potential to revolutionize how gamers interact with gaming systems in general. NPCs' interactions with gamers are no longer confined by the restricted rule sets designed by game developers. They have the potential to adapt seamlessly to gamers' experiences, providing timely feedback to enrich the gaming experience and elevate the synergy of human-machine interaction.

Figure 9: GPT-4V can effectively predict the high-level next actions when given the "action history" and a "gaming target" in the prompt. Furthermore, GPT-4V accurately recognized that the player is holding wooden logs in their hand and can incorporate this perceived information into its plan for future actions. Although GPT-4V appears to be capable of predicting some low-level actions (such as pressing 'E' to open the inventory), the model's outputs are not inherently suitable for raw low-level action prediction (including mouse movements) and likely require supplemental modules for low-level action control.
LLMs can serve as a robust tool for analyzing in-game text data, including chat logs, player feedback, and narrative content. They can help identify patterns of player behavior, preferences, and interactions, which can be invaluable for game developers seeking to improve game mechanics and narratives. Additionally, VLMs can parse large quantities of image and video data from gaming sessions to help analyze user intent and actions within the game world. Moreover, LLMs and VLMs can facilitate the development of intelligent agents within games that can communicate with players and other agents in a sophisticated and human-like manner, enhancing the overall gaming experience. Beyond LLMs and VLMs, user input data provides a promising avenue for creating game-playing agents that model perception, game playing, and game understanding by imitating human players. By incorporating a combination of player interactions and feedback, pixel inputs, and natural language planning and understanding, agent models can assist in the continuous improvement of game dynamics, driving a more player-centric evolution of the gaming environment.
6.1.4 Scene Synthesis for Gaming (游戏场景合成)
Scene synthesis is a vital component in the creation and enhancement of immersive gaming environments. It entails the automatic or semi-automatic generation of three-dimensional (3D) scenes and environments within a game. This process includes the generation of terrain, placement of objects, creation of realistic lighting, and sometimes even dynamic weather systems.
Modern games often feature vast, open-world environments. Manually designing these landscapes can be incredibly time-consuming and resource-intensive. Automated terrain generation, often leveraging procedural or AI-driven techniques, can produce complex, realistic landscapes with less manual effort. LLMs and VLMs can utilize internet-scale knowledge to formulate rules for designing non-repeating landscapes that are visually impressive and unique. Additionally, LLMs and VLMs can be used to ensure the semantic consistency and variability of generated assets. Placing objects such as buildings, vegetation, and other elements within a scene in a realistic and aesthetically pleasing manner is crucial for immersion.

Figure 10: Masked video prediction on unseen Minecraft videos. From left to right: the original frame, the masked frame, the reconstructed frame, and the reconstructed frame with patches.
VLMs and LLMs can assist in object placement by adhering to predefined or learned rules and aesthetics, thus speeding up the level design process. VLMs and LLMs can be further trained to understand the principles of design and aesthetics, aiding in the procedural generation of content. They can help formulate rules or guidelines that procedural algorithms can follow to generate objects and scenes that are both visually appealing and contextually appropriate.
Realistic lighting and atmospheric effects are fundamental for creating a believable and engaging gaming environment. Advanced algorithms can simulate natural lighting conditions and dynamic weather effects, enhancing the realism and mood of the scene. LLMs can help develop systems to achieve more realistic lighting and atmospheric effects in several innovative ways. VLMs can analyze vast datasets of real-world lighting and atmospheric conditions to help develop more realistic algorithms for simulating these effects in games. By understanding the patterns and intricacies of natural lighting and weather, these models can contribute to the development of algorithms that mimic reality closely. LLMs and VLMs could also be used to develop systems that adjust lighting and atmospheric effects in real time based on player actions, game states, or external inputs. They can process natural language commands from players to modify the game environment, providing a more interactive and immersive experience.
6.1.5 Experiments and Results (实验与结果)
Zero-shot/Few-shot Learning with LLMs or LVMs. As shown in Fig. 8 and Fig. 9, we used GPT-4V for high-level description and action prediction. Fig. 8 shows qualitative examples of action description generation and editing with GPT-4V. Agent-enhanced text opens up a novel method of generating 3D scenes with game action priors to help improve the naturalness of the scene. Consequently, GPT-4V generates relevant high-level descriptions that are appropriate for the gaming videos.
Small Agent Pretraining Model. To showcase our agent vision-language architecture, we first study its application in a widely used domain for gaming agents by pretraining on Minecraft data. As shown in Fig. 7, given an input agent action, a key frame of the video, and the corresponding text, a standard encoder-decoder can be employed to convert the agent action and image into action text tokens and image patch tokens, and the agent-vision-language decoder then converts these into an action prediction sentence. The overall architecture is depicted in Fig. 7. We evaluate our approach with several Minecraft demonstrations. The Minecraft video data consists of 5-minute clips; the pretraining set contains 78K videos, of which we used 5K videos (6% of the pretraining data) for the first round of pretraining. We train a 250M-parameter model on 16 NVIDIA V100 GPUs for one day and visualize our model outputs in Fig. 10 and Fig. 11. Fig. 10 shows that our relatively small agent architecture can produce reasonable outputs for Minecraft scenes unseen during training. Fig. 11 shows the model's predictions compared to the ground-truth human player actions, indicating potential low-level understanding in our small agent model.

Multi-Agent Infrastructure. As shown in the agent paradigm in Fig. 5, we designed a novel infrastructure for a new gaming scenario called "CuisineWorld" (Gong et al., 2023a). We detail our approach in Fig. 12. Our infrastructure allows for multi-agent collaboration by leveraging GPT-4 as a central planner and works across multiple gaming domains. We investigated our system's multi-agent planning capabilities, and we deployed the infrastructure into real-world video games to demonstrate its multi-agent and human-AI collaboration effectiveness. Additionally, we presented "CuisineWorld", a text-based multi-agent collaboration benchmark that provides a new auto-metric, the Collaboration Score (CoS), to quantify collaboration efficiency.
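For intuition, a central-planner dispatch loop in the spirit of this infrastructure might look like the following sketch; the prompt format, agent interface, and `llm_plan` callable are illustrative assumptions, not the actual MindAgent API:

```python
def dispatch_round(llm_plan, game_state: str, agents: dict) -> None:
    """One planning round: a central LLM assigns an action to every agent."""
    prompt = (
        "You coordinate kitchen agents in a cooking game. Current state:\n"
        f"{game_state}\n"
        f"Agents: {', '.join(agents)}\n"
        "Output one line per agent in the form `agent: action`."
    )
    for line in llm_plan(prompt).splitlines():
        name, _, action = line.partition(":")
        if name.strip() in agents:                  # ignore malformed lines
            agents[name.strip()].act(action.strip())
```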
Please refer to the Appendix for more examples and details on gaming description, high-level action prediction, and GPT-4V prompting. We show examples for Bleeding Edge in Fig. 32 and Appendix B, Microsoft Flight Simulator in Fig. 33 and Appendix C, ASSASSIN's CREED ODYSSEY in Fig. 34 and Appendix D, GEARS of WAR 4 in Fig. 35 and Appendix E, and Starfield in Fig. 36 and Appendix F. We also provide a detailed screenshot of the GPT-4V prompting process used to generate the Minecraft examples in Fig. 31 in Appendix A.
6.2 Robotics (机器人)
Robots are representative agents that necessitate effective interaction with their environment. In this section, we will introduce key elements essential for efficient robotic operation, review research topics where the latest LLM/VLM technologies have been applied, and share findings from our most recent studies.
Visual Motor Control. Visual motor control refers to the integration of visual perception and motor action to execute tasks effectively in a robotic system. This integration is paramount as it enables robots to interpret visual data from their environment and adjust their motor actions accordingly to interact with the environment accurately. For instance, on an assembly line, a robot equipped with visual motor control can perceive the position and orientation of objects and accurately align its manipulator to interact with these objects. This capability is essential for ensuring the precision and effectiveness of robotic operations across a myriad of applications, ranging from industrial automation to assisting the elderly in their daily chores. Moreover, visual motor control enables robots to adapt to dynamic environments where the state of the environment may change rapidly, requiring real-time adjustments to motor actions based on visual feedback.

Figure 12: The MindAgent in-context learning gaming infrastructure. Planning Skill and Tool Use: the game environment requires diverse planning skills and tool use to complete tasks; it generates relevant game information and converts the game data into a structured text format that the LLMs can process. LLM: the main workhorse of our infrastructure, which makes decisions and thus serves as a dispatcher for the multi-agent system. Memory History: a storage utility for relevant information. Action Module: extracts actions from text inputs, converts them into domain-specific language (DSL), and validates the DSL so that it causes no errors during execution.
Additionally, within the context of safe operation, visual information is crucial for detecting execution errors and confirming the pre- and post-conditions of each robot action. In uncontrolled environments, such as unknown domestic settings, robots are more likely to face unexpected outcomes due to unpredictable factors like changing furniture shapes, varied lighting, and slippage. Executing a pre-planned action plan solely in a feedforward manner can pose significant risks in these settings. Therefore, utilizing visual feedback to continually verify outcomes at each step is key to ensuring robust and reliable operation of robotic systems.
Language Conditioned Manipulation. Language-conditioned manipulation entails the ability of a robotic system to interpret and execute tasks based on language instructions. This aspect is particularly crucial for creating intuitive and user-friendly interfaces for human-robot interaction. Through natural language commands, users can specify goals and tasks to robots in a manner similar to human-human communication, thereby lowering the barrier to operating robotic systems. In a practical scenario, for instance, a user could instruct a service robot to "pick up the red apple from the table," and the robot would parse this instruction, identify the referred object, and execute the task of picking it up (Wake et al., 2023c). The core challenge lies in developing robust natural language processing and understanding algorithms that can accurately interpret a wide array of instructions, ranging from direct commands to more abstract directives, and enable the robot to convert these instructions into actionable tasks. Furthermore, ensuring that robots can generalize these instructions across diverse tasks and environments is critical for enhancing their versatility and utility in real-world applications. The use of language input to guide a robot's task planning has gained attention in the context of a robot framework called Task and Motion Planning (Garrett et al., 2021).
Skill Optimization. Recent studies highlight the effectiveness of LLMs in robotic task planning. However, the optimal execution of tasks, especially those involving physical interactions like grasping, requires a deeper understanding of the environment that goes beyond simply interpreting human instructions. For example, robot grasping necessitates precise contact points (Wake et al., 2023e) and arm posture (Sasabuchi et al., 2021) to efficiently execute subsequent actions.
While these elements—precise contact points and arm posture—are intuitive for humans, articulating them through language is challenging. Despite advances in internet-scale VLMs, capturing these nuanced indirect cues from scenes and translating them effectively into robotic skills remains a significant challenge. In response, the robotics community is increasingly focusing on collecting enhanced datasets (e.g., Wang et al., 2023d; Padalkar et al., 2023) or developing methodologies for direct skill acquisition from human demonstrations (Wake et al., 2021a). Frameworks including learning-from-demonstration and imitation learning are leading these developments, playing a crucial role in the optimization of physical skills.
6.2.1 LLM/VLM Agents for Robotics (机器人领域的LLM/VLM智能体)
Recent research has demonstrated the potential of LLMs/VLMs for robotic agents that interact with humans in an environment. Research topics that aim to leverage the latest LLM/VLM technologies include:
Multimodal Systems: Recent research has been actively focusing on developing end-to-end systems that incorporate the latest LLM and VLM technologies as encoders for input information. In particular, there is a significant trend towards modifying these foundation models to process multimodal information (Jiang et al., 2022; Brohan et al., 2023, 2022; Li et al., 2023d; Ahn et al., 2022b; Shah et al., 2023b; Li et al., 2023e). This adaptation aims to guide robotic actions based on both linguistic instructions and visual cues, thus achieving an effective embodiment.
Task Planning and Skill Training: In contrast to end-to-end systems, Task and Motion Planning (TAMP) based systems first compute a high-level task plan and then achieve it with low-level robot control, known as skills.
The advanced language processing abilities of LLMs have demonstrated the capability to interpret instructions and decompose them into robot action steps, greatly advancing task planning technologies (Ni et al., 2023; Li et al., 2023b; Parakh et al., 2023; Wake et al., 2023c). For skill training, several studies have explored the use of LLMs/VLMs for designing reward functions (Yu et al., 2023a; Katara et al., 2023; Ma et al., 2023), generating data to facilitate policy learning (Kumar et al., 2023; Du et al., 2023), or serving as part of a reward function (Sontakke et al., 2023). Together with training frameworks such as RL and IL, these efforts will contribute to the development of efficient robot controllers.
On-site Optimization: Executing long task steps in robotics can be difficult due to unexpected and unpredictable environmental conditions. Therefore, a significant challenge in the field of robotics involves dynamically adapting and refining robotic skills by integrating task plans with real-time environmental data. For instance, (Ahn et al., 2022b) proposed an approach that calculates the feasibility of actions (i.e., affordance) from visual information and compares it with planned tasks. Additionally, there are approaches that focus on enabling LLMs to output the pre-conditions and post-conditions (e.g., states of objects and their interrelationships) of task steps to optimize their execution (Zhou et al., 2023c) and detect pre-condition errors for necessary revisions to the task plan (Raman et al., 2023). These strategies seek to achieve environment-grounded robot execution by integrating environmental information and adjusting the robot’s actions at the task plan or controller level.
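As a rough sketch of this pre-/post-condition checking idea (conditions here are simple predicates over a state dictionary, a simplification for illustration; a real system would ground them in visual observations):

```python
def run_step(step, state: dict) -> None:
    """Execute one task step, verifying conditions before and after."""
    for cond in step.preconditions:            # e.g. "gripper_empty"
        if not state.get(cond, False):
            raise RuntimeError(f"Pre-condition failed: {cond}; replanning needed")
    step.execute(state)                        # low-level skill execution
    for cond in step.postconditions:           # e.g. "holding_cup"
        if not state.get(cond, False):
            raise RuntimeError(f"Post-condition failed: {cond}; replanning needed")
```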
Conversation Agents: In creating conversational robots, LLMs can contribute to natural, context-sensitive interactions with humans (Ye et al., 2023a; Wake et al., 2023f). These models process and generate responses that mimic human conversation, allowing robots to participate in meaningful dialogues. Additionally, LLMs play a significant role in the estimation of conceptual (Hensel et al., 2023; Teshima et al., 2022) and emotional attributes (Zhao et al., 2023; Yang et al., 2023b; Wake et al., 2023d) of utterances. Those attributes facilitate the understanding of human intent and meaningful gesture generation, thus contributing to the naturalness and efficacy of human-robot communication.
Navigation Agents: Robot navigation has a long history of research, focusing on core aspects such as map-based path planning and Simultaneous Localization and Mapping (SLAM) for creating environmental maps. These functionalities have become standard in widely used robot middleware like the Robot Operating System (ROS) (Guimarães et al., 2016).
While classic navigation techniques remain prevalent in many robotics applications, they typically rely on static or pre-created maps. Recently, there has been increased interest in advanced technologies that enable robots to navigate more challenging environments, leveraging breakthroughs in fields like computer vision and natural language processing. One representative task is object navigation (Chaplot et al., 2020a; Batra et al., 2020; Gervet et al., 2023; Ramakrishnan et al., 2022; Zhang et al., 2021), where robots use object names for navigation instead of map coordinates, requiring the visual grounding of object names in the environment. Furthermore, recent attention has been given to technologies that navigate robots in entirely unfamiliar new environments on a zero-shot basis, on top of foundation models, so-called zero-shot object navigation (Gadre et al., 2023; Dorbala et al., 2023; Cai et al., 2023). Additionally, Vision-Language Navigation (VLN) (Anderson et al., 2018a) is a representative task that involves navigating an agent by natural language instructions in previously unseen, real-world environments (Shah et al., 2023a; Zhou et al., 2023a; Dorbala et al., 2022; Liang et al., 2023; Huang et al., 2023b). VLN interprets sentences rather than object names, such as "go to the bathroom on your left," and thus requires a higher-level capability to parse input text (Wang et al., 2019).

Figure 13: Overview of the robot teaching system that integrates a ChatGPT-empowered task planner. The process involves two steps: task planning, where the user employs the task planner to create an action sequence and adjusts the result through feedback as necessary, and demonstration, where the user visually demonstrates the action sequence to provide the information needed for robot operation. The vision system collects visual parameters that will be used for robot execution.
The advent of foundation models contributes to the development of such adaptive, on-the-fly navigation technologies by enhancing the understanding of human language instructions and the visual interpretation of environmental information. More detailed explanations of representative VLN research are provided in 6.2.2.
6.2.2 Experiments and Results (实验与结果)
An accumulating body of evidence suggests that recent VLMs and LLMs have promising capabilities for symbolic task planning (e.g., what-to-do). However, each task requires a low-level control policy (e.g., how-to-do) to achieve successful interaction with the environment. While reinforcement learning and imitation learning are promising approaches to learning policies in a data-driven manner, another promising approach is to obtain the strategy directly from humans through on-site demonstration, an approach called Learning-from-Observation (Wake et al., 2021a; Ikeuchi et al.). In this section, we introduce a study in which we employ ChatGPT for task planning and enrich the plan by parameterizing it with affordance information to facilitate effective and precise execution (Fig. 13).
The pipeline was composed of two modules: task planning and parameterization. In task planning, the system is fed language instructions and a description of the working environment. These instructions, along with a predefined set of robot actions and output specifications, are compiled into a comprehensive prompt provided to ChatGPT, which then generates a sequence of decomposed tasks with their textual descriptions (Fig. 13; left pane). Notably, we employ a few-shot approach, meaning ChatGPT is not trained on this task, which offers an advantage in applicability as it eliminates the need for hardware-dependent data collection and model training. Additionally, the textual descriptions in the output enable the user to check and adjust the results as necessary, a crucial feature for safe and robust operation. Fig. 14 shows qualitative results for an agentic simulation on top of VirtualHome (Puig et al., 2018). The results demonstrate a reasonable task plan and its flexibility in adjusting outputs, indicating the broad applicability of our approach.

Figure 14: Example of adjusting an output sequence through auto-generated feedback. We use an open-source simulator, VirtualHome, for the experiment. Given the instruction "Take the pie on the table and warm it using the stove," the task planner plans a sequence of functions that are provided in VirtualHome. If an error in execution is detected, the task planner corrects its output based on the auto-generated error message.
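A minimal sketch of this auto-generated-feedback loop, assuming a generic chat-style LLM interface (the function names and error format are placeholders, not VirtualHome's or ChatGPT's actual API):

```python
def plan_with_feedback(llm, instruction: str, simulator, max_retries: int = 3):
    """Replan until the simulator executes the whole sequence without error."""
    messages = [{"role": "user", "content": f"Plan actions for: {instruction}"}]
    for _ in range(max_retries):
        plan = llm(messages)               # text plan: one action per line
        error = simulator.execute(plan)    # None on success, else a message
        if error is None:
            return plan
        # Feed the auto-generated error message back for correction.
        messages += [
            {"role": "assistant", "content": plan},
            {"role": "user", "content": f"Execution failed: {error}. Revise the plan."},
        ]
    raise RuntimeError("No executable plan found")
```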
While the task planner guarantees coherency between the task sequences, successful operation in reality requires detailed parameters. For example, grasp type is crucial for carrying a container without spilling its contents, yet such a parameter is often ignored in simulators (see Fig. 14, grasping a pie). In our robot system, therefore, users are asked to demonstrate each action visually (Fig. 13; right pane). The tasks have predefined parameters necessary for execution, which our vision system extracts from the videos (Wake et al., 2021b). Notably, our robotic system is not designed for exact replication of human motions (i.e., teleoperation) but rather to handle varying real-world conditions, such as changes in object locations. Hence, the parameters extracted from human demonstrations encompass not precise motion paths but affordance information that dictates effective environmental movement (e.g., waypoints for collision avoidance (Wake et al., 2023a), grasp types (Wake et al., 2023e), and upper-limb postures (Sasabuchi et al., 2021; Wake et al., 2021a)). Upper-limb posture is critical for robots with high degrees of freedom and is designed so that the robot assumes postures predictable to the humans coexisting with it. The task sequence endowed with affordances is transformed into a sequence of reusable robot skills acquired through reinforcement learning and executed by the robot (Takamatsu et al., 2022).
LLM-empowered task planning can be extended to a more versatile robotic system by integrating it with VLMs. Here, we show an example where we use GPT-4V(ision) to broaden the aforementioned task planner to a multimodal input context (Fig. 15), in which a human performs actions that are intended to be replicated by the robot. In this paper, only part of the prompt is shown; the whole prompt is available at microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts.
This pipeline takes demonstration videos and text, then outputs a sequence of robot actions. A vision analyzer aims to understand the actions performed by humans in the video. We used GPT-4V and provided a prompt to generate text instructions in a style typical of human-to-human communication. Fig. 16 demonstrates how the use of text input allows the user to give feedback on GPT-4V's recognition results for correction purposes. Such a feature, aimed at improving the accuracy of the recognition results, also enables more robust operation.
Next, the scene analyzer compiles the expected work environment into the text information based on the instructions and the first frame of the video data (or an image of the environment). This environmental information includes a list of object names recognized by GPT-4V, the graspable properties of objects, and the spatial relationships between objects. Although these computational processes are a black box within GPT-4V, the information is output based on the knowledge of GPT-4V and the image/text input. Fig. 17 shows the example outputs of our scene analyzer. As shown in the figure, GPT-4V successfully selects the objects that are related to the manipulation. For example, a table is included in the output when the human is relocating a spam container on the table, while the table is ignored for the fridge opening task. These results suggest that the scene analyzer encodes the scene information with respect to the human’s actions. We prompted GPT-4V to explain the results of the object selection process and the reasons behind those choices. In practice, we found this approach resulted in reasonable outputs. Finally, based on the given text instructions and environmental information, the task planner outputs a sequence of tasks (Wake et al., 2023c).

Embodied Agents for Robotics Navigation. Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. Navigation in 3D environments (Zhu et al., 2017a; Mirowski et al., 2016; Mousavian et al., 2018; Hemachandra et al., 2015) is an essential capability of a mobile intelligent system that functions in the physical world. In the past few years, a plethora of tasks and evaluation protocols (Savva et al., 2017; Kolve et al., 2017; Song et al., 2017; Xia et al., 2018; Anderson et al., 2018a) have been proposed, as summarized in (Anderson et al., 2018b). VLN (Anderson et al., 2018a) focuses on language-grounded navigation in real 3D environments. To solve the VLN task, Anderson et al. (2018a) set up an attention-based sequence-to-sequence baseline model. Wang et al. (2018) then introduced a hybrid approach that combines model-free and model-based reinforcement learning (RL) to improve the model's generalizability. Fried et al. (2018) proposed a speaker-follower model that adopts data augmentation, a panoramic action space, and modified beam search for VLN, establishing the then state-of-the-art performance on the Room-to-Room dataset. Extending this prior work, we proposed Reinforced Cross-Modal Matching (RCM) for VLN in (Wang et al., 2019). The RCM model is built upon (Fried et al., 2018) but differs in many significant aspects: (1) RCM combines a novel multi-reward RL with imitation learning for VLN, while the speaker-follower model (Fried et al., 2018) only uses supervised learning as in (Anderson et al., 2018a). (2) The RCM reasoning navigator performs cross-modal grounding rather than temporal attention over single-modality input. (3) The RCM matching critic is similar to the speaker in terms of architecture design, but the former is used to provide a cycle-reconstruction intrinsic reward for both RL and SIL training while the latter is used to augment training data for supervised learning. In (Wang et al., 2019), we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. As shown in Fig. 18, we propose a novel Reinforced Cross-Modal Matching approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic is used to provide an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieved new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously narrows the success-rate gap between seen and unseen environments (from 30.7% to 11.7%). Moreover, in (Wang et al., 2019) we introduce this self-supervised imitation learning method for exploration in order to explicitly address the generalization issue, a problem not well studied in prior work. Concurrent to this work, Thomason et al. (2018); Ke et al. (2019); Ma et al. (2019a,b) study the VLN task from various aspects, and Nguyen et al. (2018) introduce a variant of the VLN task in which the agent finds objects by requesting language assistance when needed. Note that we are the first to propose exploring unseen environments for the VLN task.
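Schematically, the cycle-reconstruction intrinsic reward can be written as follows (a sketch consistent with the description above; see Wang et al. (2019) for the exact formulation, and note that the weighting coefficient $\lambda$ is an assumption of this sketch):

$$R_{\text{intr}}(\mathcal{X}, \tau) = p_\beta(\mathcal{X} \mid \tau), \qquad R_{\text{total}} = R_{\text{ext}} + \lambda \, R_{\text{intr}}$$

where $\mathcal{X}$ is the instruction, $\tau$ the executed trajectory, and $p_\beta$ the matching critic's probability of reconstructing the instruction from the trajectory.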
6.3 Healthcare (医疗健康)
In healthcare, LLMs and VLMs can act as diagnostic agents, patient care assistants, or even therapy aids, but they come with unique challenges and responsibilities. With the tremendous potential for AI agents to improve patient care and save lives comes an equally dangerous possibility that their misuse or hasty deployment could endanger thousands or millions of people worldwide. We discuss some of the promising routes for AI agents within the context of healthcare and also discuss some of the key challenges faced.
Diagnostic Agents. Using LLMs as medical chatbots for patient diagnosis has recently attracted great attention due to the high demand for medical experts and the potential for LLMs to help triage and diagnose patients (Lee et al., 2023). Dialogue agents, especially those that can effectively communicate important medical information to a broad range of people from diverse patient populations, have the potential to provide equitable healthcare access to historically disadvantaged or marginalized groups. Furthermore, doctors and healthcare systems across the world are largely over-burdened and under-resourced, resulting in insufficient access to medical care for hundreds of millions of people worldwide (World Health Organization and World Bank, 2015). Diagnostic agents provide a particularly advantageous pathway to improve healthcare for millions since they can be built with the capability to understand a variety of languages, cultures, and health conditions. Initial results have shown that healthcare-knowledgeable LMMs can be trained by utilizing large-scale web data (Li et al., 2023f). Although an exciting direction, the promise of diagnostic agents does not come without risks. We highlight the risks of hallucination within medical contexts, as well as potential pathways for solutions, in the following section.
Knowledge Retrieval Agents. Within the medical context, model hallucinations are particularly dangerous and may even result in serious patient harm or death, depending on the severity of the error. For instance, if a patient mistakenly receives a diagnosis suggesting they are free of a condition they actually have, it can lead to catastrophic outcomes. These include postponed or inappropriate treatments, or in some cases, a total lack of necessary medical intervention. The gravity of undiagnosed or misdiagnosed conditions can lead to escalated healthcare expenses, extended therapies causing further physical strain, and in extreme scenarios, severe harm or even death. Thus, approaches that can use agents to more reliably retrieve knowledge (Peng et al., 2023) or generate text in a retrieval-based manner (Guu et al., 2020) are promising directions. Pairing a diagnostic agent with a medical knowledge retrieval agent has the potential to significantly reduce hallucinations while simultaneously improving the quality and preciseness of the responses of the diagnostic dialogue agent.
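As a hedged sketch of pairing a diagnostic dialogue agent with a retrieval step in this spirit (the `retriever` interface, corpus, and `llm` callable below are placeholders for illustration, not a clinically validated system):

```python
def grounded_answer(question: str, retriever, llm) -> str:
    """Answer only from retrieved medical passages to curb hallucination."""
    passages = retriever.top_k(question, k=3)        # trusted medical corpus
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Using ONLY the sources below, answer the question and cite the "
        "source index. If the sources are insufficient, say so.\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```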
Telemedicine and Remote Monitoring. Agent-based AI also has great potential within the world of Telemedicine and Remote Monitoring by improving the access to healthcare, improving communications between healthcare providers and patients, as well as improving the efficiency and reducing the costs of frequent doctor-patient interactions (Amjad et al., 2023). Primary care clinicians spend significant amounts of time sifting through patient messages, reports, and emails that are often irrelevant or unnecessary for them to view. There is significant potential to allow for support agents to help triage messages from doctors, patients, and other healthcare providers and to help highlight important messages for all parties. By enabling agentic AI systems to coordinate with patients, clinicians, and other AI agents, there is a massive potential to revolutionize the remote healthcare and digital health industry.
6.3.1 Current Healthcare Capabilities (当前医疗能力)
Image understanding. We demonstrate the current capabilities and limitations of modern multimodal agents such as GPT-4V within the context of healthcare in Fig. 19. We can see that although GPT-4V possesses significant internal knowledge of the equipment and procedures involved in hospital care, it does not always respond to more prescriptive or diagnostic queries by the user.
Video understanding. We investigate the performance of VLM agents for medical video understanding in two contexts. First, we investigate the ability of VLM agents to identify important patient care activities in clinical spaces. Second, we explore the usage of VLMs for more technical videos such as ultrasounds. Specifically, in Figure 20, we demonstrate some of the current capabilities and limitations of GPT-4V for hospital care and medical video analysis.
6.4 Multimodal Agents
The integration of visual and linguistic understanding is crucial for developing sophisticated multimodal AI agents. This includes tasks such as image captioning, visual question answering, video language generation, and video understanding, amongst others. We aim to delve into these visual-language tasks, exploring the challenges and opportunities they present in the context of AI agents.
6.4.1 Image-Language Understanding and Generation (图像-语言理解与生成)
Image-language understanding is a task that involves the interpretation of the visual content in a given image using language and the generation of associated linguistic descriptions. This task is critical to the development of AI agents that can interact with the world in a more human-like manner. Some of the most popular tasks are image captioning (Lin et al., 2014; Sharma et al., 2018; Young et al., 2014; Krishna et al., 2016), referring expression (Yu et al., 2016; Karpathy et al., 2014), and visual question answering (Antol et al., 2015; Ren et al., 2015; Singh et al., 2019).
More recently, knowledge-intensive Visual Question Answering tasks such as OKVQA (Marino et al., 2019), KB-VQA (Wang et al., 2015), FVQA (Wang et al., 2017), and WebQA (Chang et al., 2021) have been introduced. Multimodal agents should be capable of identifying objects in an image, comprehending their spatial relationships, generating accurate descriptive sentences about the scene, and utilizing reasoning skills to handle knowledge-intensive visual reasoning. This requires not just object recognition capabilities, but also a deep understanding of spatial relationships, visual semantics, and the ability to map these visual elements to linguistic constructs while integrating world knowledge.
6.4.2 Video and Language Understanding and Generation
Video-language generation. Video captioning or video storytelling is the task of generating a sequence of coherent sentences for a stream of video frames. Inspired by the successful use of recurrent large foundation models in video and language tasks, variants of agent-driven enhanced models have shown promising results on the task of video-language generation. The fundamental challenge is that the strong performance of neural encoder-decoder models does not generalize well to visual storytelling, because the task requires a full understanding of the content of each image as well as the relations among different frames. One important goal for the field is to create an agent-aware text-synthesis model that can efficiently encode the sequence of frames and generate a topically coherent multi-sentence paragraph.
Video Understanding. Video understanding extends the scope of image understanding to dynamic visual content. This involves interpretation and reasoning about the sequence of frames in a video, often in conjunction with accompanying audio or textual information. An agent should be able to interact with the visual, text, and audio modalities to demonstrate advanced comprehension of video content. Tasks in this domain include video captioning, video question answering, and activity recognition, amongst others. The challenges in video understanding are manifold. They include the temporal alignment of visual and linguistic content, the handling of long sequences of frames, and the interpretation of complex activities that unfold over time. Regarding audio, the agent could process spoken words, background noises, music, and tone of voice to comprehend the mood, setting, and subtleties of the video content.


Previous works have focused on employing existing video-language training data available online for establishing video foundational models (Li et al., 2020, 2021b; Fu et al., 2022; Bain et al., 2021; Zellers et al., 2021, 2022; Fu et al., 2023). Supporting such training pipelines and functionalities is, however, difficult due to the limited and often inconsistent nature of these datasets. Video foundational models are designed with masked and contrastive pretraining objectives and later tuned on their respective tasks. Despite showing remarkable results in multimodal benchmarks, these models encounter difficulties in video-only tasks such as action recognition due to their dependency on limited video-text data built from noisy audio transcriptions. This limitation also leads to the lack of robustness and fine-grained reasoning skills that large language models generally possess.
Other methods, similar to those used in image-language understanding, have drawn on the strong reasoning skills and broad knowledge of large language models to improve different facets of video interpretation. The task of video understanding is simplified by language-only models like ChatGPT and GPT-4 or image-language models like GPT-4V, which treat the audio, video, and language modalities as individual interpretable input data types and position the agents as strong open-source models. For example, (Huang et al., 2023c; Li et al., 2023g) transformed video understanding into a natural language processing (NLP) question-answering formulation by textualizing video content with open-source vision classification/detection/caption models. (Lin et al., 2023) integrated GPT-4V with specialized tools in vision, audio, and speech to facilitate complex video understanding tasks, such as scripting character movements and actions in long-form videos.
Parallel research explores generating scaled datasets from large models, then applying visual instruction tuning (Liu et al., 2023c; Li et al., 2023c; Zhu et al., 2023) on the generated data. A range of audio, speech, and visual expert perception models are subsequently used to verbalize videos. Speech is transcribed with automatic speech recognition tools, and video descriptions and related data are produced with various tagging, grounding, and captioning models (Li et al., 2023g; Maaz et al., 2023; Chen et al., 2023; Wang et al., 2023f). These techniques demonstrate how instruction tuning video-language models on generated datasets may lead to enhanced video-reasoning and communication abilities.
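A minimal sketch of this "verbalize, then reason" pipeline is given below, assuming the `openai-whisper` package for speech recognition. The frame captioner and the LLM are stubbed out, since the cited works each rely on their own perception models; file names and sampled timestamps are illustrative.

```python
# Textualize a video (ASR transcript + frame captions), then hand the
# verbalized content to an LLM as a plain question-answering problem.
import whisper  # pip install openai-whisper

def caption_frame(timestamp: float) -> str:
    """Stub for a vision captioning model run on the frame at `timestamp`."""
    return f"[{timestamp:.0f}s] a person demonstrates a step of the task"

def ask_llm(prompt: str) -> str:
    """Stub for any instruction-following LLM."""
    return "<answer derived from the verbalized video>"

asr = whisper.load_model("base")
transcript = asr.transcribe("demo_video.mp4")["text"]   # hypothetical file
captions = "\n".join(caption_frame(t) for t in (0.0, 10.0, 20.0))  # sampled frames

prompt = (f"Video frame captions:\n{captions}\n"
          f"Audio transcript:\n{transcript}\n"
          "Question: What is the person trying to accomplish?")
print(ask_llm(prompt))
```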
6.4.3 Experiments and Results
- Knowledge-Intensive Models: As introduced in INK (Park et al., 2022) and KAT (Gui et al., 2022a), these models address intensive neural knowledge tasks that incorporate knowledge annotated by humans to support knowledge-intensive retrieval.
- Multimodal-Agents: There has been a growing interest in multimodal language models like Chameleon (Lu et al., 2023) and MM-React (Yang et al., 2023c).
- Visual Instruction Tuning: VLC (Gui et al., 2022b), Mini-GPT4 (Zhu et al., 2023), MPLUG-OWL (Ye et al., 2023b), and LSKD (Park et al., 2023c) generate image-level instruction-tuning datasets.
- Knowledge-Intensive Agent. As shown in Fig. 22 and Fig. 23, knowledge-based visual question answering and vision-language retrieval are challenging tasks in multimodal machine learning that require outside knowledge beyond image contents. Recent studies on large-scale transformers have primarily focused on maximizing the efficiency of the model's parameters to store information. This line of research explores a different aspect: whether multimodal transformers can use explicit knowledge in their decision-making process. Pretraining methods based on transformers have shown remarkable success in implicitly learning knowledge representations across multiple modalities. However, traditional methods, mainly unimodal, have investigated knowledge retrieval and subsequent answer prediction, raising questions about the quality and relevance of the knowledge retrieved and the integration of reasoning processes using both implicit and explicit knowledge. To tackle these issues, we introduce the Knowledge Augmented Transformer (KAT), which outperforms others by 6% on the 2022 OK-VQA open-domain multimodal task. KAT combines implicit knowledge from GPT-3 with explicit knowledge from websites using an encoder-decoder structure, and allows for concurrent reasoning with both knowledge types during answer generation. Furthermore, incorporating explicit knowledge enhances the interpretability of the model's predictions. The code and pre-trained models are available at https://github.com/guilk/KAT.
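To illustrate the kind of concurrent reasoning KAT performs, here is a schematic, fusion-in-decoder-style sketch (not the authors' implementation; see the linked repository for that): each (question, knowledge) pair is encoded separately, and the decoder attends over the concatenated encoder states of both knowledge sources. The T5 checkpoint and the toy knowledge strings are placeholders.

```python
# Concurrent reasoning over implicit and explicit knowledge, schematically.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

tok = T5Tokenizer.from_pretrained("t5-small")            # illustrative checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "question: What is the bird in the image likely eating?"
implicit = "knowledge: an LLM suggests small seeds or insects."   # GPT-3-style
explicit = "knowledge: finches primarily eat seeds."              # web snippet

encoded = []
for knowledge in (implicit, explicit):
    ids = tok(f"{question} {knowledge}", return_tensors="pt").input_ids
    encoded.append(model.encoder(input_ids=ids).last_hidden_state)

fused = torch.cat(encoded, dim=1)            # decoder attends over both sources
out = model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                     max_length=16)
print(tok.decode(out[0], skip_special_tokens=True))
```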
Vision-language Transformer Agent. Next, we introduce the "Training Vision-Language Transformers from Captions" (VLC) model (Gui et al., 2022b), a transformer that has been pretrained exclusively with image-caption pairs. Despite using just a simple linear projection layer for image embeddings, VLC attains competitive results across various vision-language tasks, in contrast to other methods that depend on object detectors or supervised CNN/ViT networks.



Figure 24: The overall architecture of the VLC model (Gui et al., 2022b). Our model consists of three modules: (1) Modality-specific projection. We use a simple linear projection to embed patched images and a word embedding layer to embed tokenized text; (2) Multi-modal encoder. We use a 12-layer ViT (Dosovitskiy et al., 2021) initialized from MAE (He et al., 2022) (ImageNet-1K without labels) as our backbone; (3) Task-specific decoder. We learn our multi-modal representations by masked image/language modeling and image-text matching which are only used during pre-training. We use a 2-layer MLP to fine-tune our multi-modal encoder for downstream tasks. Importantly, we find that the masked image modeling objective is important throughout second-stage pre-training, not only for initialization of the visual transformer.
Through extensive analysis, we explore the potential of VLC as a vision-language transformer agent. For instance, we show that VLC’s visual representations are highly effective for ImageNet-1K classification, and our visualizations confirm that VLC can accurately match image patches to corresponding text tokens. The scalability of performance with more training data highlights the promising potential for developing large-scale, weakly-supervised, open-domain vision-language models.
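The following sketch illustrates module (1) of Figure 24, the modality-specific projections, with a generic transformer encoder standing in for the MAE-initialized 12-layer ViT of module (2). Dimensions follow common ViT-Base conventions and are illustrative rather than taken from the paper.

```python
# VLC-style modality-specific projections feeding a joint multimodal encoder.
import torch
import torch.nn as nn

patch, dim, vocab = 16, 768, 30522
img_proj = nn.Linear(3 * patch * patch, dim)    # (1) linear patch projection
txt_embed = nn.Embedding(vocab, dim)            # (1) word embedding for tokens
encoder = nn.TransformerEncoder(                # (2) stand-in for the ViT
    nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=12)

image = torch.randn(1, 3, 224, 224)             # one RGB image
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)   # 14x14 grid
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

tokens = torch.randint(0, vocab, (1, 12))       # a tokenized caption
multimodal = torch.cat([img_proj(patches), txt_embed(tokens)], dim=1)
fused = encoder(multimodal)                     # joint multimodal states
```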
6.5 Video-language Experiments
To understand the practicality of converting pre-trained image-LLMs for video understanding, we temporally expand and fine-tune InstructBLIP (Dai et al., 2023) for video captioning. Specifically, we expand the visual encoder of InstructBLIP (EVA-CLIP-G (Sun et al., 2023b)) using the same divided space-time attention scheme as Frozen in Time (Bain et al., 2021) and keep the Q-former and LLM (Flan-T5-XL (Chung et al., 2022)) frozen during training. We freeze all spatial layers of the visual encoder while keeping the temporal layers unfrozen during captioning training. This allows our model to take images and videos as input (matching the image-level performance of InstructBLIP). We train on a 5 million video-caption subset of WebVid10M (Bain et al., 2021). We visualize two example outputs in Figure 25. However, existing agents fail to fully comprehend precise, fine-grained visual details in the video content. A similar limitation is seen with visual instruction tuning methods, which lack the general, human-level perception abilities that remain to be solved by multimodal models and agents.
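A minimal sketch of the freezing scheme described above is shown below. The `"temporal"` substring is an assumed naming convention for the layers added by the divided space-time expansion, not the actual parameter names in InstructBLIP.

```python
# Freeze all spatial layers of the visual encoder; train temporal layers only.
def freeze_spatial_layers(visual_encoder):
    for name, param in visual_encoder.named_parameters():
        param.requires_grad = "temporal" in name  # assumed naming convention

# The Q-former and LLM stay frozen throughout training, e.g.:
# q_former.requires_grad_(False); llm.requires_grad_(False)
```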
The instruction-tuned models show promise in accurately summarizing visible actions within videos and identifying actions like "person sitting on a bench" effectively in Fig. 25. However, they sometimes add incorrect details, such as "person smiling to the camera," revealing a shortfall in capturing conversation topics or the video’s ambiance, elements that are readily apparent to human observers. This shortfall underscores another key limitation: the omission of audio and speech modalities that would enrich the video understanding with context, aiding in more accurate interpretation and preventing such misrepresentations. Bridging this gap requires a holistic integration of available modalities, allowing multimodal agents to reach a level of comprehension akin to human perception and ensuring a fully multimodal approach to video interpretation.



Audio-Video-Language Agents with GPT-4V. We then evaluate the capabilities of GPT-4V as a multimodal agent that integrates vision, audio, and speech for a nuanced and precise understanding of videos, following the methodology outlined in (Lin et al., 2023). Results depicted in Fig. 26 compare the performance of various video agents on the task of video summarization. The video-instruction tuned model (Li et al., 2023g) provides accurate content but falls short on comprehensiveness and detail, missing specific actions like the methodical use of a broomstick to measure a tree's height.
To enhance the accuracy of video descriptions, we employ GPT-4V to caption frames, while audio and its transcriptions are sourced from the OpenAI Whisper model. We then prompt GPT-4V to create video summaries using only frame captions and then using both frame captions and audio transcriptions. Initially, we observe that frame captions alone can lead to fabricated events, such as a person biting down on a stick in the third segment. These inaccuracies persist in the video summary, with descriptions like "in a playful twist, he bites down on it while holding it horizontally." Without audio input, the agent cannot correct these captioning errors, resulting in descriptions that are semantically correct but visually misleading.
However, when we provide the audio transcriptions to the agent, it manages to accurately depict the content, even capturing detailed physical actions like "holding the broomstick perpendicular to the body and rotating it downwards." This level of detail is significantly more informative and gives viewers a clearer understanding of the video’s purpose and key details. These findings highlight the importance of integrating audio, video, and language interactions to develop high-quality multimodal agents. GPT-4V emerges as a promising foundation for such advanced multimodal understanding and interaction.
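The following sketch outlines this caption-plus-transcription summarization pipeline under stated assumptions: it uses the `openai` Python SDK and the `openai-whisper` package, and the chat model name is a placeholder for whichever GPT-4V-class endpoint is available. The frame captions are shown inline where a real pipeline would generate them per frame.

```python
# Fuse frame captions and an audio transcript into a video summary.
import whisper
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
transcript = whisper.load_model("base").transcribe("video.mp4")["text"]
frame_captions = ["[0s] a man holds a broomstick upright",
                  "[15s] he rotates it downwards"]        # normally from a VLM

prompt = ("Summarize the video from its frame captions and audio transcript.\n"
          "Captions:\n" + "\n".join(frame_captions) +
          f"\nTranscript:\n{transcript}")
summary = client.chat.completions.create(
    model="gpt-4o",  # placeholder for a GPT-4V-class model
    messages=[{"role": "user", "content": prompt}])
print(summary.choices[0].message.content)
```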
Embodied Multi-modal Agents with GPT-4V. As shown in Fig. 27, we mainly used StackOverflow to get the initial question, then we used the "Bing search" API to retrieve a related video and audio corresponding to the question. Next, we mainly use GPT-4V to get the relevant text information and a high-level video description. On the other hand, we transfer the key-frame audio into a low-level segment description of the key frames via ASR. Finally, we use GPT-4V to generate convincing "hallucinations" that serve as hard negative queries for video question-answering tasks. We support interactions and question answering in the current frame of the video, as well as summarization for the overall high-level video description. During inference, we also combine external knowledge information via web search to improve answering capabilities.
The main prompt information for GPT-4V is described as below. The entire prompt is indented for clarity; it is over one page long.
GPT-4V is an assistant to provide descriptive, informative, and fully comprehensive details of the video for the visually impaired who can hear the video but cannot see it. The job is to create high-quality, dense descriptions of the video by synthesizing the given annotations and output them as JSON. Specifically, GPT-4V will be given the original query used to search the video, the video title, description, audio transcription, and potentially noisy descriptions for specific times in the video. Different segments of the same video are annotated as "[time start - time end (in seconds)] 'text'". Utilize the transcriptions and descriptions together to reason about the exact details and visual demonstrations that might be happening in the video. GPT-4V will combine or segment the timestamps as necessary to provide the best segmentation of the video.
Expectations for GPT-4V Output:
1. Action-Oriented Descriptions: Prioritize plausible actions, motions, and physical demonstrations that the audio implies, enriching your narrative with dynamic visual cues.
2. Complete Video Coverage: Provide a continuous and consistent audio-descriptive experience that covers every moment of the video’s duration, ensuring no content is left undescribed.
3. Concise Segmentation: Construct your descriptions in focused, succinct segments of 1-2 sentences each to effectively communicate visual actions without overwhelming detail.
4. Contextual Audio-Visual Synthesis: Seamlessly blend the spoken audio content with inferred visual elements to form a narrative that reflects potential onscreen activities.
5. Imaginative and Plausible Speculation: Infuse your descriptions with creative yet believable visual details that correspond with the audio, enhancing scene comprehension.
6. Accurate Timecode Correspondence: Align your descriptive segments with corresponding timecodes, ensuring that speculative visual details synchronize with the audio narrative's timeline.
7. Confident Narrative Delivery: Present the descriptions with assurance, as though the speculated visuals are occurring, to instill confidence in the listener.
8. Omit Implausible Details: Exclude descriptions of objects or events that do not reasonably fit within the context established by the audio and visual information provided.
The final output should be structured in a JSON format containing a list of dictionaries, each detailing a segment of the video.
[ { "start": <start-time-in-seconds>, "end": <end-time-in-seconds>, "text": "<Your detailed single-sentence, audio-visual description here>" } ]
For MC Creation: our task is to create multiple-choice questions for video-to-text retrieval tasks that cannot be trivially solved by looking at the title and reading through audio transcriptions. To do so, we will be given the original query used to get the video, the description, the audio transcription, and potentially noisy descriptions for specific times in the video.
• Format of audio transcription: - [start - end time in seconds] "transcription"
• Format of noisy description: - [time in seconds] "description"
We kindly ask GPT-4V to generate four queries, where the primary query is aligned with the video content and the other three negatives are subtly different from the primary one. Selecting the primary one should not simply involve listening to audio transcriptions, e.g., when the original text query is contained in the audio transcriptions. The negatives should be closely related but not fully aligned with the video content, requiring visual understanding of the video to differentiate. For example, modify the semantics in a nuanced way so that one needs to watch the video, rather than just listen, to select the original query. Compile four queries as caption-like statements, with the first one being the rephrased original.
Think step by step about how to come up with negative statements using the information from the video. Justify why the negative queries are incorrect but still compelling choices that demand a nuanced understanding of the video, and why humans would not accidentally choose the negatives over the original query.
Finally, we present the work in the following format of analyses and 4 queries. No need to generate how you translated the original query.
• Video Analysis: xxx
• Queries: [query1, query2, query3, query4]
• Justification: xxx
6.6 Agent for NLP
6.6.1 LLM agent
Recognizing task directives and taking action has been a fundamental challenge in interactive AI and natural language
processing for decades. With the recent advances in deep learning, there is a growing interest in studying these areas jointly to improve human-agent collaboration. We identify three specific directions, among others, to improve language-grounded agents:
• Tool use and querying from knowledge bases. This direction emphasizes the importance of integrating external knowledge bases, web search, or other helpful tools into the reasoning processes of AI agents. By leveraging structured and unstructured data from various sources, agents can enhance their understanding and provide more accurate and context-aware responses. Furthermore, it fosters the agent’s ability to proactively seek out information when faced with unfamiliar scenarios or queries, ensuring more comprehensive and informed responses. Examples include Toolformer (Schick et al., 2023) and Retrieve What You Need (Wang et al., 2023g).
• Improved agent reasoning and planning. Enhancing the agent’s ability to reason and plan is pivotal for effective human-agent collaboration. This involves the development of models that can understand complex instructions, infer user intentions, and predict potential future scenarios. This can be accomplished by asking the agent to reflect on past actions and failures as in ReAct (Yao et al., 2023a), or by structuring the agent thought process as a form of search (Yao et al., 2023b). By simulating different outcomes and assessing the ramifications of various actions, agents can make more informed context-aware decisions.
• Incorporating system and human feedback. AI agents can frequently operate in two primary contexts: environments that provide explicit signals about the effectiveness of their actions (system feedback), and settings where they collaborate with humans who can offer verbal critiques (human feedback). This direction underscores the need for adaptive learning mechanisms that allow agents to refine their strategies and rectify mistakes, such as in AutoGen (Wu et al., 2023). The ability to continuously learn and adapt from diverse feedback sources ensures that agents remain helpful and aligned with user needs. (A minimal sketch combining these directions follows this list.)
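The sketch below combines the three directions above in a ReAct-style loop: the agent interleaves reasoning with tool calls and folds observations (system feedback) back into its context. `call_llm` and both tools are toy placeholders, not the cited systems.

```python
# ReAct-style loop: reason, act with tools, observe, repeat.
def call_llm(context: str) -> str:
    """Placeholder LLM; should emit 'THINK: ...', 'ACT: tool(arg)', or 'ANSWER: ...'."""
    return "ANSWER: 42"

TOOLS = {
    "search": lambda q: f"<top search snippets for {q!r}>",      # stubbed web search
    "calc": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy calculator only
}

def react_agent(task: str, max_steps: int = 5) -> str:
    context = f"Task: {task}"
    for _ in range(max_steps):
        step = call_llm(context)
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("ACT:"):
            name, arg = step.removeprefix("ACT:").strip().rstrip(")").split("(", 1)
            step += f"\nOBSERVATION: {TOOLS[name.strip()](arg)}"  # system feedback
        context += "\n" + step                                    # reflect on history
    return "gave up"
```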
6.6.2 General LLM agent
Recognizing and understanding agent content and natural language has been a fundamental challenge in interactive AI and natural language processing for decades. With the recent advances in deep learning, there is a growing interest

Figure 28: The training recipe used to train the Alpaca model (Taori et al., 2023). At a high level, existing LLMs are used to generate a large pool of instruction-following examples from a smaller set of seed tasks. The generated instruction-following examples are then used to instruction-tune an LLM where the underlying model weights are available.
in studying these two areas jointly for a deep understanding of both agent planning and human feedback for knowledge inference and natural language generation. These are the key components of many human-machine interaction agents, such as "AutoGen" (Wu et al., 2023) and "Retrieve What You Need" (Wang et al., 2023g).
6.6.3 Instruction-following LLM agents
Furthermore, the creation of LLM agents that can be trained to effectively follow human instructions has become an important area of research. Initial models used human feedback to train a proxy reward model to simulate human preferences, through a process known as Reinforcement Learning with Human Feedback (RLHF) (Ouyang et al., 2022). This process produced models such as InstructGPT and ChatGPT. To train instruction-following LLM agents more efficiently without needing human labels, researchers developed a method for instruction-tuning that trains the LLM agent directly on instruction/response pairs, either generated by humans, as with Dolly 2.0, or automatically from LLMs, as with Alpaca (Taori et al., 2023). We show the overall Alpaca training pipeline in Figure 28.
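To make the recipe in Figure 28 concrete, here is a compact, hypothetical sketch of instruction-tuning on instruction/response pairs with Hugging Face `transformers`. The tiny GPT-2 model, the prompt template, and the two-example dataset are illustrative stand-ins for a real Alpaca-style setup, which uses far larger models and datasets.

```python
# Minimal instruction-tuning on instruction/response pairs.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                      # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

pairs = [("List three fruits.", "Apples, oranges, and pears."),
         ("Define gravity.", "Gravity is the force attracting masses.")]
texts = [f"### Instruction:\n{i}\n### Response:\n{r}{tok.eos_token}"
         for i, r in pairs]

enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
dataset = [{"input_ids": enc.input_ids[i],
            "attention_mask": enc.attention_mask[i],
            "labels": enc.input_ids[i]}             # causal LM target = input
           for i in range(len(texts))]

Trainer(model=model,
        args=TrainingArguments("out", per_device_train_batch_size=2,
                               num_train_epochs=1, report_to=[]),
        train_dataset=dataset).train()
```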
6.6.4 Experiments and Results
Despite the growing adoption of conversational and self-feedback systems, these forms of AI still do not perform well at generating factually correct responses from their own implicit knowledge, and therefore often use external tools like web search and knowledge retrieval mechanisms at inference time to augment their responses. Addressing this would help create more engaging experiences for users in many real-life applications. In social conversations (such as those on social media platforms like Instagram and Facebook), or on Q&A websites (such as Ask or Quora), people usually engage with others through a series of comments and by web-searching for information and knowledge relevant to the discussion. Thus, the task of generating conversational turns in this context is not simply to bootstrap upon traditional NLP models and tasks, but to use agents to generate dialogue through intelligent behaviors that reflect knowledge search and acquisition (Peng et al., 2023). In this way, intelligent agents for

NLP tasks extend the task description and improve the interpretability of the response by adding an explicit knowledge search and retrieval step during dialogue. Incorporating these web search and retrieval agents as feedback during dialogue will help to further deepen the social interactions between humans and agents (Wang et al., 2023e). As shown in Fig. 29, we introduced a new modeling paradigm for transformer language models that detects and extracts important logical structures and information from input texts and then integrates them into the input embeddings through carefully designed multi-layer hierarchical logical projections, infusing logical structures into pre-trained language models as one kind of NLP agent. (Wang et al., 2023e) propose a novel approach to construct logic-aware input embeddings for transformer language models through a combination of logic detection, logic mapping, and hierarchical logical projections, and then develop a corresponding new modeling paradigm that can upgrade all existing transformer language models into logical transformers to consistently boost their performance. The proposed logical transformer agents consistently achieve superior performance over their baseline transformer models through a deeper understanding of the logical structures of texts. To human users, it is often these aspects that are more important for delivering a meaningful and interesting conversation via agent-based coordination between dialogue and information retrieval. Delving deep into natural language processing, this topic will discuss the advancements and challenges in making LLMs more agentic and better suited for various language-centered tasks.
An open-domain question answering (QA) system usually follows a retrieve-then-read paradigm, in which a retriever is used to retrieve relevant passages from a large corpus, and a reader then generates answers based on the retrieved passages and the original question. In (Wang et al., 2023g), we propose a simple and novel mutual learning framework to improve the performance of retrieve-then-read-style models via an intermediate module, the knowledge selector agent, which we train with reinforcement learning. The fine-grained knowledge selector is incorporated into the retrieve-then-read paradigm, and its goal is to construct a small subset of passages that retain question-relevant information. As shown in Figure 30, the knowledge selector agent is trained as a component of our novel mutual learning framework, which iteratively trains the knowledge selector and the reader. We adopt a simple and novel approach employing policy gradients to optimize the knowledge selector agent, using feedback from the reader to train it to select a small and informative set of passages. This approach avoids brute-force search or manually designed heuristics, without requiring any annotated query-document pairs for supervision. We show that iteratively training the reader and the knowledge selector agent leads to better predictive performance on public open-domain question answering benchmarks.
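A schematic of the policy-gradient training signal described above might look as follows. The passage encodings, the selector architecture, and the reward stub are all illustrative; the actual mutual learning framework in (Wang et al., 2023g) differs in detail.

```python
# REINFORCE-style update for a toy knowledge selector.
import torch
import torch.nn as nn

selector = nn.Linear(128, 1)                      # scores passage encodings
opt = torch.optim.Adam(selector.parameters(), lr=1e-4)

def reader_reward(selected_idx) -> float:
    """Stub: log-likelihood the reader assigns to the gold answer."""
    return float(len(selected_idx))               # placeholder signal

passages = torch.randn(20, 128)                   # encoded candidate passages
probs = torch.sigmoid(selector(passages)).squeeze(-1)
dist = torch.distributions.Bernoulli(probs)
pick = dist.sample()                              # keep/drop each passage
reward = reader_reward(pick.nonzero().squeeze(-1))

loss = -(dist.log_prob(pick).sum() * reward)      # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```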

7 Agent AI Across Modalities, Domains, and Realities
7.1 Agents for Cross-modal Understanding
Multi-modal understanding is a significant challenge for creating generalist AI agents due to the lack of large-scale datasets that contain vision, language, and agent behavior. More generally, training data for AI agents is often modality specific. This results in most modern multi-modal systems using a combination of frozen submodules. Some notable examples are Flamingo (Alayrac et al., 2022), BLIP-2 (Li et al., 2023c), and LLaVA (Liu et al., 2023c), all of which utilize a frozen LLM and frozen visual encoder. These submodules are trained individually on separate datasets, and then adaptation layers are trained to encode the visual encoder into the LLM embedding space. In order to make further progress for cross-modal understanding for AI agents, it is likely that the strategy of using frozen LLMs and visual encoders will need to change. Indeed, RT-2, a recent visual-language model that is capable of taking actions within the domain of robotics showed significantly improved performance when jointly tuning the visual encoder and LLM for robotics and visual-language tasks (Brohan et al., 2023).
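A minimal sketch of the frozen-submodule pattern described above is shown below: a frozen visual encoder and a frozen LLM are bridged by a small trainable adapter (a linear layer here, where BLIP-2 uses a Q-Former). All modules are toy stand-ins with illustrative dimensions.

```python
# Frozen submodules bridged by a trainable adaptation layer.
import torch
import torch.nn as nn

vision_encoder = nn.Linear(2048, 1024)    # stand-in for a pretrained ViT
llm_embed_dim = 4096                      # stand-in for the LLM's embedding size
adapter = nn.Linear(1024, llm_embed_dim)  # the only trainable module

for p in vision_encoder.parameters():     # freeze the pretrained submodule
    p.requires_grad = False

image_features = vision_encoder(torch.randn(1, 2048))
soft_prompt = adapter(image_features)     # visual tokens in LLM embedding space
# A frozen LLM would now consume `soft_prompt` prepended to its text embeddings.
```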
7.2 Agents for Cross-domain Understanding
A key challenge for creating generalist agents is the distinctive visual appearance and disparate action spaces across different domains. Humans possess the capability to interpret images and videos from various sources, including the real world, video games, and specialized domains such as robotics and healthcare, once they become familiar with the specific details of these areas. However, existing LLMs and VLMs often demonstrate significant differences between the data they were trained on and the varied domains in which they are applied. And notably, training agent models to predict specific actions presents a considerable challenge when trying to develop a single policy that can effectively learn multiple control systems across domains. Generally, the approach most modern works take when applying systems within specific domains is to start from a pretrained foundation model and then finetune a separate model for each specific domain. This fails to capture any commonalities between domains and results in a smaller total set of data used for training instead of leveraging each domain’s data.
7.3 Interactive agent for cross-modality and cross-reality
Developing AI agents that can successfully understand and perform tasks across different realities is an on-going challenge that has seen some recent success for image and scene generation (Huang et al., 2023a). In particular, it is challenging for agents to simultaneously understand real-world and virtual reality environments due to their visual dissimilarities and separate environment physics. Within the context of cross-reality, Sim to Real transfer is a particularly important problem when using simulation-trained policies for real-world data, which we discuss in the next section.
7.4 Sim to Real Transfer
Sim-to-real transfer encompasses techniques that enable models trained in simulation to be deployed in the real world. Embodied agents, especially those based on RL policies, are typically trained in simulated environments. These simulations do not fully replicate the characteristics of the real world (e.g., disturbances, light, gravity, and other physical properties). Due to this discrepancy between simulation and reality, models trained in simulation often struggle to perform well when applied in the real world. This issue is known as the "sim-to-real" problem. To solve this problem, several approaches can be taken:
- Domain randomization: domain randomization is a technique that trains a model while randomly varying parameters within a simulation environment (e.g., object appearance, sensor noise, and optical properties) in anticipation of the uncertainties and variations of the real world (Tobin et al., 2017); a minimal sketch of this idea follows the list. For instance, in the context of training RL-based grasping skills, introducing randomness in the shapes of objects can lead to a policy capable of adapting to objects with somewhat different shapes (Saito et al., 2022).
- Domain adaptation: Domain adaptation, or domain transfer, is a technique that bridges the gap between simulated and real-world domains by training models with a large number of simulated images and a smaller set of real-world images. In practical settings, unpaired image-to-image translation methods such as CycleGAN (Zhu et al., 2017b) are employed due to the difficulty of preparing paired images across domains. Several enhanced versions exist for reinforcement learning, including RL-CycleGAN (Rao et al., 2020), and for imitation learning, such as RetinaGAN (Ho et al., 2021).
- Improvement of simulation: Realistic simulation is key for sim-to-real transfer. Part of this effort is achieved by system identification techniques (Zhu et al., 2017c; Allevato et al., 2020), which aim to identify simulation parameters that mimic real-world environments. Additionally, the use of photorealistic simulators is effective for image-based reinforcement learning (Martinez-Gonzalez et al., 2020; Müller et al., 2018; Shah et al., 2018; Sasabuchi et al., 2023).
- Sim-to-real transfer remains a central challenge in the study of embodied agents, and approaches keep evolving. Both theoretical and empirical research are essential to advance these technologies further.
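As referenced in the domain randomization bullet above, here is a minimal sketch of the idea. `make_sim`, the parameter ranges, and the omitted RL update are placeholders for an actual simulator and training loop.

```python
# Domain randomization: resample physical/visual parameters every episode
# so the learned policy must cover the spread of real-world conditions.
import random

def make_sim(friction: float, light: float, sensor_noise: float):
    """Stub simulator factory; a real one would configure physics/rendering."""
    return {"friction": friction, "light": light, "noise": sensor_noise}

def randomized_episode():
    sim = make_sim(friction=random.uniform(0.4, 1.2),
                   light=random.uniform(0.5, 2.0),
                   sensor_noise=random.gauss(0.0, 0.02))
    return sim  # run the policy in `sim` and collect a trajectory here

for _ in range(1000):            # each episode sees a differently perturbed world
    trajectory = randomized_episode()
    # policy_update(trajectory)  # standard RL step, omitted
```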
8 Continuous and Self-improvement for Agent AI
Currently, foundation-model-based AI agents have the capacity to learn from multiple different data sources, which allows for more flexible sources of training data. Two key consequences of this are that (1) user and human-based interaction data can be used to further refine and improve the agent, and (2) existing foundation models and model artifacts can be used to generate training data. We discuss each of these in more detail in the following sections, but we note that since current AI agents are largely tied to existing pretrained foundation models, they generally do not learn from continuous interaction with their environments. We think this is an exciting future direction, and initial work by Bousmalis et al. has shown that self-improving agents for robotic control are able to continuously learn and improve through environmental interactions without supervision (Bousmalis et al., 2023).
8.1 Human-based Interaction Data
The core idea behind using human-based interaction data is to leverage a large number of agent-human interactions to train and improve future iterations of the agent. There are several strategies used to improve agents from human-agent interactions.
• Additional training data. Perhaps the simplest usage of human-agent interactions is to use the interaction examples themselves as training data for a future iteration of the agent. This generally requires filtering strategies to differentiate successful agent examples from unsuccessful interaction examples (see the sketch after this list). Filtering can be rules-based (e.g., reaching some desired end goal state), model-based (e.g., classifying successful vs. unsuccessful interactions), or manually selected after a post-hoc inspection and/or modification of the interaction examples.
• Human preference learning. During interaction with the user, the agent system can prompt the user with several different model outputs and allow the user to select the best output. This is commonly used by LLMs like ChatGPT and GPT-4, whereby users can select one output (out of several) that aligns best with their preferences.
• Safety training (red-teaming). Red-teaming within the context of Agent AI refers to having a dedicated team of adversaries (either human or computer) that seek to exploit and expose weaknesses and vulnerabilities within the Agent AI system. Although adversarial in nature, red-teaming is commonly used as a means for understanding how to improve AI safety measures and reduce the occurrence of harmful outputs. The core principle is to discover consistent methods for inducing unwanted agent outputs so that the model can be trained on data that explicitly corrects this behavior.
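As referenced in the first bullet above, below is a small sketch of rule-based and model-based filtering of interaction logs. `reached_goal`, the keyword-based classifier stub, and the log schema are all hypothetical.

```python
# Gate which human-agent interactions become training data.
def reached_goal(interaction: dict) -> bool:          # rule-based filter
    return interaction.get("final_state") == interaction.get("goal_state")

def classifier_score(interaction: dict) -> float:     # model-based filter (stub)
    return 0.9 if "thank" in interaction.get("user_reply", "").lower() else 0.3

def filter_interactions(logs: list[dict], threshold: float = 0.5) -> list[dict]:
    return [ex for ex in logs
            if reached_goal(ex) or classifier_score(ex) >= threshold]
```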
8.2 Foundation Model Generated Data
With the advent of powerful foundation model artifacts produced by academia and industry, a variety of methods have been developed to extract and generate meaningful training data from these artifacts using prompting and data-pairing techniques.
• LLM instruction-tuning. Methods for generating instruction-following training data from LLMs have allowed for the finetuning of smaller, open-source models based on the outputs of larger proprietary LLMs (Wang et al., 2022b). For example, Alpaca (Taori et al., 2023) and Vicuna (Zheng et al., 2023) are LLMs based on the open-source LLaMA family (Touvron et al., 2023) that have been tuned on various outputs from ChatGPT and human participants. This method of instruction tuning can be viewed as a form of knowledge distillation, where the larger LLM serves as a teacher model to a smaller student model. Importantly, although LLM instruction-tuning has been shown to transfer the writing style and some instruction-following capabilities of the teacher model to the student model, significant gaps still exist between the factuality and capabilities of the teacher and student models (Gudibande et al., 2023).
• Vision-language pairs. A number of recent works have sought to increase the amount and diversity of pretraining data available to visual-language models by automatically generating captions and other text for visual content (see the sketch after this list). For example, LLaVA (Liu et al., 2023c) uses 150,000 examples of instruction-following behavior from textual and visual inputs that are mainly LLM-generated. Other work has shown that using VLMs to re-caption images can improve the training data and the subsequent quality of image generation models (Segalis et al., 2023). Within the realm of video understanding, using VLMs and LLMs to recaption videos has been shown to improve the performance and quality of subsequent VLMs trained on the recaptioned videos (Wang et al., 2023f; Zhao et al., 2022).
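As referenced in the vision-language bullet above, the sketch below re-captions a folder of images with an off-the-shelf captioner to produce (image, caption) pretraining pairs. The BLIP checkpoint and paths are illustrative choices, not the pipelines of the cited works.

```python
# Generate vision-language pairs by re-captioning images with a VLM.
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

pairs = []
for path in Path("images/").glob("*.jpg"):        # hypothetical image folder
    inputs = processor(Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    pairs.append((str(path), processor.decode(out[0], skip_special_tokens=True)))
# `pairs` can now serve as weakly supervised pretraining data.
```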
9 Agent Dataset and Leaderboard
To accelerate research in this domain, we propose two benchmarks, for multi-agent gaming and for agentic visual-language tasks, respectively. We will release two new datasets, "CuisineWorld" and "VideoAnalytica", along with a set of baseline models, encouraging participants to explore new models and systems and to submit their results on the test set of our leaderboard.
9.1 “CuisineWorld” Dataset for Multi-agent Gaming
CuisineWorld is a text-based game reminiscent of Overcooked! It offers a platform for AI-powered agents to cooperate and play in tandem. This dataset will test the collaboration efficiency of multi-agent systems, offering insights into how well LLMs and other systems can work together in dynamic scenarios. In particular, the dataset will focus on how well the agents understand goals, and how well the agents can coordinate among themselves. Two types of modes are supported in this dataset: a centralized dispatcher mode and a decentralized mode. Participants can choose a play mode and make a submission to our leaderboard.
9.1.1 Benchmark
For our competition, we will release the CuisineWorld benchmark, which includes a text interface with extendable task definition files, as well as interfaces for multi-agent interaction and human-machine interaction. We introduce the gaming interaction task, in which the goal is to generate relevant, appropriate multi-agent collaboration strategies that maximize collaboration efficiency. We evaluate collaboration efficiency with the proposed evaluation metric, CoS.
The “CuisineWorld" dataset was collected by Microsoft, UCLA, and Stanford University. The goal of the competition is to explore how different, existing and novel, grounded-LLM and interactive techniques perform with this benchmark and establish strong baselines for the task of multi-agent gaming infrastructure.
The dataset of CuisineWorld includes:
- A selection of well-defined multi-agent collaboration tasks.
- An API system to facilitate agent interactions.
- An automatic evaluation system.
(The link for downloading the dataset will soon be made available and this article will be updated to include it here.)
9.1.2 Task
• We provide a dataset and a related benchmark, called Microsoft MindAgent, and correspondingly release the dataset "CuisineWorld" to the research community.
• We will provide benchmarks to evaluate and rank the submitted “MindAgent" algorithms. We will also provide baseline results generated using popular infrastructures.
9.1.3 Metrics and Judging
The quality of multi-agent collaboration efficiency is determined by the new CoS auto-metric (from MindAgent (Gong et al., 2023a)). The final rating of our metric is calculated as an average over the evaluated collaboration efficiency metrics of the multi-agent system on all tasks. Human evaluators will be asked to rate individual responses as well as provide subjective judgement of the engagement, breadth, and overall quality of the users' interactions with the agents.
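A sketch of the final rating computation, as we understand it, is shown below. The per-task CoS values are treated as given by the MindAgent evaluation, and the example task names and scores are invented.

```python
# Final rating: mean of per-task collaboration-efficiency (CoS) values.
def overall_rating(cos_per_task: dict[str, float]) -> float:
    """Average the CoS metric over all evaluated tasks."""
    return sum(cos_per_task.values()) / len(cos_per_task)

print(overall_rating({"make_salad": 0.72, "cook_soup": 0.65, "plate_dish": 0.81}))
```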
9.1.4 Evaluation
• Automated Evaluation. We plan to release a leaderboard starting on the release date (TBA); registered participants will be asked to submit their results on the task associated with the dataset "CuisineWorld" (our publicly released dataset for the leaderboard). Submission of results will be closed on the end date (TBA). Each team will be required to submit their generated results on the testing set for automated evaluation of the CoS metric.
• Human Evaluation on our leaderboard. The leaderboard participants will need to provide a submission file generated by evaluation scripts locally. We will use the evalAI system to check the submission file and optionally rerun the code for top challenge contenders. Therefore, teams must also submit their code with a Readme file on how to run their code. Human evaluation will be performed by the organization team.
• Winner Announcement. We will make an announcement of the winners and post the final ratings of the submissions on our leaderboard.
9.2 Audio-Video-Language Pre-training Dataset
We introduce VideoAnalytica: a new benchmark for analytical video demonstration comprehension. VideoAnalytica focuses on leveraging video demonstrations as aids to better understand complex, high-level reasoning embedded within long-formed instructional videos. The objective is to evaluate the cognitive reasoning abilities of video language models, pushing them beyond mere recognition tasks and basic comprehension, towards a more sophisticated and nuanced understanding of videos. Crucially, VideoAnalytica emphasizes the integration of multiple modalities, such as audio, video, and language, as well as the ability of models to apply domain-specific knowledge, to contextualize and interpret the information presented in the videos. Specifically, VideoAnalytica involves two primary tasks:
1. Video Text Retrieval: This task involves accurately retrieving relevant text from the instructional videos. The challenge lies in distinguishing between relevant and irrelevant information, thus requiring a deep understanding of the video content, and analysis of the demonstration to retrieve the correct query. To further increase the complexity of these tasks, we introduce hard negatives into our datasets generated by large language models. We run human validation on the generated negatives and remove instances that make the task invalid and unfair (e.g. negatives being valid).
2. Video Assisted Informative Question Answering: This task requires the model to answer questions based on the information extracted from the videos. The focus is on complex questions that require analytical reasoning and a thorough comprehension of the video demonstration.
To facilitate the development of an audio-video-language agent for analytical video understanding, we introduce a benchmark leaderboard for the two tasks from VideoAnalytica.
• The leaderboard participants will need to submit their solutions for evaluation. The evaluation will be based on the model’s performance on the two tasks, and the results will be displayed on the leaderboard. Participants are required to submit their code, along with a detailed explanation of their approach and methodology.
• Ethical considerations: The leaderboard focuses on understanding and interpreting video content, which could potentially be used in surveillance or other privacy-invasive applications. Therefore, it’s crucial to consider the ethical implications and potential misuse of the technology. We encourage participants to consider these aspects in their submissions and promote the ethical use of AI.
### 9 代理数据集与排行榜(Agent Dataset and Leaderboard)
为了加速该领域的研究,我们提出了两个基准测试,分别针对多代理游戏和具身视觉语言任务。我们将发布两个新的数据集——“CuisineWorld”和“VideoAnalytica”,以及一组基线模型,鼓励参与者探索新模型、系统,并在我们的排行榜测试集上提交结果。
---
### 9.1 “CuisineWorld” 数据集用于多代理游戏(CuisineWorld Dataset for Multi-agent Gaming)
**CuisineWorld** 是一款基于文本的游戏,让人联想到《Overcooked!》。它为由AI驱动的代理提供了一个合作和协同游戏的平台。该数据集将测试多代理系统的协作效率,揭示大型语言模型(LLMs)和其他系统在动态场景中如何协同工作。特别是,数据集将关注代理对目标的理解程度,以及它们之间协调的能力。该数据集支持两种模式:集中调度模式(centralized dispatcher mode)和去中心化模式(decentralized mode)。参与者可以选择一种游戏模式并向我们的排行榜提交结果。
---
#### 9.1.1 基准测试(Benchmark)
在我们的竞赛中,我们将发布一个基准测试,即 **CuisineWorld 基准测试**,其中包括一个包含可扩展任务定义文件的文本接口、一个多代理交互接口以及人机交互接口。我们引入了游戏交互任务,其目标是生成相关、适当的多代理协作策略,以最大化协作效率。我们使用提出的评估指标 **CoS** 来评估协作效率。
**CuisineWorld** 数据集由微软、加州大学洛杉矶分校(UCLA)和斯坦福大学收集。竞赛的目标是探索不同的现有和新颖的具身语言模型(grounded-LLM)及交互技术在这一基准上的表现,并为多代理游戏基础设施任务建立强大的基线。
**CuisineWorld** 数据集包括:
- 一系列明确定义的多代理协作任务(multi-agent collaboration tasks)。
- 一个促进代理交互的API系统(API system)。
- 一个自动评估系统(automatic evaluation system)。
(数据集下载链接即将发布,本文将更新并在此处包含链接。)
---
#### 9.1.2 任务(Task)
• 我们提供了一个数据集及相关基准测试,称为 **Microsoft MindAgent**,并相应地向研究社区发布了名为“CuisineWorld”的数据集。
• 我们将提供基准测试以评估和排名提交的“MindAgent”算法,并使用流行的基础设施生成基线结果。
---
#### 9.1.3 指标与评判(Metrics and Judging)
多代理协作效率的质量由新的“cos”自动指标(来自 **MindAgent (Gong et al., 2023a)**)决定。我们最终的评分是通过计算多代理系统在所有任务中的协作效率指标的平均值得出的。人类评估者将被要求对单个响应进行评分,并提供用户与代理交互的参与度、广度和整体质量的主观评价。
---

#### 9.1.4 Evaluation

• **Automated Evaluation**
We plan to open a leaderboard on the release date (TBD), from which registered participants can submit their results on the tasks associated with the "CuisineWorld" dataset; submissions close on the end date (TBD). Each team is required to submit its generated results on the test set for automated evaluation of the "CoS" metric (a hypothetical submission layout is sketched after this list).
• **Human Evaluation on our leaderboard**
Leaderboard participants need to provide submission files generated by the local evaluation script. We will use the evalAI system to check the submission files and, where necessary, re-run the code of the top challenge contenders. Teams must therefore submit their code together with a Readme documenting how to run it. Human evaluation will be performed by the organizing team.
• **Winner Announcement**
We will announce the winners and publish the final scores of the submissions on the leaderboard.
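For concreteness, a submission file might be laid out as JSON Lines, one record per test episode; the fields below are assumptions for illustration, not the official schema.

```python
import json

records = [
    {
        "task_id": "cuisineworld/level_03",
        "mode": "centralized",
        "actions": [{"turn": 0, "agent_0": "chop:tomato", "agent_1": "boil:water"}],
    },
]
with open("submission.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```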
---
10 Broader Impact Statement
This article and our associated forum aim to be a catalyst for innovative research, fostering collaborations that will drive the next wave of AI applications. By focusing on multimodal agents, we emphasize the future direction of human-AI interaction, its challenges, and its solutions. We detail three ways in which we make significant contributions to the broader community.
Firstly, we hope our forum encourages AI researchers to develop solutions grounded in real-world problems in gaming, robotics, healthcare, and long-video understanding. Specifically, the development of multimodal agents in gaming could lead to more immersive and personalized gaming experiences, thereby transforming the gaming industry. In robotics, the development of adaptive robotic systems could revolutionize industries ranging from manufacturing to agriculture, potentially addressing labor shortages and improving efficiency. In healthcare, the use of LLMs and VLMs as diagnostic agents or patient-care assistants could lead to more accurate diagnoses, improved patient care, and increased accessibility of medical services, particularly in underserved areas. Furthermore, the ability of these models to interpret long-form videos could have far-reaching applications, from enhancing online learning to improving technical support services. In general, the topics covered in our forum will have significant downstream effects on a wide range of industries and on people across the world.
Secondly, we hope our forum stands as a valuable resource for AI practitioners and researchers alike, serving as a platform to explore and deeply comprehend the diverse and complex challenges that come with implementing AI agents across a wide variety of environments and situations. This exploration includes, for instance, understanding the specific limitations and potential hazards linked to agentic AI systems when they are developed for specialized sectors such as healthcare diagnostics. In this domain, issues like dangerous hallucinations in AI behavior can pose significant risks, highlighting the critical need for meticulous design and testing. However, these specific challenges may not be equally relevant or noticeable when considering AI agents crafted for the gaming industry. In such recreational fields, developers might instead prioritize tackling different hurdles, such as the need for AI to perform more open-ended generation and exhibit creativity, adapting dynamically to unpredictable gameplay scenarios and player interactions. By attending the forum, participants will gain insights into how these varied environments dictate the focus and direction of AI development, and how best to tailor AI solutions to meet these distinct needs and overcome the pertinent challenges.
Thirdly, the various elements of our event, including the expert presentations, informative posters, and notably the winners of our two leaderboard challenges, are set to offer a substantive yet succinct overview of the latest and most significant trends, research directions, and innovative concepts in the realm of multimodal agents. These presentations will encapsulate pivotal findings and developments, shining a light on new systems, ideas, and technologies in the field of multimodal agent AI. This assortment of knowledge is not only beneficial for the attendees of our forum, who are looking to deepen their understanding and expertise in this domain, but it also serves as a dynamic and rich resource base. Those visiting our forum's website can tap into this reservoir of information to discover and understand the cutting-edge advancements and creative ideas steering the future of multimodal agent AI. We strive to serve as a useful knowledge base for both newcomers and veterans in the field. By engaging with these resources, we hope participants and online visitors alike can remain informed of the transformative changes and novel approaches that are shaping the exciting landscape surrounding multimodal agent AI.
11 Ethical Considerations
Multimodal Agent AI systems have many applications. In addition to interactive AI, grounded multimodal models could help drive content generation for bots and AI agents, and assist in productivity applications, helping to replay, paraphrase, predict actions, or synthesize 3D or 2D scenarios. Fundamental advances in agent AI help contribute towards these goals, and many would benefit from a greater understanding of how to model embodiment and empathy in a simulated reality or the real world. Arguably, many of these applications could have positive benefits.
However, this technology could also be used by bad actors. Agent AI systems that generate content can be used to manipulate or deceive people. Therefore, it is very important that this technology is developed in accordance with responsible AI guidelines, for example by explicitly communicating to users that content is generated by an AI system and by providing users with controls to customize such a system. It is also possible that Agent AI could be used to develop new methods of detecting manipulative content, partly because such systems expose the hallucination behavior of large foundation models, and thus help address another real-world problem.
For example: 1) In the health domain, the ethical deployment of LLM and VLM agents, especially in sensitive settings like healthcare, is paramount. AI agents trained on biased data could potentially worsen health disparities by providing inaccurate diagnoses for underrepresented groups. Moreover, the handling of sensitive patient data by AI agents raises significant privacy and confidentiality concerns. 2) In the gaming industry, AI agents could transform the role of developers, shifting their focus from scripting non-player characters to refining agent learning processes. Similarly, adaptive robotic systems could redefine manufacturing roles, necessitating new skill sets rather than replacing human workers. Navigating these transitions responsibly is vital to minimize potential socio-economic disruptions.
Furthermore, Agent AI focuses on learning collaboration policies in simulation, and there is some risk in applying those policies directly to the real world due to distribution shift. Robust testing and continual safety-monitoring mechanisms should be put in place to minimize the risk of unpredictable behavior in real-world scenarios. Our "VideoAnalytica" dataset is collected from the Internet, which is not a fully representative source, so we have already gone through the ethical review and legal process at both Microsoft and the University of Washington. Be that as it may, we also need to understand the biases that might exist in this corpus. Data distributions can be characterized in many ways. In this workshop, we have captured how the agent-level distribution in our dataset differs from that of other existing datasets. However, there is much more that could be included in a single dataset or workshop. We would argue that there is a need for further approaches and discussion linked to real tasks or topics, and that making these data and systems available is a step in that direction.
We will dedicate a segment of our project to discussing these ethical issues, exploring potential mitigation strategies, and deploying a responsible multimodal AI agent. We hope this paper helps more researchers answer these questions together.
12 Diversity Statement
By examining the adaptability of AI agent models across various domains, we inherently embrace a diversity of challenges, perspectives, and solutions. In this vein, our project aims to build a diverse community by exploring the wide array of subjects in multimodal and agentic AI.
With these principles in mind, this project focuses on advanced multimodal systems that interact effectively within both physical and virtual environments and that facilitate effective interaction with humans. As such, we intend to engage a broad range of experts and practitioners across a wide range of technical specialties, cultures, countries, and scholarly fields to discuss important topics, including but not limited to:
- Application of foundation models: the development of agents with integrated modalities (audio, image, text, sensor inputs), aiming to enhance their recognition and response capabilities for a wide variety of applications.
- General-purpose end-to-end systems: the development of end-to-end models that are trained with large-scale data, seeking to create versatile and adaptable AI solutions.
- Methodologies for grounding modalities: integrating information across various modalities, enhancing the coherence and efficacy of data processing.
- Intuitive human interface: the development of effective and meaningful interaction between humans and agents.
- Taming LLM/VLMs: exploring new approaches to address common issues in large-scale models, such as hallucinations and biases in their outputs.
We aspire to broaden our collective understanding of the potential and limitations of agentic AI by leveraging our unique and diverse perspectives. We strongly believe that this approach will not only enrich individual perspectives, but will also enhance the community's collective knowledge and promote a holistic view that is more inclusive of the wide-ranging challenges faced by multimodal AI agents.
References
See the original paper for the full list of references.