LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

Article link: https://blog.youkuaiyun.com/s_m_c/article/details/141650017

Published: 30 Mar 2023

Paper link: https://readpaper.com/pdf-annotate/note?pdfId=2423035600719930112&noteId=2424798413221639424

Affiliation: The Ohio State University

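Motivation: This work focuses on using large language models (LLMs) as planners for embodied agents that follow natural language instructions to complete complex tasks in visually perceived environments. The high data cost and poor sample efficiency of existing methods hinder the development of versatile agents that can handle many tasks and learn new ones quickly.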

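Approach: In this work, we propose LLM-Planner, a novel method that harnesses the power of large language models to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding, so that plans are generated and updated based on the current environment. Despite using less than 0.5% of paired training data, LLM-Planner achieves competitive performance with recent baselines that are trained using the full training data.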

Implementation: More specifically, we adopt hierarchical planning models, which consist of a high-level planner and a low-level planner:

We use LLMs to generate high-level plans (HLPs), i.e., a sequence of subgoals (e.g., [Navigation potato, Pickup potato, Navigation microwave, ...]) that the agent needs to achieve, in the specified order, to accomplish the final goal specified by the language instruction.
The low-level planner then maps each subgoal into a sequence of primitive actions for achieving that subgoal in the current environment and state.
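
The two-level split described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the subgoal names follow the example in the text, while the primitive action names (`MoveAhead`, `RotateRight`) and the fixed navigation route are hypothetical placeholders for what a real low-level planner would compute from the current state.

```python
from typing import Callable, Dict, List, Tuple

# A subgoal is (high-level action, target object), e.g. ("Navigation", "potato").
Subgoal = Tuple[str, str]


def expand_navigation(target: str) -> List[str]:
    # Hypothetical fixed route; a real low-level planner would use the
    # agent's current map and state to reach `target`.
    return ["MoveAhead", "RotateRight", "MoveAhead"]


def expand_pickup(target: str) -> List[str]:
    # An interaction subgoal maps to a single interaction with the target.
    return [f"PickupObject({target})"]


# Low-level planner: one expansion routine per high-level action.
LOW_LEVEL: Dict[str, Callable[[str], List[str]]] = {
    "Navigation": expand_navigation,
    "Pickup": expand_pickup,
}


def execute_plan(hlp: List[Subgoal]) -> List[str]:
    """Expand each subgoal, in the given order, into primitive actions."""
    primitive_actions: List[str] = []
    for action, target in hlp:
        primitive_actions.extend(LOW_LEVEL[action](target))
    return primitive_actions


# High-level plan as an LLM might return it (first three subgoals of the example).
hlp = [("Navigation", "potato"), ("Pickup", "potato"), ("Navigation", "microwave")]
trace = execute_plan(hlp)
```

Note that `execute_plan` never sees the language instruction: it only consumes subgoals, which is what makes the high-level planner swappable.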

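In addition, we follow the in-context learning paradigm and use only a small number of paired examples. No parameter updates are needed (the LLM's existing capabilities are adapted to the planning task through prompting alone, with no trained parameters), which saves development time.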

Implementation details of the "cerebrum" (high-level planner) and "cerebellum" (low-level planner). High-level planner: We define a high-level action to be a collection of primitive actions that can complete a single goal-condition in ALFRED [34]. We take the interaction actions directly from ALFRED and we only add the Navigation action. Therefore, the high-level action space consists of 1 navigation action (Navigation) and 7 interaction actions (PickupObject, PutObject, OpenObject, CloseObject, ToggleOnObject, ToggleOffObject, SliceObject). Similar actions are commonly used in other related work such as SayCan [1] and LM zero-shot planner.
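
A closed action space like this is easy to enforce when parsing LLM output. The sketch below encodes the eight action names listed above as an enum and validates a generated plan against it. The comma-separated "Action object" text format and the `parse_plan` helper are assumptions made for illustration; the source does not specify the output format.

```python
from enum import Enum
from typing import List, Tuple


class HighLevelAction(Enum):
    """The 1 navigation + 7 interaction actions named in the text."""
    NAVIGATION = "Navigation"
    PICKUP = "PickupObject"
    PUT = "PutObject"
    OPEN = "OpenObject"
    CLOSE = "CloseObject"
    TOGGLE_ON = "ToggleOnObject"
    TOGGLE_OFF = "ToggleOffObject"
    SLICE = "SliceObject"


def parse_plan(text: str) -> List[Tuple[HighLevelAction, str]]:
    """Parse 'Action object, Action object, ...' (assumed format) into subgoals."""
    subgoals = []
    for chunk in text.split(","):
        parts = chunk.split()
        if len(parts) != 2:
            raise ValueError(f"malformed subgoal: {chunk!r}")
        name, target = parts
        # Enum lookup by value raises ValueError for actions outside the space.
        subgoals.append((HighLevelAction(name), target))
    return subgoals


plan = parse_plan("Navigation potato, PickupObject potato, Navigation microwave")
```

Rejecting unknown actions at parse time means a hallucinated action surfaces as an error before anything is sent to the low-level planner.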

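Low-level planner ("cerebellum"): The low-level planner maps each subgoal into a sequence of primitive actions $L_l = [a_0, a_1, \cdots, a_{T_i}]$. Once the high-level plan is fixed, the low-level actions are independent of the instruction. (In effect, the instruction $I$ drops out of the conditioning in the conditional probability.)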
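
The conditional-independence remark can be written out explicitly. The notation here is an assumption for illustration and is not taken from the paper: $g_i$ denotes the $i$-th subgoal of the high-level plan and $o$ the agent's current observation of the environment.

$$
P(a_0, a_1, \cdots, a_{T_i} \mid I, g_i, o) = P(a_0, a_1, \cdots, a_{T_i} \mid g_i, o)
$$

In words: once the subgoal $g_i$ is known, additionally conditioning on the instruction $I$ does not change which primitive actions are chosen.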

Details:

The first step is to design an appropriate prompt to guide the LLM to generate high-level plans. (The final HLP quality can be sensitive to minor design choices in the prompt, so prompt design matters.)
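We adopt a k-nearest neighbor (kNN) retriever to select the in-context examples. (A frozen BERT-based model [7] scores the pairwise similarity between each training example and the current test example. For each test example, the K most similar examples are retrieved from the small set of paired training examples available, where K is a hyperparameter tuned under the true few-shot setting.)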
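We propose a novel grounded re-planning algorithm to enhance LLMs with the ability to ground to the current environment, which further improves the HLP quality. (Concretely, this is a simple but effective way to give the LLM physical grounding: the list of observed objects, as detected in the environment by the embodied agent's object detector, is injected into the prompt. Re-planning is triggered under either of two conditions: 1) the agent fails to execute an action, or 2) a fixed number of time steps has elapsed.)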
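Integrate LLM-Planner into an existing embodied agent to give it few-shot planning ability. (We integrate it with HLSM [3], a strong baseline that satisfies this interface.)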
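
The grounded re-planning step can be illustrated with a toy loop. This is a minimal sketch under several assumptions, not the paper's algorithm: the prompt layout, the `fake_llm` stand-in for a real LLM call, the lamp scenario, and counting "time steps" in subgoal attempts are all made up for illustration. What it does show is the two triggers described above (a failed action, or a fixed number of steps) and the observed-object list being injected into the prompt.

```python
from typing import Callable, List, Set, Tuple

Subgoal = Tuple[str, str]  # (high-level action, target object)


def build_prompt(instruction: str, completed: List[Subgoal], observed: Set[str]) -> str:
    """Assumed prompt layout: task, finished subgoals, then observed objects."""
    done = ", ".join(f"{a} {t}" for a, t in completed)
    return (f"Task: {instruction}\n"
            f"Completed plan: {done}\n"
            f"Visible objects: {', '.join(sorted(observed))}\n"
            "Next plan:")


def run_episode(instruction: str,
                llm: Callable[[str], List[Subgoal]],
                try_subgoal: Callable[[Subgoal], bool],
                detect_objects: Callable[[], Set[str]],
                replan_every: int = 3,
                max_steps: int = 20) -> Tuple[List[Subgoal], int]:
    completed: List[Subgoal] = []
    observed: Set[str] = set()
    plan = llm(build_prompt(instruction, completed, observed))
    steps_since_replan, replans = 0, 0
    for _ in range(max_steps):
        if not plan:
            break
        observed |= detect_objects()       # grow the observed-object list
        ok = try_subgoal(plan[0])
        steps_since_replan += 1
        if ok:
            completed.append(plan.pop(0))
        # Trigger 1: the action failed. Trigger 2: a fixed number of steps passed.
        if not ok or steps_since_replan >= replan_every:
            plan = llm(build_prompt(instruction, completed, observed))
            steps_since_replan = 0
            replans += 1
    return completed, replans


def fake_llm(prompt: str) -> List[Subgoal]:
    """Stand-in for an LLM: prefers a lamp that appears in the object list."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    n_done = len([c for c in fields["Completed plan"].split(", ") if c])
    visible = fields["Visible objects"].split(", ")
    lamp = "FloorLamp" if "FloorLamp" in visible else "TableLamp"
    return [("Navigation", lamp), ("ToggleOnObject", lamp)][n_done:]


scene = {"FloorLamp", "Sofa"}  # the room has no TableLamp
completed, replans = run_episode(
    "turn on the lamp", fake_llm,
    try_subgoal=lambda sg: sg[1] in scene,  # succeeds only if the target exists
    detect_objects=lambda: set(scene),
)
```

In this toy run the first plan targets a `TableLamp` that does not exist, the navigation subgoal fails, and the re-plan sees `FloorLamp` in the object list and switches targets.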

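Experiments: The ALFRED dataset consists of 7 task types spanning 207 unique environments, 115 different object types, and 4,703 tasks. Task difficulty ranges from moving a single object to a new location to placing a heated slice of an object into a receptacle. Each task comes with a human-written annotation of the high-level goal and a series of finer-grained step-by-step instructions, created by human annotators as they watched expert demonstrations of the task (so every step has ground truth).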

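Where the few-shot setting shows up: For the main experiments, we chose 100 as the number of training examples without any cross-validation because it is our target number for the few-shot setting. (These 100 examples are not used to update any parameters; they are only used at test time, as the pool from which in-context examples are retrieved.)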

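HLP accuracy generally improves with more training examples, although returns start to diminish at around 250 training examples.
When there are fewer training examples, using more in-context examples (the examples shown to the LLM in the prompt) helps more, because there are fewer useful examples to retrieve from.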
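
The retrieval step these observations refer to can be sketched as follows. To keep the example dependency-free, a bag-of-words cosine similarity stands in for the frozen BERT-based similarity model, which is a deliberate simplification and not what the paper uses. The example pool and its plans are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, str]  # (instruction, high-level plan)


def embed(text: str) -> Counter:
    # Toy stand-in for a sentence embedding: word counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, pool: List[Example], k: int) -> List[Example]:
    """Return the k pool examples whose instructions are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


# Made-up pool of paired examples (instruction, plan).
pool = [
    ("heat a potato in the microwave",
     "Navigation potato, PickupObject potato, Navigation microwave"),
    ("turn on the floor lamp",
     "Navigation floorlamp, ToggleOnObject floorlamp"),
    ("slice a tomato on the counter",
     "Navigation knife, PickupObject knife, Navigation tomato, SliceObject tomato"),
]
top = retrieve("put a sliced tomato on the counter", pool, k=2)
```

With a small pool, the top-ranked neighbors may be only loosely related to the query, which is one way to read the observation that more in-context examples help more when training examples are scarce.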

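Conclusion: Unlike SayCan, we use the LLM to generate plans directly rather than to rank admissible skills, which removes the need for sufficient prior knowledge of the environment and also significantly reduces the number of LLM calls. Another unique advantage of LLM-Planner is that it can re-plan dynamically based on what the agent observes in the current environment, producing more grounded plans.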

Under the same few-shot setting, existing methods can barely complete any task successfully. Our work opens a new door for developing versatile and extremely sample-efficient embodied agents by harnessing the power of large language models and grounding.