Source: https://arxiv.org/pdf/2410.11792
OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation
Jinhan Li$^{1\dagger}$  Yifeng Zhu$^{1*}$  Georgios Pavlakos$^{1}$  Yuke Zhu$^{1,2}$
$^{1}$UT Austin  $^{2}$NVIDIA Research
Abstract: We study the problem of teaching humanoid robots manipulation skills by imitating from single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalization across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website: https://ut-austin-rpl.github.io/OKAMI/.
Keywords: Humanoid Manipulation, Imitation From Videos, Motion Retargeting

Figure 1: OKAMI enables a human user to teach the humanoid robot how to perform a new task by providing a single video demonstration.
1 Introduction
Deploying generalist robots to assist with everyday tasks requires them to operate autonomously in natural environments. With recent advances in hardware designs and increased commercial availability, humanoid robots emerge as a promising platform to deploy in our living and working spaces. Despite their great potential, they still struggle to operate autonomously and deploy robustly in the unstructured world. A burgeoning line of work has resorted to deep imitation learning methods for humanoid manipulation [1-3]. However, these methods rely on large numbers of demonstrations collected through whole-body teleoperation, requiring domain expertise and strenuous effort. In contrast, humans have the innate ability to watch their peers do a task once and mimic the behaviors. Equipping robots with the ability to imitate from visual observations will move us closer to the goal of training robotic foundation models from Internet-scale human activity videos.
$^{\dagger}$This work was done while Jinhan Li was a visiting researcher at UT Austin.
$^{*}$Equal contribution.
We explore teaching humanoid robots to manipulate objects by watching humans. We consider a problem setting recently formulated as “open-world imitation from observation,” where a robot imitates a manipulation skill from a single video of human demonstration [4-6]. This setting would facilitate users in effortlessly demonstrating tasks and enable a humanoid robot to acquire new skills quickly. Enabling humanoids to imitate from single videos presents a significant challenge: the video does not have action labels, yet the robot has to learn to perform tasks in new situations beyond what is demonstrated in the video. Prior works on one-shot video learning have attempted to optimize robot actions to reconstruct the future object motion trajectories [4, 5]. However, they have been applied to single-arm manipulators and are computationally prohibitive for humanoid robots due to their high degrees of freedom and joint redundancy [7]. Meanwhile, the similar kinematic structure shared by humans and humanoids makes directly retargeting human motions to robots feasible [8, 9]. Nonetheless, existing retargeting techniques focus on free-space body motions [10-14], lacking the contextual awareness of objects and interactions needed for manipulation. To address this shortcoming, we introduce the concept of “object-aware retargeting”. By incorporating object contextual information into the retargeting process, the resulting humanoid motions can be efficiently adapted to the locations of objects in open-ended environments.
To this end, we introduce OKAMI (Object-aware Kinematic retArgeting for huManoid Imitation), an object-aware retargeting method that enables a bimanual humanoid with two dexterous hands to imitate manipulation behaviors from a single RGB-D video demonstration. OKAMI uses a two-stage process to retarget the human motions to the humanoid robot to accomplish the task across varying initial conditions. The first stage processes the video to generate a reference manipulation plan. The second stage uses this plan to synthesize the humanoid motions through motion retargeting that adapts to the object locations in target environments.
OKAMI consists of two key designs. The first design is an open-world vision pipeline that identifies task-relevant objects, reconstructs human motions from the video, and localizes task-relevant objects during evaluation. Localizing objects at test time also enables motion retargeting to adapt to different backgrounds or new object instances of the same categories. The second design is the factorized process for retargeting, where we retarget the body motions and hand poses separately. We first retarget the body motions from the reference plan in the task space, and then warp the retargeted trajectory given the location of task-relevant objects. The trajectory of body joints is obtained through inverse kinematics. The joint angles of fingers are mapped from the plan onto the dexterous hands, reproducing hand-object interaction. With object-aware retargeting, OKAMI policies systematically generalize across various spatial layouts of objects and scene clutter. Finally, we train visuomotor policies on the rollout trajectories from OKAMI through behavioral cloning to obtain vision-based manipulation skills.
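One simple way to realize such a warp, assuming object and wrist poses are available as 4x4 homogeneous transforms, is sketched below. This is an illustrative simplification rather than OKAMI's exact procedure; the inverse-kinematics and finger-mapping steps are only indicated in comments.

```python
# Minimal sketch of trajectory warping by a change in object pose.
# Illustrative only; not OKAMI's actual implementation.
import numpy as np


def warp_trajectory(wrist_traj_demo: np.ndarray,
                    T_obj_demo: np.ndarray,
                    T_obj_test: np.ndarray) -> np.ndarray:
    """Re-anchor a demonstrated wrist trajectory to a new object pose.

    wrist_traj_demo: (N, 4, 4) wrist poses reconstructed from the human video.
    T_obj_demo:      (4, 4) task-relevant object pose in the demonstration.
    T_obj_test:      (4, 4) object pose observed at test time.
    Returns (N, 4, 4) warped wrist poses in the same world frame.
    """
    # Rigid transform that maps the demo object frame onto the test-time one.
    delta = T_obj_test @ np.linalg.inv(T_obj_demo)
    # Apply the same transform to every waypoint of the trajectory.
    return np.einsum("ij,njk->nik", delta, wrist_traj_demo)


# Each warped waypoint would then be fed to an inverse-kinematics solver to
# obtain arm joint angles, while finger joint angles are mapped directly from
# the reconstructed human hand poses (not shown here).
```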
We evaluate OKAMI on human video demonstrations of diverse tasks that cover rich object interactions, such as picking, placing, pushing, and pouring. We show that its object-aware retargeting achieves a 71.7% task success rate averaged across all tasks and outperforms the ORION [4] baseline by 58.3%. We then train closed-loop visuomotor policies on the trajectories generated by OKAMI, achieving an average success rate of 79.2%. The contributions of OKAMI are three-fold:
- OKAMI enables a humanoid robot to mimic human behaviors from a single video for dexterous manipulation. Its object-aware retargeting process generates feasible motions of the humanoid robot while adapting the motions to target object locations at test time;
- OKAMI uses vision foundation models [15, 16] to identify task-relevant objects without additional human inputs. Their common-sense reasoning ability helps recognize task-relevant objects even if they are not directly in contact with other objects or the robot hands, allowing our method to imitate more diverse tasks than prior work;
- We validate OKAMI's strong spatial and visual generalization abilities on humanoid hardware. OKAMI enables real-robot deployment in natural environments with unseen object layouts, varying visual backgrounds, and new object instances.
2 Related Work
Humanoid Robot Control. Methods like motion planning and optimal control have been developed for humanoid locomotion and manipulation [10, 12, 17]. These model-based approaches rely on precise physical modeling and expensive computation [11, 12, 18]. To mitigate the stringent requirements, researchers have explored policy training in simulation and sim-to-real transfer [10, 19]. However, these methods still require a significant amount of labor and expertise in designing simulation tasks and reward functions, limiting their successes to locomotion domains. In parallel to automated methods, a variety of human control mechanisms and devices have been developed for humanoid teleoperation using motion capture suits [9, 12, 20-24], telexistence cockpits [25-29], VR devices [1, 30, 31], or videos that track human bodies [17, 32]. While these systems can control the robots to generate diverse behaviors, they require real-time human input that poses significant cognitive and physical burdens. In contrast, OKAMI only requires single RGB-D human videos to teach the humanoid robot new skills, significantly reducing the human cost.
Imitation Learning for Robot Manipulation. Imitation Learning has significantly advanced vision-based robot manipulation with high sample efficiency [33-44]. Prior works have shown that robots can learn visuomotor policies to complete various tasks with just dozens of demonstrations, ranging from long-horizon manipulation [34-36] to dexterous manipulation [37-39]. However, collecting demonstrations often requires domain expertise and high costs, creating challenges to scale. Another line of work focuses on one-shot imitation learning [40-44], yet they demand excessive data collection for meta-training tasks. Recently, researchers have looked into a new problem setting of imitating from a single video demonstration [4-6], referred to as “open-world imitation from observation” [4]. Unlike prior works that abstract away embodiment motions due to kinematic differences between the robot and the human, we exploit embodiment motion information owing to the kinematic similarity between humans and humanoids. Specifically, we introduce object-aware retargeting that adapts human motions to humanoid robots.
Motion Retargeting. Motion retargeting has wide applications in computer graphics and 3D vision [8], where extensive literature studies how to adapt human motions to digital avatars [45-47]. This technique has been adopted in robotics for recreating human-like motions on humanoid or anthropomorphic robots through various retargeting methods, including optimization-based approaches [11, 12, 20, 48], geometric-based methods [49], and learning-based techniques [10, 13, 17]. However, in manipulation tasks, these retargeting methods have been used within teleoperation systems, lacking a vision pipeline for automatic adaptation to object locations. OKAMI integrates the retargeting process with open-world vision, endowing it with object awareness so that the robot can mimic human motions from video demonstrations and adapt to object locations at test time.
3 OKAMI
In this work, we introduce OKAMI, a two-staged method that tackles open-world imitation from observation for humanoid robots. OKAMI first generates a reference plan using the object locations and reconstructed human motions from a given RGB-D video. Then, it retargets the human motion trajectories to the humanoid robot while adapting the trajectories based on new locations of the objects. Figure 2 illustrates the whole pipeline.
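For a structural view of this pipeline, the sketch below outlines the two stages in Python. Every name in it (ReferencePlan, generate_reference_plan, localize_objects, retarget_step, and the robot interface) is a hypothetical placeholder standing in for the components described in the rest of this section, not part of OKAMI's actual code.

```python
# Hypothetical structural sketch of the two-stage pipeline; all helpers are
# placeholders for the components described in this section.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class PlanStep:
    human_motion: Any    # reconstructed body/hand motion for this segment
    target_object: str   # task-relevant object manipulated in this segment


@dataclass
class ReferencePlan:
    object_names: List[str]
    steps: List[PlanStep]


def generate_reference_plan(video_rgbd: Any) -> ReferencePlan:
    """Stage 1: identify task-relevant objects and reconstruct human motion."""
    raise NotImplementedError("placeholder for reference plan generation")


def localize_objects(rgbd_obs: Any, names: List[str]) -> Dict[str, Any]:
    """Locate the task-relevant objects in the robot's current observation."""
    raise NotImplementedError("placeholder for open-world object localization")


def retarget_step(step: PlanStep, object_poses: Dict[str, Any],
                  robot: Any) -> List[Any]:
    """Stage 2: object-aware retargeting of one plan step to joint commands."""
    raise NotImplementedError("placeholder for object-aware retargeting")


def run_okami(video_rgbd: Any, robot: Any) -> None:
    """Follow the reference plan step by step, re-localizing objects each time."""
    plan = generate_reference_plan(video_rgbd)
    for step in plan.steps:
        observation = robot.get_rgbd_observation()
        poses = localize_objects(observation, plan.object_names)
        robot.execute(retarget_step(step, poses, robot))
```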
Problem Formulation. We formulate a humanoid manipulation task as a discrete-time Markov Decision Process defined by a tuple $M = (S, A, P, R, \gamma, \mu)$, where $S$ is the state space, $A$ is the action space, $P(\cdot \mid s, a)$ is the transition probability, $R(s)$ is the reward function, $\gamma \in [0, 1)$ is the discount factor, and $\mu$ is the initial state distribution. In our context, $S$ is the space of raw RGB-D observations that capture both the robot and object states, $A$ is the space of motion commands for the humanoid robot, and $R$ is the sparse reward function that returns 1 when the task is complete. The objective of solving a task is to find a policy $\pi$ that maximizes the expected task success rate over a wide range of initial configurations drawn from $\mu$ at test time.
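The optimization objective is not written out explicitly in the text; under the standard discounted formulation implied by the tuple above, it would take the form

$$
\pi^{*} = \arg\max_{\pi}\;
\mathbb{E}_{\,s_0 \sim \mu,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}
\left[ \sum_{t \ge 0} \gamma^{t} R(s_t) \right],
$$

where the sparse reward makes the expectation a (discounted) measure of task success over initial configurations drawn from $\mu$.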

Figure 2: Overview of OKAMI. OKAMI is a two-staged method that enables a humanoid robot to imitate a manipulation task from a single human video. In the first stage, OKAMI generates a reference plan using GPT-4V and large vision models for subsequent manipulation. In the second stage, OKAMI follows the reference plan, where it retargets human motions onto the humanoid with object awareness. The retargeted motions are converted into a sequence of robot joint commands for the robot to follow.
We consider the setting of “open-world imitation from observation” [4], where the robot system takes a recorded RGB-D human video $V$ as input, and returns a humanoid manipulation policy $\pi$ that completes the task as demonstrated in $V$. This setting is “open-world” as the robot does not have prior knowledge or ground-truth access to the categories or physical states of objects involved in the task, and it is “from observation” in the sense that video $V$ does not come with any ground-truth robot actions. A policy execution is considered successful if the state matches the state of the final frame from $V$. The success conditions of all tested tasks are described in Appendix B.1. Notably, two assumptions are made about $V$ in this paper: all the image frames in $V$ capture the human bodies, and the camera viewpoint is static throughout the recording of $V$.
3.1 Reference Plan Generation
To enable object-aware retargeting, OKAMI first generates a reference plan for the humanoid robot to follow. Plan generation involves understanding what task-relevant objects are and how humans manipulate them.
Identifying and Localizing Task-Relevant Objects. To imitate manipulation tasks from the video $V$, OKAMI must identify the task-relevant objects to interact with. While prior methods rely on unsupervised approaches with simple backgrounds or require additional human annotations [50-53], OKAMI uses an off-the-shelf Vision-Language Model (VLM), GPT-4V, to identify task-relevant objects in $V$ by leveraging the commonsense knowledge internalized in the model. Concretely, OKAMI obtains the names of task-relevant objects by sampling RGB frames from the video demonstration $V$ and prompting GPT-4V with the concatenation of these images (details in Appendix A.2). Using these object names, OKAMI employs Grounded-SAM [16] to segment the objects in the first frame and track their locations throughout the video using a Video Object
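As a rough illustration of the frame-sampling and prompting step described above, the snippet below queries GPT-4V through the OpenAI Python client with a few uniformly sampled frames. The prompt wording, sampling count, and model identifier are assumptions for illustration (the paper's actual prompts are given in Appendix A.2), and the downstream Grounded-SAM segmentation and video-level tracking are only indicated in a trailing comment.

```python
# Rough sketch of sampling frames and asking GPT-4V for task-relevant objects.
# Prompt text, model name, and segment_and_track() are illustrative placeholders.
import base64
import cv2  # opencv-python, used only to read and JPEG-encode video frames
from openai import OpenAI


def sample_frames(video_path: str, num_frames: int = 4) -> list[bytes]:
    """Uniformly sample a few RGB frames and JPEG-encode them."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.imencode(".jpg", frame)[1].tobytes())
    cap.release()
    return frames


def identify_task_relevant_objects(video_path: str) -> list[str]:
    """Ask GPT-4V which objects in the sampled frames are relevant to the task."""
    client = OpenAI()
    images = [
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,"
                              + base64.b64encode(f).decode()}}
        for f in sample_frames(video_path)
    ]
    prompt = ("These frames show a human demonstrating a manipulation task. "
              "List only the names of the objects relevant to the task, "
              "one per line.")  # placeholder prompt; see Appendix A.2
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model identifier
        messages=[{"role": "user",
                   "content": [{"type": "text", "text": prompt}] + images}],
    )
    return [line.strip() for line in
            response.choices[0].message.content.splitlines() if line.strip()]


# The returned names would then seed Grounded-SAM segmentation of the first
# frame and video-level tracking, e.g. via a hypothetical
# segment_and_track(video_path, object_names) wrapper (not shown here).
```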