On Bringing Robots Home
Throughout history, we have successfully integrated various machines into our homes. Dishwashers, laundry machines, stand mixers, and robot vacuums are just a few recent examples. However, each of these machines excels at only a single task. The concept of a “generalist machine” in homes, a domestic assistant that can adapt and learn from our needs while remaining cost-effective, has long been a goal in robotics and has been steadily pursued for decades. In this work, we initiate a large-scale effort towards this goal by introducing Dobb·E, an affordable yet versatile general-purpose system for learning robotic manipulation within household settings. Dobb·E can learn a new task from only five minutes of a user demonstrating it, thanks to a demonstration collection tool (“The Stick”) we built out of cheap parts and iPhones. We use the Stick to collect 13 hours of data in 22 homes of New York City, and train Home Pretrained Representations (HPR). Then, in a novel home environment, with five minutes of demonstrations and fifteen minutes of adapting the HPR model, we show that Dobb·E can reliably solve the task on the Stretch, a mobile robot readily available on the market. Across roughly 30 days of experimentation in homes of New York City and surrounding areas, we test our system in 10 homes, with a total of 109 tasks in different environments, and achieve a success rate of 81%. Beyond success percentages, our experiments reveal a plethora of unique challenges absent or ignored in lab robotics. These range from the effects of strong shadows to variable demonstration quality from non-expert users. With the hope of accelerating research on home robots, and eventually seeing robot butlers in every home, we open-source the Dobb·E software stack and models, our data, and our hardware designs.
Figure 1: We present Dobb·E, a simple framework to train robots, which is then field tested in homes across New York City. In under 30 minutes of training per task, Dobb·E achieves an 81% success rate on simple household tasks.
1 Introduction
Since our transition away from a nomadic lifestyle, homes have been a cornerstone of human existence. Technological advancements have made domestic life more comfortable, through innovations ranging from simple utilities like water heaters to advanced smart-home systems. However, a holistic, automated home assistant remains elusive, even with significant representations in popular culture [1].
Our goal is to build robots that perform a wide range of simple domestic tasks across diverse real-world households. Such an effort requires a shift from the prevailing paradigm: current research in robotics is predominantly conducted either in industrial environments or in academic labs, both containing curated objects, scenes, and even lighting conditions. In fact, even for simple tasks such as object picking [2] or point navigation [3], the performance of robotic algorithms in homes is far below that of their lab counterparts. If we seek to build robotic systems that can solve harder, general-purpose tasks, we will need to reevaluate many of the foundational assumptions in lab robotics.
In this work we present Dobb·E, a framework for teaching robots in homes that embodies three core principles: efficiency, safety, and user comfort. For efficiency, we embrace large-scale data coupled with modern machine learning tools. For safety, when presented with a new task, our robot learns from a handful of human demonstrations instead of by trial and error. For user comfort, we have developed an ergonomic demonstration collection tool, enabling us to gather task-specific demonstrations in unfamiliar homes without direct robot operation.
Concretely, the key components of Dobb·E include:
• Hardware: The primary interface is our demonstration collection tool, termed the “Stick.” It combines an affordable reacher-grabber with 3D printed components and an iPhone. Additionally, an iPhone mount on the robot facilitates direct data transfer from the Stick without needing domain adaptation.
• Pretraining Dataset: Leveraging the Stick, we amass a 13-hour dataset called Homes of New York (HoNY), comprising 5620 demonstrations from 216 environments in 22 New York homes, bolstering our system’s adaptability. This dataset serves to pretrain representation models for Dobb·E.
• Models and algorithms: Given the pretraining dataset, we train a streamlined vision model, called Home Pretrained Representations (HPR), employing cutting-edge self-supervised learning (SSL) techniques. For novel tasks, a mere 24 demonstrations sufficed to finetune this vision model, incorporating both visual and depth information to account for 3D reasoning.
• Integration: Our holistic system, encapsulating hardware, models, and algorithms, is centered around a commercially available mobile robot: Hello Robot Stretch [4].
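The dataset figures quoted in the list above imply a few useful averages. The short calculation below is only a back-of-the-envelope sketch: it uses the rounded totals stated in the text (13 hours, 5620 demonstrations, 216 environments, 22 homes), so the derived values are approximate and are not numbers reported by the authors.

```python
# Totals quoted in the text for the Homes of New York (HoNY) dataset.
TOTAL_HOURS = 13          # rounded figure from the text
NUM_DEMOS = 5620
NUM_ENVIRONMENTS = 216
NUM_HOMES = 22

# Derived averages (approximate, since TOTAL_HOURS is rounded).
avg_demo_seconds = TOTAL_HOURS * 3600 / NUM_DEMOS
demos_per_environment = NUM_DEMOS / NUM_ENVIRONMENTS
environments_per_home = NUM_ENVIRONMENTS / NUM_HOMES

print(f"average demonstration length: {avg_demo_seconds:.1f} s")
print(f"demonstrations per environment: {demos_per_environment:.1f}")
print(f"environments per home: {environments_per_home:.1f}")
```

Under these assumptions, a typical demonstration lasts roughly eight seconds, each environment contributes about 26 demonstrations, and each home contributes about ten environments.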
Figure 2: (A) We design a new imitation learning framework, starting with a data collection tool. (B) Using this data collection tool, users can easily collect demonstrations for household tasks. (C) Using a similar setup on a robot, (D) we can transfer those demos using behavior cloning techniques to real homes.
Figure 3: We ran experiments in a total of 10 homes in the New York City area, and successfully completed 102 of the 109 tasks that we tried. The figure shows a subset of 60 tasks, 6 from each of the 10 homes, from our home robot experiments using Dobb·E.
We ran Dobb·E in 10 homes over 30 days of experimentation, during which it attempted 109 tasks and successfully learned 102 of them with performance ≥ 50%, with an overall success rate of 81%. Concurrently, extensive experiments run in our lab reveal the importance of many key design decisions. Our key experimental findings are:
• Surprising effectiveness of simple methods: Dobb·E follows a simple behavior cloning recipe for visual imitation learning, using a ResNet model [5] for visual representation extraction and a two-layer neural network [6] for action prediction (see Section 2). On average, using only 91 seconds of data per task, collected over five minutes, Dobb·E achieves an 81% success rate in homes (see Section 3).
• Impact of effective SSL pretraining: Our foundational vision model, HPR, trained on home data, improves task success rate by at least 23% compared to other foundational vision models [7–9], which were trained on much larger internet datasets (see Section 3.4.1).
• Odometry, depth, and expertise: The success of Dobb·E relies heavily on the Stick providing highly accurate odometry and actions from the iPhone’s pose and position sensing, and depth information from the iPhone’s LiDAR. The ease of collecting demonstrations also makes iterating on research problems with the Stick much faster and easier (see Section 3.4).
• Remaining challenges: Hardware constraints, such as the robot’s force, reach, and battery life, limit the tasks our robot can physically solve (see Section 3.3.3), while our policy framework struggles with ambiguous sensing and more complex, temporally extended tasks (see Sections 3.3.4, 4.1).
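The two headline numbers above measure different things: 102 of 109 tasks cleared a per-task threshold of 50%, while 81% is an overall success rate. The sketch below illustrates the distinction on made-up trial outcomes. The task names, trial counts, and the choice to average per-task rates are all assumptions for illustration; the text here does not specify how the authors aggregate the overall figure.

```python
def summarize(trials_per_task, threshold=0.5):
    """Return (per-task rates, tasks at or above threshold, mean per-task rate)."""
    rates = {
        task: sum(outcomes) / len(outcomes)
        for task, outcomes in trials_per_task.items()
    }
    learned = [task for task, rate in rates.items() if rate >= threshold]
    mean_rate = sum(rates.values()) / len(rates)
    return rates, learned, mean_rate


# Hypothetical outcomes: 1 = successful trial, 0 = failed trial.
example_trials = {
    "open_door": [1] * 9 + [0] * 1,
    "close_drawer": [1] * 10,
    "pour": [1] * 6 + [0] * 4,
    "flip_switch": [1] * 2 + [0] * 8,
}

rates, learned, mean_rate = summarize(example_trials)
print(f"tasks at or above 50%: {len(learned)} of {len(rates)}")
print(f"mean per-task success rate: {mean_rate:.1%}")

# Fraction of tasks that cleared the threshold in the reported experiments.
reported_fraction = 102 / 109
print(f"reported tasks at or above 50%: {reported_fraction:.1%}")
```

In the toy data, three of four tasks clear the threshold even though the mean success rate is lower, which mirrors how roughly 94% of the reported tasks (102 of 109) can clear 50% while the overall rate is 81%.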
To encourage and support future work in home robotics, we have open-sourced our code, data, models, and hardware designs, and are committed to supporting reproduction of our results. More information, along with robot videos, is available on our project website: https://dobb-e.com.
2 Technical Components and Method
To create Dobb·E, we partly build new robotic systems from first principles and partly integrate state-of-the-art techniques. In this section we describe the key technical components of Dobb·E. To aid in reproduction of Dobb·E, we have open-sourced all of the necessary ingredients in our work; please see Section 5 for more detail.
At a high level, Dobb·E is a behavior cloning framework [10]. Behavior cloning is a subclass of imitation learning, a machine learning approach in which a model learns to perform a task by observing and imitating the actions and behaviors of humans or other expert agents. Behavior cloning involves training a model to mimic a demonstrated behavior or action, often using labeled training data that maps observations to desired actions. In our approach, we pretrain a lightweight foundational vision model on a dataset of household demonstrations, and then in a new home, given a new task, we collect a handful of demonstrations and fine-tune our model to solve that task. However, there are many aspects of behavior cloning that we created from scratch or re-engineered from existing solutions to conform to our requirements of efficiency, safety, and user comfort.
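To make the idea of behavior cloning concrete, here is a minimal sketch that fits a policy to (observation, action) pairs by minimizing mean squared error. It is deliberately simplified and is not the authors' implementation: a linear policy stands in for the ResNet encoder and two-layer network described in the text, observations are two-dimensional vectors rather than images, and the `expert` function generating the demonstrations is hypothetical.

```python
import random


def fit_linear_policy(observations, actions, lr=0.1, epochs=500):
    """Fit action ≈ W·obs + b by gradient descent on mean squared error."""
    obs_dim, act_dim = len(observations[0]), len(actions[0])
    weights = [[0.0] * obs_dim for _ in range(act_dim)]
    bias = [0.0] * act_dim
    n = len(observations)
    for _ in range(epochs):
        grad_w = [[0.0] * obs_dim for _ in range(act_dim)]
        grad_b = [0.0] * act_dim
        # Accumulate the MSE gradient over all demonstration pairs.
        for obs, act in zip(observations, actions):
            for i in range(act_dim):
                pred = sum(w * o for w, o in zip(weights[i], obs)) + bias[i]
                err = pred - act[i]
                grad_b[i] += 2 * err / n
                for j in range(obs_dim):
                    grad_w[i][j] += 2 * err * obs[j] / n
        # Take one gradient step.
        for i in range(act_dim):
            bias[i] -= lr * grad_b[i]
            for j in range(obs_dim):
                weights[i][j] -= lr * grad_w[i][j]
    return weights, bias


def policy(weights, bias, obs):
    """Predict an action for a single observation."""
    return [sum(w * o for w, o in zip(row, obs)) + b
            for row, b in zip(weights, bias)]


def expert(obs):
    """Hypothetical demonstrator whose behavior the policy should imitate."""
    return [0.5 * obs[0] - 0.2 * obs[1], 0.3 * obs[1] + 0.1]


rng = random.Random(0)
demo_obs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
demo_act = [expert(o) for o in demo_obs]

weights, bias = fit_linear_policy(demo_obs, demo_act)
test_obs = [0.4, -0.7]
predicted = policy(weights, bias, test_obs)
print("expert action:   ", expert(test_obs))
print("predicted action:", predicted)
```

The structure is the same as in the full system: demonstrations supply supervised targets, and the learned policy is queried with new observations at deployment time. Only the model class and the observation type differ.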
Our method can be divided into four broad stages: (a) designing a hardware setup that helps us in the collection of demonstrations and their seamless transfer to the robot embodiment, (b) collecting data using our hardware setup in diverse households, (c) pretraining foundational models on this data, and (d) deploying our trained models into homes.
2.1 Hardware Design
The first step in scaling robotic imitation to arbitrary households is to take a closer look at the standard imitation learning process and its inefficiencies. Two of the primary inefficiencies in current real-world imitation learning lie in collecting the robotic demonstrations and transferring them across environments.
Collecting robot demonstrations. The standard approach to collecting robot demonstrations in a robotic setup is to instrument the robot and pair it with some sort of remote controller device [11, 12], a full robotic exoskeleton [13–16], or simpler data collection tools [17–19]. Many recent works have used a video game controller or a phone [11], RGB-D cameras [20], or virtual reality devices [12, 21, 22] to control the robot. Other works [23] have used two paired robots in a scene, where one robot is physically moved by the demonstrator while the other is recorded by the cameras. However, such approaches are hard to scale up to households efficiently. Physically moving a robot is generally unwieldy, and for a home robotic task would require having multiple robots present at the site. Similarly, full exoskeleton-based setups as shown in [13, 15, 16] are also unwieldy in a household setting. Generally, the hardware controller approach suffers from inefficiency because the human demonstrators have to map the controller input to the robot motion. Using phones or virtual reality devices is more efficient, since they can map the demonstrators’ movements directly to the robot. However, augmenting these controllers with force feedback is nearly impossible, often leading users to inadvertently apply extra force or torque on the robot. Such demonstrations frequently end up being unsafe, and the generally accepted solution to this problem is to limit the force and torque users can apply; however, this often causes the robot to diverge from the human behavior.
In this project, we take a different approach by trying to combine the versatility of mobile controllers with the intuitiveness of physically moving the robot. Instead of having the users move the entire robot, we created a facsimile of the Hello Robot Stretch end-effector using a cheap $25 reacher-grabber stick that can be readily bought online, and augmented it ourselves with a 3D-printed iPhone mount. We call this tool the “Stick,” which is a natural evolution of tools used in prior work [19, 24] (see Figure 4).
在这个项目中,我们采取了一种不同的