ABSTRACT
A crucial capability of real-world intelligent agents is their ability to plan a sequence of actions to achieve their goals in the visual world. In this work, we address the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state. Doing so entails knowledge about objects and their affordances, as well as actions and their preconditions and effects. We propose learning these through interacting with a visual and dynamic environment. Our proposed solution involves bootstrapping reinforcement learning with imitation learning. To ensure cross-task generalization, we develop a deep predictive model based on successor representations. Our experiments show near-optimal results across a wide range of tasks in the challenging THOR environment.
AGENT
The concept of an agent was introduced by Minsky in his 1986 book The Society of Mind. Minsky argued that certain individuals in a society can arrive at a solution to a problem through negotiation, and that these individuals are agents. He also held that agents should possess social interactivity and intelligence. From then on, the concept of the agent entered artificial intelligence and computer science, and it quickly became a research focus.
THOR
AI2-THOR was developed jointly by the Allen Institute for AI (AI2), Stanford University, Carnegie Mellon University, the University of Washington, and the University of Southern California. It provides AI agents with a highly realistic world rendered in the style of interior-design imagery, in which an agent can interact with various furniture and appliances, such as opening a fridge, knocking over a chair, or putting a laptop on a table.
To make the agent's interactions with a scene as close to reality as possible, AI2-THOR offers not only the high-quality 3D scenes visible on the surface but also the Unity 3D engine underneath, which makes objects in the scene move according to real-world physics, so that interactions behave as realistically as possible.
In addition, AI2-THOR provides a Python API. [1]
1 INTRODUCTION
Humans demonstrate levels of visual understanding that go well beyond current formulations of mainstream vision tasks (e.g., object detection, scene recognition, image segmentation). A key element of visual intelligence is the ability to interact with the environment and plan a sequence of actions to achieve specific goals; this, in fact, is central to the survival of agents in dynamic environments.
Visual semantic planning, the task of interacting with a visual world and predicting a sequence of actions that achieves a desired goal, involves addressing several challenging problems. For example, imagine the simple task of putting the bowl in the microwave in the visual dynamic environment depicted in Figure 1. A successful plan involves first finding the bowl, navigating to it, then grabbing it, followed by finding and navigating to the microwave, opening the microwave, and finally putting the bowl in the microwave.
The first challenge in visual planning is that performing each of the above actions in a visual dynamic environment requires deep visual understanding of that environment, including the set of possible actions, their preconditions and effects, and object affordances. For example, to open a microwave an agent needs to know that it should be in front of the microwave; it should also be aware of the state of the microwave and not try to open an already opened microwave. The long explorations that some tasks require impose the second challenge. The variability of visual observations and possible actions makes naïve exploration intractable. To find a cup, the agent might need to search several cabinets one by one. The third challenge is emitting a sequence of actions such that the agent ends in the goal state and the effects of the preceding actions meet the preconditions of the succeeding ones. Finally, a satisfactory solution to visual planning should enable cross-task transfer; previous knowledge about one task should make it easier to learn the next one. This is the fourth challenge.
In this paper, we address visual semantic planning as a policy learning problem. We mainly focus on high-level actions and do not take into account the low-level details of motor control and motion planning.
Visual Semantic Planning (VSP) is the task of predicting a sequence of semantic actions that moves an agent from a random initial state in a visual dynamic environment to a given goal state.
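To make this setup concrete, the following is a minimal sketch of the VSP episode loop; the environment wrapper, `env.step` signature, and action tuples are illustrative assumptions, not the actual THOR API:

```python
# A minimal sketch of the VSP loop, assuming a gym-like environment wrapper.
# All names here are illustrative placeholders, not the THOR framework's API.
def run_episode(env, policy, max_steps=100):
    obs = env.reset()                  # random initial state; obs is an image
    for _ in range(max_steps):
        action = policy(obs)           # semantic action, e.g. ("Open", "Microwave")
        obs, reached_goal = env.step(action)
        if reached_goal:               # episode succeeds once the goal state holds
            return True
    return False                       # no plan found within the step budget
```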
To address the first challenge, one needs to find a way to represent the required knowledge of objects, actions, and the visual environment. One possible way is to learn these from still images or videos [12, 51, 52]. But we argue that learning high-level knowledge about actions and their preconditions and effects requires an active and prolonged interaction with the environment. In this paper, we take an interaction-centric approach where we learn this knowledge through interacting with the visual dynamic environment. Learning by interaction on real robots has limited scalability due to the complexity and cost of robotics systems [39, 40, 49]. A common treatment is to use simulation as mental rehearsal before real-world deployment [4, 21, 26, 53, 54]. For this purpose, we use the THOR framework [54], extending it to enable interactions with objects, where an action is specified by its pre- and post-conditions in a formal language.
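As an illustration of what specifying an action by its pre- and post-conditions can look like, here is a STRIPS-style sketch; the class, field names, and symbolic facts are hypothetical and do not reproduce the actual planning language used in the extended framework:

```python
from dataclasses import dataclass, field

# Illustrative STRIPS-style action schema: an action is applicable only when its
# preconditions hold in the symbolic state, and applying it rewrites the state
# via its add and delete effects.
@dataclass
class Action:
    name: str
    preconditions: set = field(default_factory=set)
    add_effects: set = field(default_factory=set)     # facts made true
    delete_effects: set = field(default_factory=set)  # facts made false

    def applicable(self, state: set) -> bool:
        return self.preconditions <= state            # subset test

    def apply(self, state: set) -> set:
        assert self.applicable(state)
        return (state - self.delete_effects) | self.add_effects

# Hypothetical example: opening the microwave requires being in front of it
# and that it is currently closed.
open_microwave = Action(
    name="Open(Microwave)",
    preconditions={"At(Agent, Microwave)", "Closed(Microwave)"},
    add_effects={"Opened(Microwave)"},
    delete_effects={"Closed(Microwave)"},
)
```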
To address the second and third challenges, we cast VSP as a policy learning problem, typically tackled by reinforcement learning [11, 16, 22, 30, 35, 46]. To deal with the large action space and delayed rewards, we use imitation learning to bootstrap reinforcement learning and to guide exploration. To address the fourth challenge of cross-task generalization [25], we develop a deep predictive model based on successor representations [7, 24] that decouples environment dynamics and task rewards, such that knowledge from trained tasks can be transferred to new tasks with theoretical guarantees [3].
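To make the decoupling explicit, here is the successor-feature factorization in the standard form used in [3, 24] (the notation is ours, not copied from this paper): with state-action features $\phi(s,a)$ and task reward assumed linear in those features, $r(s,a) \approx \phi(s,a)^{\top} w$,

\[
\psi^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t, a_t) \;\middle|\; s_0 = s,\; a_0 = a \right],
\qquad
Q^{\pi}(s,a) \;=\; \psi^{\pi}(s,a)^{\top} w .
\]

Here $\psi^{\pi}$ summarizes the environment dynamics under the policy, while $w$ encodes the task reward, so transferring to a new task amounts to estimating a new $w$ over fixed successor features.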
In summary, we address the problem of visual semantic planning and propose an interaction-centric solution. Our proposed model obtains near-optimal results across a spectrum of tasks in the challenging THOR environment. Our results also show that our deep successor representation offers crucial transferability properties. Finally, our qualitative results show that our learned representation can encode visual knowledge of objects, actions, and environments.
2 Related Work
Task planning. Task-level planning [10, 13, 20, 47, 48] addresses the problem of finding a high-level plan for performing a task. These methods typically work with high-level formal languages and low-dimensional state spaces. In contrast, visual semantic planning is particularly challenging due to the high dimensionality and partial observability of visual input. In addition, our method facilitates generalization across tasks, while previous methods are typically designed for a specific environment and task.
Perception and interaction. Our work integrates perception and interaction, where an agent actively interfaces with the environment to learn policies that map pixels to actions. The synergy between perception and interaction has drawn an increasing interest in the vision and robotics community. Recent work has enabled faster learning and produced more robust visual representations [1, 32, 39] through interaction. Some early successes have been shown in physical understanding [9, 26, 28, 36] by interacting with an environment.
Deep reinforcement learning. Recent work in reinforcement learning has started to exploit the power of deep neural networks. Deep RL methods have shown success in several domains such as video games [35], board games [46], and continuous control [30]. Model-free RL methods (e.g., [30, 34, 35]) aim at learning to behave solely from actions and their environment feedback, while model-based RL approaches (e.g., [8, 44, 50]) also estimate an environment model. Successor representation (SR), proposed by Dayan [7], can be considered a hybrid of model-based and model-free RL. Barreto et al. [3] derived a bound on the value functions of an optimal policy when transferring policies using successor representations. Kulkarni et al. [24] proposed a method to learn deep successor features. In this work, we propose a new SR architecture with significantly reduced parameters, especially in large action spaces, to facilitate model convergence. We propose to first train the model with imitation learning and then fine-tune it with RL. This enables us to perform more realistic tasks and offers significant benefits for transfer to new tasks.
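A minimal numeric sketch of why this factorization aids transfer (the shapes, dimensions, and random weights below are assumptions for illustration, not our actual architecture): once successor features are in hand, a new task only requires a new reward-weight vector.

```python
import numpy as np

# Successor-feature transfer sketch. phi(s, a) are immediate state-action
# features; psi(s, a) is their discounted future sum under the current policy.
D = 64                                   # feature dimension (assumed)
n_actions = 10                           # size of the action space (assumed)

rng = np.random.default_rng(0)
psi = rng.normal(size=(n_actions, D))    # psi(s, a) for all actions at one state s

# Rewards are assumed linear in the features: r = phi . w, hence Q = psi . w.
w_task1 = rng.normal(size=D)             # reward weights fit for a trained task
w_task2 = rng.normal(size=D)             # weights for a new task; psi is reused as-is

q_task1 = psi @ w_task1                  # Q-values for all actions under task 1
q_task2 = psi @ w_task2                  # transfer: same dynamics knowledge, new reward
best_action = int(np.argmax(q_task2))    # greedy action on the new task
```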
Learning from demonstrations. Expert demonstrations offer a source of supervision in tasks that must otherwise be learned through copious random exploration. A line of work that interleaves policy execution with learning from expert demonstrations has achieved good practical results [6, 43]. Recent works have employed a series of new techniques for imitation learning, such as generative adversarial networks [19, 29], Monte Carlo tree search [17], and guided policy search [27], to learn end-to-end policies from pixels to actions.
Synthetic data for visual tasks. Computer games and simulated platforms have been used for training perceptual tasks, such as semantic segmentation [18], pedestrian detection [33], pose estimation [38], and urban driving [5, 41, 42, 45]. In robotics, there is a long history of using simulated environments for learning and testing before real-world deployment [23]. Several interactive platforms have been proposed for learning control with visual inputs [4, 21, 26, 53, 54]. Among these, THOR [54] provides high-quality realistic indoor scenes. Our work extends THOR with a new set of actions and the integration of a planner.
3 Interactive Framework
To enable interactions with objects and with the environment, we extend the THOR framework [54], which has been used for learning visual navigation tasks. Our extension includes new object states and a discrete description of the scene in a planning language.
Figure 2. Example images that demonstrate the state changes before and after an object interaction for each of the six action types in our framework. Each action changes the visual state, and certain actions may enable further interactions, such as opening the fridge before taking an object from it.
3.1 Scenes
In this work, we focus on kitchen scenes, as they allow for a variety of tasks with objects from many categories. Our extended THOR framework consists of 10 individual kitchen scenes. Each scene contains an average of 53 distinct objects with which the agent can interact. The scenes are developed using the Unity 3D game engine.