Yuu Jinnai¹ David Abel¹ D Ellis Hershkowitz² Michael L. Littman¹ George Konidaris¹
Abstract
We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options so that planning converges in less than a given maximum of value-iteration passes. We first show that the problem is NP-hard, even if the task is constrained to be deterministic; this is the first such complexity result for option discovery. We then present the first polynomial-time boundedly suboptimal approximation algorithm for this setting, and empirically evaluate it against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains including the classic four-rooms problem.
1. Introduction
Markov Decision Processes or MDPs (Puterman, 1994) are an expressive yet simple model of sequential decision-making environments. However, MDPs are computationally expensive to solve (Papadimitriou & Tsitsiklis, 1987; Littman, 1997; Goldsmith et al., 1997). One approach to solving such problems is to add high-level, temporally extended actions, often formalized as options (Sutton et al., 1999), to the action space. The right set of options allows planning to probe more deeply into the search space with a single computation. Thus, if options are chosen appropriately, planning algorithms can find good plans with less computation.
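For concreteness, an option in the sense of Sutton et al. (1999) is a triple of an initiation set, an internal policy, and a termination condition. The following Python sketch of that triple is purely illustrative; the class and field names are our own and do not appear in the source:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable   # placeholder state type for a tabular MDP
Action = Hashable  # placeholder primitive-action type

@dataclass
class Option:
    """A temporally extended action (Sutton et al., 1999)."""
    initiation_set: Set[State]                  # states where the option may start
    policy: Callable[[State], Action]           # primitive action chosen while running
    termination_prob: Callable[[State], float]  # beta(s): chance of stopping in state s
```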
Indeed, previous work has offered substantial support that abstract actions can accelerate planning (Mann & Mannor, 2014). However, little is known about how to find the right set of options to use for planning. Prior work often seeks to codify an intuitive notion of what underlies an effective option, such as identifying relatively novel states (Şimşek & Barto, 2004), identifying bottleneck states or high-betweenness states (Şimşek et al., 2005; Şimşek & Barto, 2009; Bacon, 2013; Moradi et al., 2012), finding repeated policy fragments (Pickett & Barto, 2002), or finding states that often occur on successful trajectories (McGovern & Barto, 2001; Bakker & Schmidhuber, 2004). While such intuitions often capture important aspects of the role of options in planning, the resulting algorithms are somewhat heuristic in that they are not based on optimizing any precise performance-related metric; consequently, their relative performance can only be evaluated empirically.
We aim to formalize what it means to find the set of options that is optimal for planning, and to use the resulting formalization to develop an approximation algorithm with a principled theoretical foundation. Specifically, we consider the problem of finding the smallest set of options so that planning converges in fewer than a given maximum number of iterations of the planning algorithm, value iteration (VI). Our main result shows that this problem is NP-hard. More precisely, the problem cannot be approximated in polynomial time within a factor of $2^{\log^{1-\epsilon} n}$ for any $\epsilon > 0$, unless $\mathrm{NP} \subseteq \mathrm{DTIME}(n^{\mathrm{polylog}(n)})$.
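As an illustrative sketch only, the fragment below counts the value-iteration sweeps a tabular MDP needs to converge; the array layout, the stopping threshold eps, and the function name are our assumptions rather than the paper's formal definition:

```python
import numpy as np

def vi_sweeps_to_converge(P, R, gamma, eps=1e-6, max_sweeps=10_000):
    """Count VI sweeps until no state value changes by more than eps.

    P: (A, S, S) array, P[a, s, s2] = Pr(s2 | s, a)  -- assumed layout
    R: (S, A) array of expected immediate rewards    -- assumed layout
    """
    _, num_states, _ = P.shape
    V = np.zeros(num_states)
    for sweep in range(1, max_sweeps + 1):
        Q = R + gamma * np.einsum('asp,p->sa', P, V)  # one Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return sweep
        V = V_new
    return max_sweeps
```

Under this view, adding an option amounts to appending its multi-time-step transition and reward model as one more "action," which is what lets a single backup propagate value across many primitive steps.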
In Section 4, we present A-MOMI, a polynomial-time approximation algorithm that has O(n) suboptimality in general and O(log n) suboptimality for deterministic MDPs. The expression $2^{\log^{1-\epsilon} n}$ is only slightly smaller than $n$: if $\epsilon = 0$ then $\Omega(2^{\log n}) = \Omega(n)$. Thus, the inapproximability result implies that A-MOMI is close to the best possible approximation factor.
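To see why this factor is nearly linear, one can rewrite it as follows (a standard manipulation with logarithms base 2, not taken from the paper):

```latex
2^{\log^{1-\epsilon} n}
  = 2^{(\log n)^{1-\epsilon}}
  = \bigl(2^{\log n}\bigr)^{(\log n)^{-\epsilon}}
  = n^{1/\log^{\epsilon} n}.
% As \epsilon \to 0 the exponent 1/\log^{\epsilon} n tends to 1, recovering n;
% for any fixed \epsilon \in (0, 1) the factor still grows faster than every
% polylogarithmic function of n.
```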
In addition, we consider the complementary problem of finding a set of k options that minimizes the number of VI iterations until convergence. We show that this problem is also NP-hard, even for a deterministic MDP.
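Since the hardness result rules out an efficient exact method in general, the natural exact baseline is exhaustive search over size-k subsets. The sketch below is illustrative only; sweeps_with is a hypothetical scorer (e.g., value iteration run on the MDP augmented with the candidate options):

```python
from itertools import combinations

def best_k_options(candidates, k, sweeps_with):
    """Exhaustive search: return the size-k option set with fewest VI sweeps.

    sweeps_with: hypothetical function mapping a tuple of options to the
    number of VI sweeps to convergence on the augmented MDP.
    Runtime is C(len(candidates), k) evaluations -- exponential in k.
    """
    return min(combinations(candidates, k), key=sweeps_with)
```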
Finally, we empirically evaluate the performance of two heuristic approaches for option discovery, betweenness options (Şimşek & Barto, 2009) and Eigenoptions (Machado et al., 2017), against the proposed approximation algorithms and the optimal options in standard grid domains.
2. Background
We first provide background on Markov Decision Processes (MDPs), planning, and options.
Markov Decision Processes and Planning