Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation


License: CC 4.0 BY-SA


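This post introduces hierarchical-DQN (h-DQN), a framework that integrates hierarchical value functions operating at different temporal scales with intrinsically motivated deep reinforcement learning. It has two levels of modules: the top-level module (meta-controller) selects goals, and the lower-level module (controller) takes actions based on the current state and the selected goal until the goal is reached or the episode ends. Experiments show that this approach explores more effectively in ATARI games with delayed rewards.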


Research Topic

Learning goal-directed behavior in environments with sparse feedback is a major challenge for reinforcement learning algorithms.
Two terms deserve attention here: goal-directed behavior and sparse feedback.

The paper proposes hierarchical-DQN (h-DQN), a framework to integrate hierarchical value functions, operating at different temporal scales, with intrinsically motivated deep reinforcement learning.

The model makes decisions over two levels of hierarchy:

The top-level module (meta-controller)
takes in the state and picks a new goal.
The lower-level module (controller)
uses both the state and the chosen goal to select actions, either until the goal is reached or until the episode terminates.
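The two-level decision loop described above can be sketched as follows. This is only a minimal illustration, not the paper's implementation: the `Corridor` environment, the goal set, and both placeholder policies are hypothetical stand-ins for the learned networks.

```python
import random

class Corridor:
    """Hypothetical toy environment: positions 0..n-1, the episode ends at n-1."""
    def __init__(self, n=6):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor
        self.state = min(self.n - 1, max(0, self.state + action))
        done = self.state == self.n - 1
        extrinsic = 1.0 if done else 0.0
        return self.state, extrinsic, done

def meta_controller(state, goals):
    """Placeholder policy over goals: pick any goal other than the current state."""
    return random.choice([g for g in goals if g != state])

def controller(state, goal):
    """Placeholder goal-conditioned policy: step toward the goal."""
    return 1 if goal > state else -1

def run_episode(env, goals, max_steps=50):
    state = env.reset()
    done, steps, trace = False, 0, []
    while not done and steps < max_steps:
        goal = meta_controller(state, goals)        # top level: choose a goal
        while not done and steps < max_steps and state != goal:
            action = controller(state, goal)        # lower level: act toward the goal
            state, _, done = env.step(action)
            steps += 1
        trace.append((goal, state))                 # goal attempted, state where control returned
    return trace
```

The inner loop hands control back to the meta-controller only when the goal is reached, the episode ends, or the step budget runs out, which is what makes the top level operate on a slower time-scale.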

In their work, they propose a scheme for temporal abstraction that involves simultaneously learning options and a control policy to compose options in a deep reinforcement learning setting.

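It is worth explaining intrinsic motivation and extrinsic motivation here; both are terms from psychology:

intrinsic motivation
People who rely on an internal system of evaluation care little about what others think of them; their drive to act comes from within. Intrinsic motivation is a natural force that promotes learning and development, and it can stimulate behavior in the absence of external rewards and pressure.
extrinsic motivation
People who rely on an external system of evaluation care a great deal about what others think of them, and may even internalize those judgments and believe that is who they are. When they act, their first consideration is how others will see and judge it. Their drive to act often comes from seeking approval, money, and similar external rewards.

Current reinforcement learning research on agents focuses almost entirely on extrinsic motivation. External reinforcement is generally regarded as a necessary condition for eliciting extrinsic motivation: under reinforcement, the individual comes to expect the next reinforcement and so takes obtaining it as the goal of its behavior.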

Model

agents

Current exploration methods (e.g. $\epsilon$-greedy) are useful only for local exploration and fail to provide impetus for the agent to explore different areas of the state space.
To address this, the paper introduces an important concept: goals.
Goals provide intrinsic motivation for the agent. The agent focuses on setting and achieving sequences of goals in order to maximize cumulative extrinsic reward.

Use the temporal abstraction of options to define a policy $\pi_{g}$ for each goal $g$.
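As a small illustration of the $\epsilon$-greedy rule mentioned above, here is a sketch of a selector that could serve both levels: over actions for a fixed goal, or over goals for the meta-controller. The function name and the dictionary representation of Q-values are assumptions made for this example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a key of q_values: uniformly at random with probability epsilon,
    otherwise the key with the highest estimated value.

    q_values is a dict mapping each choice (an action or a goal) to its estimate.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Hypothetical estimates for one state and one goal.
q_for_goal = {"left": 0.2, "right": 0.7, "jump": 0.1}
greedy_choice = epsilon_greedy(q_for_goal, epsilon=0.0)
```

With a fixed goal this only perturbs individual actions, which is why on its own it gives local exploration; varying the goal is what moves the agent to different regions of the state space.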

The paper therefore has two objectives:

learning option policy
learning the optimal sequence of goals to follow

Temporal Abstraction

The role of the critic:
The internal critic is responsible for evaluating whether a goal has been reached and providing an appropriate reward $r_{t}(g)$ to the controller.
The intrinsic reward functions are dynamic and temporally dependent on the sequential history of goals.

Deep Reinforcement Learning with Temporal Abstraction

The paper uses the deep Q-Learning framework to learn policies for both the controller and the meta-controller.

the controller estimates the following Q-value function:
the meta-controller estimates the following Q-value function:
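For reference, the two Q-value functions can be written as follows. This is a reconstruction based on the definitions in the h-DQN paper as I understand them, so treat the exact form as an assumption. Here $\gamma$ is the discount factor, $r_{t}$ is the intrinsic reward from the critic, $f_{t}$ is the extrinsic reward from the environment, $\pi_{ag}$ is the action policy, $N$ is the number of steps until the controller halts for the current goal, and in the second equation $\pi_{g}$ denotes the policy over goals.

$$
Q_{1}^{*}(s, a; g) = \max_{\pi_{ag}} \mathbb{E}\Big[ r_{t} + \gamma \max_{a_{t+1}} Q_{1}^{*}(s_{t+1}, a_{t+1}; g) \,\Big|\, s_{t} = s,\ a_{t} = a,\ g_{t} = g,\ \pi_{ag} \Big]
$$

$$
Q_{2}^{*}(s, g) = \max_{\pi_{g}} \mathbb{E}\Big[ \sum_{t'=t}^{t+N} f_{t'} + \gamma \max_{g'} Q_{2}^{*}(s_{t+N}, g') \,\Big|\, s_{t} = s,\ g_{t} = g,\ \pi_{g} \Big]
$$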

It is important to note that the transitions $(s_{t}, g_{t}, f_{t}, s_{t+N})$ generated by $Q_{2}$ run at a slower time-scale than the transitions $(s_{t}, a_{t}, g_{t}, r_{t}, s_{t+1})$ generated by $Q_{1}$.
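To make the two time-scales concrete, here is a sketch of how one goal attempt could be stored: one controller transition per environment step, and a single meta-controller transition for the whole attempt. The tuple fields mirror the transitions above; the buffer names, buffer sizes, and the helper `record_goal_attempt` are assumptions for this example.

```python
from collections import deque, namedtuple

ControllerTransition = namedtuple("ControllerTransition", "s a g r s_next")
MetaTransition = namedtuple("MetaTransition", "s g f s_next")

def record_goal_attempt(d1, d2, states, actions, goal, intrinsic, extrinsic):
    """Store one goal attempt of N = len(actions) steps.

    states has N + 1 entries; intrinsic and extrinsic hold one reward per step.
    """
    for t, action in enumerate(actions):
        d1.append(ControllerTransition(states[t], action, goal,
                                       intrinsic[t], states[t + 1]))
    # One slow transition: extrinsic reward summed over the N steps.
    d2.append(MetaTransition(states[0], goal, sum(extrinsic), states[-1]))

d1, d2 = deque(maxlen=50_000), deque(maxlen=50_000)
record_goal_attempt(d1, d2, states=[0, 1, 2, 3], actions=[1, 1, 1], goal=3,
                    intrinsic=[0.0, 0.0, 1.0], extrinsic=[0.0, 0.0, 0.0])
```

After this call the controller memory holds three transitions and the meta-controller memory holds one, so the meta-controller sees far fewer samples per episode.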

Learning Algorithm

Parameters of h-DQN are learned using stochastic gradient descent at different time-scales.
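The update at each level can be illustrated with a tabular stand-in for the deep networks. This is a simplification and not the paper's method: h-DQN trains neural networks on minibatches from replay memory, whereas the sketch below applies the same temporal-difference targets to dictionary-backed tables. The constants and function names are assumptions.

```python
from collections import defaultdict

GAMMA = 0.99   # discount factor (assumed value)
ALPHA = 0.1    # step size (assumed value)

def update_controller(q1, s, a, g, r, s_next, actions, terminal):
    """One TD step for Q1(s, a; g) using the intrinsic reward r."""
    if terminal:
        target = r
    else:
        target = r + GAMMA * max(q1[(s_next, b, g)] for b in actions)
    q1[(s, a, g)] += ALPHA * (target - q1[(s, a, g)])

def update_meta(q2, s, g, f, s_next, goals, terminal):
    """One TD step for Q2(s, g) using the extrinsic reward f summed over N steps."""
    if terminal:
        target = f
    else:
        target = f + GAMMA * max(q2[(s_next, h)] for h in goals)
    q2[(s, g)] += ALPHA * (target - q2[(s, g)])

q1, q2 = defaultdict(float), defaultdict(float)
update_controller(q1, s=0, a=1, g=2, r=1.0, s_next=1, actions=[-1, 1], terminal=True)
update_meta(q2, s=0, g=2, f=3.0, s_next=2, goals=[2, 4], terminal=True)
```

The controller update would be called after every environment step, while the meta-controller update would be called only when a goal attempt ends, which is the sense in which the two sets of parameters are learned at different time-scales.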

Experiments

ATARI game with delayed rewards

Model Architecture

The internal critic is defined in the space of $\langle \text{entity1}, \text{relation}, \text{entity2} \rangle$, where relation is a function over configurations of the entities.
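A critic over such triples might look like the sketch below. The entity names, the single `reaches` relation, the position dictionary, and the binary reward are all assumptions chosen for illustration; the post does not specify them.

```python
# Relations are functions over the configurations (here: positions) of two entities.
RELATIONS = {
    "reaches": lambda pos1, pos2: pos1 == pos2,
}

def internal_critic(goal, positions):
    """Evaluate a goal given as (entity1, relation, entity2).

    Returns (intrinsic_reward, goal_reached). The reward is assumed to be binary.
    """
    entity1, relation, entity2 = goal
    satisfied = RELATIONS[relation](positions[entity1], positions[entity2])
    return (1.0 if satisfied else 0.0), satisfied

# Hypothetical entity positions extracted from a frame.
positions = {"agent": (3, 4), "key": (3, 4), "door": (0, 0)}
reward_key, reached_key = internal_critic(("agent", "reaches", "key"), positions)
reward_door, reached_door = internal_critic(("agent", "reaches", "door"), positions)
```

Keeping relations in a lookup table means new goal types can be added without changing the critic itself.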

Training Procedure

First Phase
Set the exploration parameter $\epsilon_{2}$ of the meta-controller to 1 and train the controller on actions. This effectively pre-trains the controller so that it can learn to solve a subset of the goals.
Second Phase
Jointly train the controller and the meta-controller.
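The two phases can be expressed as an exploration schedule. In this sketch the phase length, the annealing horizon, the end value of 0.1, and the linear shape are all assumed values; only the rule that $\epsilon_{2}$ stays at 1 during the first phase comes from the description above.

```python
def linear_anneal(step, start, end, duration):
    """Linearly interpolate from start to end over `duration` steps, then hold."""
    frac = min(1.0, max(0.0, step / duration))
    return start + frac * (end - start)

def exploration_rates(step, phase1_steps=1000, anneal_steps=5000):
    """Return (eps1, eps2) for the controller and meta-controller at a training step."""
    eps1 = linear_anneal(step, 1.0, 0.1, anneal_steps)
    if step < phase1_steps:
        eps2 = 1.0   # first phase: goals are picked uniformly at random
    else:
        # second phase: the meta-controller starts to exploit its own estimates
        eps2 = linear_anneal(step - phase1_steps, 1.0, 0.1, anneal_steps)
    return eps1, eps2
```

During the first phase the controller sees every goal roughly equally often, so it has working goal policies before the meta-controller begins to rely on them.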