
A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation

Abstract

        We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.
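To make the phrase "a generalized version of the Bellman equation" more concrete, the following is a minimal sketch of a preference-conditioned, vector-valued Bellman backup under linear scalarization. The notation (vector reward r, linear preference ω, vector-valued Q) is inferred from the abstract rather than quoted from the paper, and the exact operator used by the authors may differ, e.g. by also taking an envelope over candidate preferences.

```latex
% Sketch: vector-valued Q-function conditioned on a linear preference \omega.
% For any fixed \omega, scalarizing with \omega^\top recovers an ordinary
% single-objective Bellman backup.
(\mathcal{T}\mathbf{Q})(s, a, \boldsymbol{\omega})
  = \mathbf{r}(s, a)
  + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
      \big[\, \mathbf{Q}(s', a^{*}, \boldsymbol{\omega}) \,\big],
\qquad
a^{*} \in \arg\max_{a'} \; \boldsymbol{\omega}^{\top} \mathbf{Q}(s', a', \boldsymbol{\omega})
```

Training a single network against a backup of this form, with ω sampled across the preference space, is one way a single parametric representation can cover all preferences and then be queried with any given ω at execution time.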

1 Introduction

        In recent years, there has been increased interest in the paradigm of multi-objective reinforcement learning (MORL), which deals with learning control policies to simultaneously optimize over several criteria. Compared to traditional RL, where the aim is to optimize for a scalar reward, the optimal policy in a multi-objective setting depends on the relative preferences among competing criteria. For example, consider a virtual assistant (Figure 1) that can communicate with a human to perform a specific task (e.g., provide weather or navigation information). Depending on the user’s relative preferences between aspects like success rate or brevity, the agent might need to follow completely different strategies. If success is all that matters (e.g., providing an accurate weather report), the agent might provide detailed responses or ask several follow-up questions. On the other hand, if brevity is crucial (e.g., while providing turn-by-turn guidance), the agent needs to find the shortest way to complete the task. In traditional RL, this is often a fixed choice made by the designer and incorporated into the scalar reward. While this suffices in cases where we know the preferences of a task beforehand, the learned policy is limited in its applicability to scenarios with different preferences. The MORL framework provides two distinct advantages – (1) reduced dependence on scalar reward design to combine different objectives, which is both a tedious manual task and can lead to unintended consequences [1], and (2) dynamic adaptation or transfer to related tasks with different preferences.
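As a concrete illustration of how linear preferences interact with a vector-valued reward, the short Python sketch below scalarizes the returns of two hypothetical dialogue strategies under two different preference vectors. The strategy names, objectives, and numbers are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical vector returns, one per dialogue strategy (values invented for
# illustration). Each entry is [task_success, brevity].
returns = {
    "detailed_followups": np.array([0.95, 0.20]),  # accurate but lengthy dialogues
    "short_and_direct":   np.array([0.70, 0.90]),  # terse but less reliable
}

# Two linear preferences over the same objectives.
preferences = {
    "weather report (success matters)": np.array([0.9, 0.1]),
    "turn-by-turn guidance (be brief)": np.array([0.2, 0.8]),
}

for task, w in preferences.items():
    # Scalarize each strategy's vector return with the preference weights w.
    best = max(returns, key=lambda s: float(w @ returns[s]))
    print(f"{task}: best strategy = {best}")
```

Running this selects a different strategy for each preference, which is exactly why a policy trained against one fixed scalar reward cannot serve both tasks.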

Figure 1: Task-oriented dialogue policy learning is a real-life example of an unknown linear preference scenario. Users may expect either briefer or more informative dialogue depending on the task.

        However, learning policies over multiple preferences under the MORL setting has proven to be quite challenging, with most prior work using one of two strategies [2]. The first is to convert the multi-objective problem into a single-objective one through various techniques [3, 4, 5, 6] and use traditional RL algorithms. These methods only learn an ‘average’ policy over the space of preferences and cannot be tailored to be optimal for specific preferences. The second strategy is to compute a set of optimal policies that encompass the entire space of possible preferences in the domain [7, 8, 9]. The main drawback of these approaches is their lack of scalability – the challenge of representing a Pareto front (or its convex approximation) of optimal policies is handled by learning several individual policies, which can grow significantly with the size of the domain.
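In contrast to maintaining a separate policy per preference, the single-model alternative described in the abstract can be pictured as one network that takes the preference vector as an extra input and emits a vector-valued Q estimate. The PyTorch sketch below is only an illustration of that idea; the class name, layer sizes, and scalarized action selection are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class PreferenceConditionedQ(nn.Module):
    """One network covering the whole preference space: input (state, preference),
    output a Q vector per action (one component per objective)."""
    def __init__(self, state_dim, n_actions, n_objectives, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.n_objectives = n_objectives
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_objectives, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives),
        )

    def forward(self, state, preference):
        # Condition on the preference by concatenating it with the state.
        x = torch.cat([state, preference], dim=-1)
        return self.net(x).view(-1, self.n_actions, self.n_objectives)

# Greedy action for a given preference w: scalarize the vector Q with w.
model = PreferenceConditionedQ(state_dim=8, n_actions=4, n_objectives=2)
state = torch.randn(1, 8)
w = torch.tensor([[0.7, 0.3]])
q = model(state, w)                      # shape: (1, n_actions, n_objectives)
scalar_q = (q * w.unsqueeze(1)).sum(-1)  # w^T Q for each action
action = scalar_q.argmax(dim=-1)
print(action)
```

Because the preference is an input rather than something baked into the reward, the same weights can be reused for any ω, which is what keeps this style of approach scalable relative to learning one policy per point on the Pareto front.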
