Combining policy gradient and Q-learning

This paper proposes PGQL, a new method that combines policy gradient with off-policy Q-learning. By drawing updates from an experience replay buffer, it allows policy gradient methods to make effective use of off-policy data. The paper also establishes an equivalence between action-value fitting techniques and actor-critic algorithms, and shows that regularized policy gradient methods can be interpreted as advantage function learning algorithms.


Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
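The key step the abstract describes, recovering Q-value estimates from the action preferences of an entropy-regularized policy and then applying Q-learning updates to them, can be sketched as follows. This is a minimal illustration based only on the abstract's description; the function names, the regularization weight alpha, and the discount gamma are assumptions for the example, not the paper's implementation.

import numpy as np

# At the fixed point of an entropy-regularized policy gradient, the policy is a
# Boltzmann distribution over the Q-values, so Q can be recovered (up to the
# state value) from the policy's action preferences.
def estimate_q_from_policy(logits, value, alpha=0.1):
    """Estimate Q(s, .) from policy logits for one state and a value estimate V(s).

    alpha is the assumed entropy-regularization weight.
    """
    logits = logits - logits.max()               # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()   # softmax policy pi(.|s)
    entropy = -(pi * np.log(pi)).sum()
    # Q_tilde(s, a) = alpha * (log pi(a|s) + H(pi(.|s))) + V(s)
    return alpha * (np.log(pi) + entropy) + value

def q_learning_residual(q_est, action, reward, q_next_max, gamma=0.99):
    """One-step Bellman residual for an off-policy (replay buffer) transition."""
    return reward + gamma * q_next_max - q_est[action]

In the combined update, the gradient of this Bellman residual on replay-buffer transitions would be added, with a small weight, to the usual actor-critic update, which is how the abstract's PGQL mixes on-policy policy gradient with off-policy Q-learning.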
Subjects: Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Cite as: arXiv:1611.01626 [cs.LG]
  (or arXiv:1611.01626v3 [cs.LG] for this version)

Submission history

From: Brendan O'Donoghue
[v1] Sat, 5 Nov 2016 10:49:37 GMT (1094kb,D)
[v2] Mon, 6 Mar 2017 12:38:42 GMT (892kb,D)
[v3] Fri, 7 Apr 2017 15:20:05 GMT (893kb,D)