Deep Q-Learning (DQN)

 


1. Introduction

Paper: "Playing Atari with Deep Reinforcement Learning" (link)

Earlier RL applications that controlled agents from high-dimensional sensory inputs such as vision and speech relied on hand-crafted features combined with linear value functions or policy representations. Advances in deep learning have since made it possible to extract high-level features directly from raw sensory data. These methods draw on a range of neural network architectures, including convolutional networks, multilayer perceptrons, restricted Boltzmann machines, and recurrent neural networks, and exploit both supervised and unsupervised learning.

1) DQN combines Q-Learning with deep learning:

  1. It uses a CNN as the value-function approximator.
  2. It uses an experience replay mechanism.

2) From a deep learning perspective, reinforcement learning poses several challenges:

  1. DL relies on large amounts of labelled data for supervised learning, whereas an RL algorithm must learn from sparse, noisy, and delayed scalar reward signals.
  2. Most deep learning algorithms assume data samples are independent, while reinforcement learning typically produces sequences of highly correlated states.
  3. In DL the target distribution is fixed; in RL the data distribution keeps changing.
  4. Representing the value function with a nonlinear network can cause instability. (2015)

3) How these are addressed:

  1. Use Q-Learning to construct training targets from the rewards (addresses challenge 1).
  2. Use experience replay to handle correlated samples and non-stationary distributions (addresses challenges 2 and 3).
  3. Use one CNN (MainNet) to produce the current Q-values and a second CNN (Target) to produce the target Q-values (addresses challenge 4) (2015); a minimal sketch of this main/target arrangement follows the list.
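
The target-network idea from the 2015 paper can be illustrated with a minimal sketch. This is not the paper's code: `q_network` stands in for any Q-value model, and `sync_every` is an assumed hyperparameter.

```python
import copy

class MainAndTargetNets:
    """Minimal sketch: a MainNet trained every step, plus a Target net whose
    weights are frozen copies refreshed only every `sync_every` updates."""

    def __init__(self, q_network, sync_every=10_000):
        self.main_net = q_network                    # produces Q(s, a; theta)
        self.target_net = copy.deepcopy(q_network)   # produces the bootstrap targets
        self.sync_every = sync_every
        self.updates = 0

    def after_gradient_step(self):
        # Call once per gradient update; periodically copy MainNet -> Target,
        # which keeps the regression targets stable between syncs.
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.target_net = copy.deepcopy(self.main_net)
```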

 


2. Formulation

We consider tasks in which an agent interacts with an environment through a sequence of actions, observations, and rewards. At each time step, the agent selects an action $a_t$ from the action set $A = \{1, \dots, K\}$. After the action is executed, the environment updates its state and the agent receives a reward.

The environment used in the paper is an Atari emulator running a set of simple Atari games. The observation is the emulator's current screen image, $x_t \in \mathbb{R}^d$, a vector of raw pixel values for the current screen. The reward $r_t$ is the change in game score.


Because the agent only observes the current screen image, the task is partially observed: the current situation cannot be fully understood from $x_t$ alone. We therefore consider sequences of actions and observations, $s_t = x_1, a_1, x_2, \dots, a_{t-1}, x_t$, and learn game-playing strategies from these sequences.

Note: all sequences in the emulator are assumed to terminate in a finite number of time steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. By using these sequences as states, standard reinforcement learning methods can then be applied.

The agent's goal is to interact with the emulator by selecting actions in a way that maximizes future rewards, discounted by a factor $\gamma$. The future discounted return at time $t$ is defined as

$$R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t} r_{t'},$$

where $T$ is the time step at which the game terminates.
The optimal action-value function $Q^*(s,a)$ is defined as the maximum expected return achievable by any policy:

$$Q^*(s,a) = \max_{\pi} \mathbb{E}\left[\, R_t \mid s_t = s,\ a_t = a,\ \pi \,\right].$$
The optimal action-value function obeys the Bellman equation:

$$Q^*(s,a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q^*(s',a') \,\Big|\, s, a \,\right].$$

For small-scale RL problems, the action-value function can be estimated by using the Bellman equation as an iterative update. Here, a function approximator is used instead to estimate the action-value function, $Q(s,a;\theta) \approx Q^*(s,a)$.
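
For reference, the iterative update referred to here is value iteration on the Bellman equation:

$$Q_{i+1}(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q_i(s',a') \,\Big|\, s, a \,\right],$$

which converges to $Q^*$ as $i \to \infty$ but is impractical at scale, since the value would be estimated separately for every sequence with no generalization; hence the move to a function approximator.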

In reinforcement learning this is typically a linear function approximator, but a nonlinear one such as a neural network is sometimes used instead. We refer to a neural-network function approximator with weights $\theta$ as a Q-network.

A Q-network can be trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that change at each iteration $i$:

$$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[\left( y_i - Q(s,a;\theta_i) \right)^2\right],$$

where the target for iteration $i$ is

$$y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \,\Big|\, s, a \,\right],$$
and $\rho(s,a)$ is a probability distribution over sequences $s$ and actions $a$, which we call the behaviour distribution.

Differentiating the loss function with respect to the weights gives the gradient

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[\left( r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i)\right],$$

and the loss can be optimized with stochastic gradient descent. A minimal sketch of the target and loss computation follows.
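
Below is a minimal NumPy sketch of how the target $y$ and the squared-error loss could be computed over a minibatch. `q_values(state, theta)` is a hypothetical function returning the vector of Q-values for all actions; a real implementation would differentiate through the network in a framework rather than looping in Python.

```python
import numpy as np

def td_target(q_values, reward, next_state, done, theta_old, gamma=0.99):
    # y = r for terminal transitions, otherwise r + gamma * max_a' Q(s', a'; theta_old).
    if done:
        return reward
    return reward + gamma * np.max(q_values(next_state, theta_old))

def dqn_loss(q_values, batch, theta, theta_old, gamma=0.99):
    # Mean squared error between the TD targets and the predicted Q(s, a; theta).
    errors = []
    for state, action, reward, next_state, done in batch:
        y = td_target(q_values, reward, next_state, done, theta_old, gamma)
        errors.append((y - q_values(state, theta)[action]) ** 2)
    return float(np.mean(errors))
```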

 


3. Deep Reinforcement Learning

The goal is to connect a reinforcement learning algorithm to a deep neural network that operates directly on RGB images and processes the training data efficiently with stochastic gradient updates.

1) Key technique:

Experience replay: at each time step, the agent's experience $e_t = (s_t, a_t, r_t, s_{t+1})$ is stored in a replay memory. This addresses the correlated-sample and non-stationary-distribution problems; a minimal buffer sketch is given below.
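
A minimal sketch of such a replay memory, assuming transitions are stored as plain tuples; the default capacity and batch size simply mirror the values quoted later in the experiments section.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive frames.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```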

2) Advantages of Deep Q-learning with experience replay:

  1. Each step of experience can potentially be reused in many weight updates, improving data efficiency.
  2. Learning directly from consecutive samples is inefficient because of the strong correlations between them; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
  3. When learning on-policy, the current parameters determine the next data samples, which can create feedback loops and trap the parameters in poor local minima. With experience replay, the behaviour distribution is averaged over many previous states, so learning is smoother and oscillation or divergence of the parameters is avoided.
  4. End-to-end training.

 


4. DQN Algorithm Pseudocode

[Algorithm 1: deep Q-learning with experience replay (pseudocode figure from the paper).]
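
The figure above is only a placeholder; as a substitute, here is a hedged Python-style sketch of the algorithm's overall structure, reusing the `ReplayMemory` sketch from Section 3. `env`, `phi`, `epsilon_greedy`, and `sgd_step` are hypothetical helpers standing in for the emulator, the preprocessing function $\phi$, the $\epsilon$-greedy behaviour policy, and one gradient step on the squared TD error; this is a sketch of the published algorithm's shape, not the authors' code.

```python
def deep_q_learning(env, q_net, phi, epsilon_greedy, sgd_step,
                    num_episodes, batch_size=32, gamma=0.99):
    memory = ReplayMemory(capacity=1_000_000)        # initialize replay memory D
    for episode in range(num_episodes):
        frames = [env.reset()] * 4                   # pad the frame history with the first frame
        state = phi(frames[-4:])                     # preprocessed state s_1
        done = False
        while not done:
            action = epsilon_greedy(q_net, state)    # with prob. epsilon pick a random action
            frame, reward, done = env.step(action)   # execute a_t in the emulator, observe r_t, x_{t+1}
            frames.append(frame)
            next_state = phi(frames[-4:])            # s_{t+1}
            memory.push(state, action, reward, next_state, done)
            state = next_state
            if len(memory) >= batch_size:
                batch = memory.sample(batch_size)    # sample a random minibatch of transitions
                sgd_step(q_net, batch, gamma)        # gradient step on (y - Q(s, a; theta))^2
    return q_net
```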

 


5. Preprocessing and Model Architecture

1) Preprocessing

Working directly with raw Atari frames, which are 210x160 pixel images with a 128-colour palette, would be computationally demanding, so a basic preprocessing step is applied to reduce the input dimensionality.

The raw frames are preprocessed by first converting their RGB representation to grayscale and then downsampling them to 110x84 images. The final input representation is obtained by cropping an 84x84 region of the image that roughly captures the playing area. This final cropping stage is required because the GPU implementation of 2D convolutions used by the authors expects square inputs. The function $\phi$ in Algorithm 1 applies this preprocessing to the last 4 frames of a history and stacks them to produce the input to the Q-function. One possible implementation is sketched below.
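
A possible implementation of this preprocessing, sketched with OpenCV and NumPy (an assumption: the paper does not specify an image library, and the exact crop offset is game-dependent, so a central crop is used here).

```python
import numpy as np
import cv2  # OpenCV, used here for grayscale conversion and resizing

def preprocess_frame(frame_rgb):
    """210x160 RGB frame -> grayscale -> 110x84 -> 84x84 crop."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)   # drop colour information
    small = cv2.resize(gray, (84, 110))                  # cv2.resize takes (width, height)
    top = (110 - 84) // 2                                # crop that roughly captures the playing area
    return small[top:top + 84, :]                        # 84x84

def phi(last_four_frames):
    # Stack the 4 most recent preprocessed frames into an 84x84x4 input for the Q-network.
    return np.stack([preprocess_frame(f) for f in last_four_frames], axis=-1)
```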

2) Model architecture

Previous approaches mapped a history-action pair to a scalar estimate of its Q-value, so both the history and the action were used as inputs to the neural network. The main drawback of this architecture is that a separate forward pass is needed to compute the Q-value of each action, so the cost scales linearly with the number of actions.

The new model architecture instead has a separate output unit for each possible action, and only the state representation is fed into the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this architecture is that the Q-values of all possible actions in a given state can be computed with a single forward pass through the network, as in the sketch below.
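
A sketch of this architecture in PyTorch (my choice of framework, not the paper's): two convolutional layers followed by a fully connected layer and one output unit per action, with layer sizes taken from the 2013 paper.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """State (4 stacked 84x84 frames) in, one Q-value per action out."""

    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84  -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one output unit per possible action
        )

    def forward(self, x):
        # A single forward pass yields Q(s, a) for every action simultaneously.
        return self.net(x)
```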

 


6. Experiments

Experiments were run on seven popular Atari games: Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, and Space Invaders. The same network architecture, learning algorithm, and hyperparameter settings were used across all seven games, showing that the approach is robust enough to handle a variety of games without incorporating any game-specific information.

While agents were evaluated on the real, unmodified games, one change was made to the reward structure during training only. Because the scale of scores varies greatly from game to game, all positive rewards were clipped to 1 and all negative rewards to -1, leaving rewards of 0 unchanged.

  1. The RMSProp algorithm was used with minibatches of size 32.
  2. The behaviour policy during training was $\epsilon$-greedy, with $\epsilon$ annealed linearly from 1 down to a fixed value of 0.1 (a sketch of this schedule follows the list).
  3. Training ran for a total of 10 million frames, with a replay memory holding the 1 million most recent frames.
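
A minimal sketch of the linear annealing schedule for $\epsilon$; the paper anneals over roughly the first million frames, taken here as an assumed `anneal_frames` parameter.

```python
def epsilon_at(frame_idx, eps_start=1.0, eps_end=0.1, anneal_frames=1_000_000):
    # Epsilon falls linearly from eps_start to eps_end over the first
    # `anneal_frames` frames and stays at eps_end afterwards.
    fraction = min(frame_idx / anneal_frames, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```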

In supervised learning, a model's progress during training is easy to track by evaluating it on training and validation sets. In reinforcement learning, however, accurately evaluating an agent's progress during training is challenging.

The evaluation metric used here is the total reward the agent collects in an episode or game, averaged over a number of games. The two leftmost plots in Figure 2 show how the average total reward evolves during training on the games Seaquest and Breakout.
Another, more stable metric is the policy's estimated action-value function Q, which estimates how much discounted reward the agent can obtain from any given state by following its policy. Before training begins, a fixed set of states is collected by running a random policy, and the average of the maximum predicted Q-value over these states is tracked. The two rightmost plots in Figure 2 show that the average predicted Q grows much more smoothly than the average total reward obtained by the agent.
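
A minimal sketch of this metric, assuming a hypothetical `q_net(state)` that returns the array of predicted Q-values for one state and a `heldout_states` set collected with a random policy before training.

```python
import numpy as np

def average_max_q(q_net, heldout_states):
    # Average of max_a Q(s, a) over a fixed, held-out set of states:
    # a smoother training-progress signal than the average episode reward.
    return float(np.mean([np.max(q_net(s)) for s in heldout_states]))
```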

[Figure 2: average total reward per episode (left) and average maximum predicted Q on a held-out set of states (right) during training on Seaquest and Breakout.]
