DQN_advanced

DDQN

Problems with DQN:

  • Q-estimation: during training, every state has an estimated Q value; sample many different states and average their estimated Q values.
  • Q-reality: with the learned policy, actually play the game many times, compute the real Q values from the rewards obtained, and average them.

It turns out that the estimated Q values are usually higher than the actual Q values, as shown in the figure from the paper below:

(figure: estimated vs. actual Q values for DQN and Double DQN)

Why is Q-estimation > Q-reality?

Suppose all the actions available in state s yield the same true Q value, but network error causes one action's Q value to be overestimated. According to the update rule:
$$Q(s_t, a_t) \leftrightarrow r_t + \max_a Q(s_{t+1}, a)$$
the target produced for that state will be overestimated, so the Q values learned during training are overestimated as well.

It can also be seen that DDQN's actual value is higher than DQN's, which shows that the policy trained by DDQN is better than the one trained by DQN.

Solution: DDQN (Double DQN)

$$Q(s_t, a_t) \leftrightarrow r_t + Q'\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a)\big)$$

The key idea: the Q-function that selects the action and the Q-function that evaluates it are not the same. An action overestimated by Q is evaluated normally by Q', and an action overestimated by Q' will not necessarily be the one selected by Q.

Since the target-network trick already gives us two networks, DDQN requires very little change relative to DQN: Q is the network being updated, and Q' is the target network.
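As a minimal sketch (assuming a PyTorch setup where `q_net` is the online network Q and `target_net` is the target network Q'; the names, shapes, and `gamma` are illustrative, not from the original text), the DDQN target can be computed like this:

```python
import torch

def ddqn_target(q_net, target_net, reward, next_state, done, gamma=0.99):
    """Double DQN target: the online network Q selects the action,
    the target network Q' evaluates it. reward/done are float tensors of shape [batch]."""
    with torch.no_grad():
        # arg max over actions using the online network Q
        next_action = q_net(next_state).argmax(dim=1, keepdim=True)
        # evaluate that action with the target network Q'
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)
        # no bootstrapping at terminal states
        return reward + gamma * (1.0 - done) * next_q
```

The only change compared with plain DQN is that `q_net`, not `target_net`, picks the arg-max action.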

Dueling DQN

Framework

(figure: Dueling DQN network architecture)

Network outputs: a scalar V(s) and a vector A(s, a), which are summed to give Q(s, a).

Benefits:

  • If we want to increase the values of all actions in some state s1, we do not need to increase each of A(s1, a1), A(s1, a2), ... individually; increasing V(s1) alone is enough.
  • Actions that were not sampled in a given state still get updated along with the other actions, which makes the data updates more efficient.

How do we prevent the network from simply keeping V(s) ≡ 0 so that Q = A and every Q(s, a) is controlled directly through A? Answer: put a constraint on A that forces the network to prefer updating V.

An example of such a constraint: $\sum_a A(s, a) = 0$.
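A minimal sketch of a dueling head in PyTorch (the layer sizes are hypothetical). A common way to realize this constraint in practice is to subtract the mean advantage over actions, which keeps A centred around zero and pushes shared changes into V:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # scalar V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # vector A(s, a)

    def forward(self, state):
        h = self.feature(state)
        v = self.value(h)        # [batch, 1]
        a = self.advantage(h)    # [batch, n_actions]
        # subtract the mean advantage so A is centred around 0,
        # forcing changes shared by all actions to go through V(s)
        return v + a - a.mean(dim=1, keepdim=True)
```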

Prioritized Experience Replay


With experience replay, the default is to sample uniformly from the buffer.

Prioritized experience replay adjusts the sampling priority according to the TD error: data with a large TD error (a large gap between the network's output and the target) has not been learned well, so it is given a higher probability of being sampled later on.

Note that the target network and the online (output) network are different: Q and $\widehat{Q}$.
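A minimal sketch of proportional prioritization (assumptions: the priority of each transition is |TD error| raised to a hypothetical exponent `alpha`; a real implementation would also add importance-sampling weights and a sum tree for efficiency):

```python
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha
        self.data = []
        self.priorities = []

    def push(self, transition, td_error=1.0):
        # new data starts with a priority based on its TD error
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        # transitions with larger TD error are sampled more often
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # after a learning step, refresh the priorities with the new TD errors
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha
```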

Balance between MC and TD

  • The buffer stores the result of playing N steps (an N-step transition), where N is a hyperparameter.
  • $\widehat{Q}$ produces the estimated Q value at step N+1.
  • This combines both the benefits and the drawbacks of MC and TD (the N-step target is written out below).
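Written out explicitly (a sketch that follows the description above and omits discounting, consistent with the earlier formulas), the N-step target is:

$$Q(s_t, a_t) \leftrightarrow \sum_{t'=t}^{t+N} r_{t'} + \max_a \widehat{Q}(s_{t+N+1}, a)$$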

Noisy Net


Before each episode begins, Gaussian noise is added to the parameters of the Q network, giving $\widetilde{Q}$, the noisy Q-function. The noisy network is then kept fixed and used to play the game until the episode ends; fresh noise is only sampled at the start of the next episode.

  • With epsilon-greedy, the action taken in the same state is sometimes random.
  • With Noisy Net, the noise is only on the parameters, so when the agent sees the same or similar states it explores in the same, consistent way, which is closer to how exploration works in reality (see the sketch after this list).
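A minimal sketch of this episode-level procedure (assuming a PyTorch `q_net`; `sigma` is a hypothetical noise scale; note that the Noisy Net paper itself learns the noise parameters through noisy linear layers rather than using a fixed scale):

```python
import copy
import torch

def make_noisy_q(q_net, sigma=0.1):
    """Return a noisy copy of Q: Gaussian noise is added to every parameter
    once, then kept fixed for the whole episode."""
    noisy_q = copy.deepcopy(q_net)
    with torch.no_grad():
        for p in noisy_q.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy_q

# usage per episode (sketch):
# noisy_q = make_noisy_q(q_net)          # sample noise once, before the episode
# action = noisy_q(state).argmax(dim=1)  # act greedily w.r.t. the noisy network
```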

Distributional Q-function

By definition, $Q^\pi(s, a)$ is the expected value of a distribution over returns, but the same expected value can come from very different underlying distributions. If the network can output the distribution directly, there is clearly more information to exploit.

(figure: https://datawhalechina.github.io/easy-rl/chapter7/img/7.12.png)

When choosing an action we still take the one with the largest mean, but as in the figure above, once we know each action's Q-value distribution, two actions with similar means can be told apart: the one with large variance may be considered too risky, so we tend to prefer the one with smaller variance.
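A minimal sketch of a categorical output head (one common way to realize a distributional Q-function, in the style of C51; the number of atoms and the value range are hypothetical choices). Each action gets a probability distribution over a fixed support of return values, and the action is still selected by the largest mean:

```python
import torch
import torch.nn as nn

class DistributionalQHead(nn.Module):
    def __init__(self, hidden, n_actions, n_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        # fixed support of possible return values, e.g. from -10 to 10
        self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
        self.out = nn.Linear(hidden, n_actions * n_atoms)

    def forward(self, h):
        logits = self.out(h).view(-1, self.n_actions, self.n_atoms)
        probs = torch.softmax(logits, dim=-1)    # one distribution per action
        q_mean = (probs * self.support).sum(-1)  # expected Q value per action
        return probs, q_mean                     # act via q_mean.argmax(dim=1)
```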

Rainbow

Combine all of the methods above.

(figure: performance of Rainbow compared with the individual improvements)

These methods do not conflict with one another, so using all of them together gives a "seven-colored" method called Rainbow, whose curve ends up much higher than the others.

(figure: Rainbow ablation results)

The ablation shows that removing dueling hurts performance a little, while removing double makes almost no difference.

Q: Why does removing double make no difference?

A: Because once distributional DQN is added, rewards are no longer overestimated; in fact they tend to be underestimated. Distributional DQN outputs a distribution over a bounded support, and the range cannot be infinitely wide (say it only covers -10 to 10), so Q values beyond that range are cut off. This already suppresses the overestimation that double DQN was designed to fix.
