Reinforcement Learning Exercise 3.23

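This post works through the Bellman equation for $q_*$ for the recycling robot. It first enumerates the $q_*$ equations for each combination of state and action and introduces the relevant formula, then lists the equations separately for the ‘high’ and ‘low’ states, substitutes the corresponding values to obtain new equations, and finally rearranges them into the Bellman equations for the recycling robot.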

Exercise 3.23 Give the Bellman equation for $q_*$ for the recycling robot.
[Figure: the mechanism of the recycling robot, including its table of transition probabilities and rewards.]

To give the Bellman equation for $q_*$ for the recycling robot, we have to write out the equations for $q_*(s_h, a_s)$, $q_*(s_h, a_w)$, $q_*(s_h, a_r)$, $q_*(s_l, a_s)$, $q_*(s_l, a_w)$ and $q_*(s_l, a_r)$. Here the subscripts h, l, s, w and r denote ‘high’, ‘low’, ‘search’, ‘wait’ and ‘recharge’ respectively. In the ‘high’ state the only available actions are ‘search’ and ‘wait’, so $q_*(s_h, a_r)$ is excluded.
First, recall equation (1) from Exercise 3.22:
$$
q_*(s,a)=\sum_{s'} \Bigl\{ \bigl[ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr] P_{s,s'}^a \Bigr\} \qquad (1)
$$
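As a rough illustration of equation (1), the sketch below computes its right-hand side for a single state-action pair. The dictionary layout and the names `q_backup`, `model` and `q` are assumptions made for this example and are not part of the exercise.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: model[(s, a)] lists (next_state, probability, reward),
# i.e. the pairs P^a_{s,s'} and R^a_{s,s'} of equation (1).
Model = Dict[Tuple[str, str], List[Tuple[str, float, float]]]
QTable = Dict[Tuple[str, str], float]


def q_backup(model: Model, q: QTable, s: str, a: str, gamma: float) -> float:
    """Right-hand side of equation (1) for the pair (s, a)."""
    total = 0.0
    for s_next, prob, reward in model[(s, a)]:
        # max over a' of q(s', a'), restricted to the actions that q
        # actually defines for s' (every reachable state needs an entry).
        best_next = max(value for (s2, _), value in q.items() if s2 == s_next)
        total += prob * (reward + gamma * best_next)
    return total
```

Only the actions that have an entry in `q` for the next state enter the maximum, which is how the sketch handles states with different action sets.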
For the ‘high’ state, we have:
$$
\begin{aligned}
q_*(s_h, a_s) &= \bigl[ R_{s_h, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_h,s_h}^{a_s} + \bigl[ R_{s_h, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_h,s_l}^{a_s} \qquad (2) \\
q_*(s_h, a_w) &= \bigl[ R_{s_h, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_h,s_h}^{a_w} + \bigl[ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_h,s_l}^{a_w} \qquad (3)
\end{aligned}
$$
For the ‘low’ state, we have:
$$
\begin{aligned}
q_*(s_l, a_s) &= \bigl[ R_{s_l, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_l,s_h}^{a_s} + \bigl[ R_{s_l, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_l,s_l}^{a_s} \qquad (4) \\
q_*(s_l, a_w) &= \bigl[ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_l,s_h}^{a_w} + \bigl[ R_{s_l, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_l,s_l}^{a_w} \qquad (5) \\
q_*(s_l, a_r) &= \bigl[ R_{s_l, s_h}^{a_r} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_l,s_h}^{a_r} + \bigl[ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_l,s_l}^{a_r} \qquad (6)
\end{aligned}
$$
Then, according to the table in the picture above, $R_{s_h,s_h}^{a_s}=r_{search}$, $P_{s_h,s_h}^{a_s}=\alpha$, $R_{s_h,s_l}^{a_s}=r_{search}$, $P_{s_h,s_l}^{a_s}=1-\alpha$, and so on. Plugging these values into equations (2) to (6), we get:
$$
\begin{aligned}
q_*(s_h, a_s) &= \bigl[ r_{search} + \gamma \max_{a'} q_*(s_h,a') \bigr] \alpha + \bigl[ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr] (1-\alpha) \\
&= r_{search} + \gamma \bigl[ \alpha \max_{a'} q_*(s_h,a') + (1-\alpha) \max_{a'} q_*(s_l,a') \bigr] \qquad (7) \\
q_*(s_h, a_w) &= \bigl[ r_{wait} + \gamma \max_{a'} q_*(s_h,a') \bigr] \cdot 1 + \bigl[ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr] \cdot 0 \\
&= r_{wait} + \gamma \max_{a'} q_*(s_h,a') \qquad (8) \\
q_*(s_l, a_s) &= \bigl[ -3 + \gamma \max_{a'} q_*(s_h,a') \bigr] (1-\beta) + \bigl[ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr] \beta \\
&= \beta r_{search} - 3(1-\beta) + \gamma \bigl[ (1-\beta) \max_{a'} q_*(s_h,a') + \beta \max_{a'} q_*(s_l,a') \bigr] \qquad (9) \\
q_*(s_l, a_w) &= \bigl[ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr] \cdot 0 + \bigl[ r_{wait} + \gamma \max_{a'} q_*(s_l,a') \bigr] \cdot 1 \\
&= r_{wait} + \gamma \max_{a'} q_*(s_l,a') \qquad (10) \\
q_*(s_l, a_r) &= \bigl[ 0 + \gamma \max_{a'} q_*(s_h,a') \bigr] \cdot 1 + \bigl[ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr] \cdot 0 \\
&= \gamma \max_{a'} q_*(s_h,a') \qquad (11)
\end{aligned}
$$
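The table values used in this substitution can be collected in one small data structure, which makes the reward terms easy to check. This is only a sketch: the function name `recycling_robot_model`, the dictionary layout and the numeric values passed in at the end are assumptions chosen for illustration, and transitions with probability 0 are simply left out.

```python
def recycling_robot_model(alpha, beta, r_search, r_wait):
    """Return {(state, action): [(next_state, probability, reward), ...]}."""
    return {
        ("high", "search"): [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
        ("high", "wait"): [("high", 1.0, r_wait)],
        # Searching on a low battery: with probability 1 - beta the battery
        # runs out, the robot is rescued (reward -3) and ends up in 'high'.
        ("low", "search"): [("high", 1 - beta, -3.0), ("low", beta, r_search)],
        ("low", "wait"): [("low", 1.0, r_wait)],
        ("low", "recharge"): [("high", 1.0, 0.0)],
    }


# Assumed example parameters, not given in the exercise.
model = recycling_robot_model(alpha=0.8, beta=0.4, r_search=2.0, r_wait=1.0)

# Expected immediate reward of 'search' in the 'low' state:
# beta * r_search - 3 * (1 - beta).
expected_reward_low_search = sum(p * r for _, p, r in model[("low", "search")])
```

Summing probability times reward over the two branches of ‘search’ in the ‘low’ state gives $\beta r_{search} - 3(1-\beta)$, the constant term of that state-action pair's equation.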
In the ‘high’ state, $a'$ can only be ‘search’ or ‘wait’, while in the ‘low’ state $a'$ can be ‘search’, ‘wait’ or ‘recharge’. Writing out each maximum over its available actions, equations (7) to (11) become:
$$
\begin{aligned}
q_*(s_h, a_s) &= r_{search} + \gamma \Bigl\{ \alpha \max \bigl\{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr\} + (1-\alpha) \max \bigl\{ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr\} \Bigr\} \qquad (12) \\
q_*(s_h, a_w) &= r_{wait} + \gamma \max \bigl\{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr\} \qquad (13) \\
q_*(s_l, a_s) &= \beta r_{search} - 3(1-\beta) + \gamma \Bigl\{ (1-\beta) \max \bigl\{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr\} + \beta \max \bigl\{ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr\} \Bigr\} \qquad (14) \\
q_*(s_l, a_w) &= r_{wait} + \gamma \max \bigl\{ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr\} \qquad (15) \\
q_*(s_l, a_r) &= \gamma \max \bigl\{ q_*(s_h,a_s), q_*(s_h,a_w) \bigr\} \qquad (16)
\end{aligned}
$$
Equations (12) to (16) are the Bellman equations for $q_*$ for the recycling robot, and they can be solved in a way similar to Exercise 3.22.
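One way to get numbers out of such a system, sketched here under assumptions, is to treat the five equations as update rules and iterate them until the values stop changing (value iteration on $q$). This is not necessarily the approach of Exercise 3.22. The function name, the parameter values and the number of sweeps are all hypothetical, and the iteration relies on $\gamma < 1$. The expected reward for ‘search’ in the ‘low’ state is written as $\beta r_{search} - 3(1-\beta)$, which comes from weighting the rewards $r_{search}$ and $-3$ by their probabilities $\beta$ and $1-\beta$.

```python
def solve_recycling_robot_q(alpha, beta, r_search, r_wait, gamma, sweeps=1000):
    """Iterate the five q* equations as update rules (needs gamma < 1)."""
    q = {
        ("high", "search"): 0.0,
        ("high", "wait"): 0.0,
        ("low", "search"): 0.0,
        ("low", "wait"): 0.0,
        ("low", "recharge"): 0.0,
    }
    for _ in range(sweeps):
        # Maxima over the actions available in each state.
        best_high = max(q[("high", "search")], q[("high", "wait")])
        best_low = max(q[("low", "search")], q[("low", "wait")], q[("low", "recharge")])
        # Synchronous update: every new value uses the previous sweep's maxima.
        q = {
            ("high", "search"): r_search
            + gamma * (alpha * best_high + (1 - alpha) * best_low),
            ("high", "wait"): r_wait + gamma * best_high,
            ("low", "search"): beta * r_search - 3 * (1 - beta)
            + gamma * ((1 - beta) * best_high + beta * best_low),
            ("low", "wait"): r_wait + gamma * best_low,
            ("low", "recharge"): gamma * best_high,
        }
    return q


# Assumed example values, not taken from the exercise.
q_star = solve_recycling_robot_q(alpha=0.8, beta=0.4, r_search=2.0, r_wait=1.0, gamma=0.9)
for pair, value in q_star.items():
    print(pair, round(value, 3))
```

Because each sweep shrinks the error by a factor of at most $\gamma$, a fixed number of sweeps is enough for a sketch. A stopping test on the largest change between sweeps would be the more usual choice.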
