Exercise 3.23 Give the Bellman equation for $q_*$ for the recycling robot.

This picture shows the mechanism of the recycling robot.
To give the Bellman equation for $q_*$ for the recycling robot, we have to write out equations for $q_*(s_h, a_s)$, $q_*(s_h, a_w)$, $q_*(s_h, a_r)$, $q_*(s_l, a_s)$, $q_*(s_l, a_w)$ and $q_*(s_l, a_r)$. Here the subscripts h, l, s, w, r denote ‘high’, ‘low’, ‘search’, ‘wait’ and ‘recharge’, respectively. In the ‘high’ state the only available actions are ‘search’ and ‘wait’, so $q_*(s_h, a_r)$ is excluded, which leaves five equations.
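The state-action pairs and the transition table used in this exercise can be written down as plain data, which makes it easy to check that exactly five pairs remain. This is only a minimal sketch: the numeric values of `ALPHA`, `BETA`, `R_SEARCH` and `R_WAIT` are hypothetical placeholders chosen for illustration, and the dictionary layout is an assumption, not something prescribed by the exercise.

```python
# Hypothetical parameter values, chosen only for illustration.
ALPHA, BETA = 0.8, 0.6
R_SEARCH, R_WAIT = 2.0, 1.0

# (state, action) -> list of (next_state, probability, reward)
dynamics = {
    ("high", "search"): [("high", ALPHA, R_SEARCH), ("low", 1 - ALPHA, R_SEARCH)],
    ("high", "wait"): [("high", 1.0, R_WAIT)],
    ("low", "search"): [("high", 1 - BETA, -3.0), ("low", BETA, R_SEARCH)],
    ("low", "wait"): [("low", 1.0, R_WAIT)],
    ("low", "recharge"): [("high", 1.0, 0.0)],
}

# The valid state-action pairs are simply the keys of the table.
pairs = list(dynamics)

# Sanity check: the outgoing probabilities of every pair sum to 1.
row_sums = {
    pair: sum(prob for _, prob, _ in outcomes)
    for pair, outcomes in dynamics.items()
}

for pair in pairs:
    print(pair, "probability mass:", row_sums[pair])
```

Note that `("high", "recharge")` does not appear as a key, which mirrors the exclusion of $q_*(s_h, a_r)$.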
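First, recall equation (1) from Exercise 3.22, where $P_{s,s'}^a$ is the probability of moving from $s$ to $s'$ under action $a$ and $R_{s,s'}^a$ is the reward for that transition: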
$$
q_*(s,a)=\sum_{s'} \Bigl\{ \bigl[ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr] P_{s,s'}^a \Bigr\} \qquad (1)
$$
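Equation (1) can be read as a one-step backup: for each possible next state, add the reward to the discounted best next action value, then weight by the transition probability. The sketch below illustrates this on a tiny two-state MDP; the function name `bellman_optimality_backup`, the data layout and the toy MDP itself are all hypothetical and only serve to show the mechanics.

```python
def bellman_optimality_backup(q, s, a, dynamics, actions, gamma):
    """Evaluate the right-hand side of equation (1) for one (s, a) pair.

    q        : dict mapping (state, action) -> current estimate of q*
    dynamics : dict mapping (state, action) -> list of (next_state, prob, reward)
    actions  : dict mapping state -> list of actions available in that state
    """
    total = 0.0
    for s_next, prob, reward in dynamics[(s, a)]:
        # max over a' of q(s', a'), restricted to actions available in s'
        best_next = max(q[(s_next, a_next)] for a_next in actions[s_next])
        total += prob * (reward + gamma * best_next)
    return total


# Hypothetical toy MDP: from "x" we can "go" to "y" or "stay"; "y" only allows "stay".
toy_actions = {"x": ["go", "stay"], "y": ["stay"]}
toy_dynamics = {
    ("x", "go"): [("y", 1.0, 1.0)],
    ("x", "stay"): [("x", 1.0, 0.0)],
    ("y", "stay"): [("y", 1.0, 2.0)],
}
toy_q = {("x", "go"): 0.0, ("x", "stay"): 0.0, ("y", "stay"): 4.0}

# 1.0 * (1.0 + 0.5 * 4.0) = 3.0
backed_up = bellman_optimality_backup(toy_q, "x", "go", toy_dynamics, toy_actions, gamma=0.5)
print(backed_up)
```

Restricting the inner `max` to `actions[s_next]` matters for the recycling robot, because ‘recharge’ is not available in the ‘high’ state.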
For the ‘high’ state, we have:
$$
\begin{aligned}
q_*(s_h, a_s) &= \bigl[ R_{s_h, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_h,s_h}^{a_s} + \bigl[ R_{s_h, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_h,s_l}^{a_s} \qquad (2) \\
q_*(s_h, a_w) &= \bigl[ R_{s_h, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_h,s_h}^{a_w} + \bigl[ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_h,s_l}^{a_w} \qquad (3)
\end{aligned}
$$
For the ‘low’ state, we have:
$$
\begin{aligned}
q_*(s_l, a_s) &= \bigl[ R_{s_l, s_h}^{a_s} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_l,s_h}^{a_s} + \bigl[ R_{s_l, s_l}^{a_s} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_l,s_l}^{a_s} \qquad (4) \\
q_*(s_l, a_w) &= \bigl[ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_l,s_h}^{a_w} + \bigl[ R_{s_l, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_l,s_l}^{a_w} \qquad (5) \\
q_*(s_l, a_r) &= \bigl[ R_{s_l, s_h}^{a_r} + \gamma \max_{a'} q_*(s_h,a') \bigr] P_{s_l,s_h}^{a_r} + \bigl[ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr] P_{s_l,s_l}^{a_r} \qquad (6)
\end{aligned}
$$
Then, according to the table in the picture above, $R_{s_h,s_h}^{a_s}=r_{search}$, $P_{s_h,s_h}^{a_s}=\alpha$, $R_{s_h,s_l}^{a_s}=r_{search}$, $P_{s_h,s_l}^{a_s}=1-\alpha$, and so on. Plugging these values into equations (2) to (6), we get:
$$
\begin{aligned}
q_*(s_h, a_s) &= \bigl[ r_{search} + \gamma \max_{a'} q_*(s_h,a') \bigr] \alpha + \bigl[ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr] (1-\alpha) \\
&= r_{search} + \gamma \bigl[ \alpha \max_{a'} q_*(s_h,a') + (1-\alpha) \max_{a'} q_*(s_l,a') \bigr] \qquad (7) \\
q_*(s_h, a_w) &= \bigl[ r_{wait} + \gamma \max_{a'} q_*(s_h,a') \bigr] \cdot 1 + \bigl[ R_{s_h, s_l}^{a_w} + \gamma \max_{a'} q_*(s_l,a') \bigr] \cdot 0 \\
&= r_{wait} + \gamma \max_{a'} q_*(s_h,a') \qquad (8) \\
q_*(s_l, a_s) &= \bigl[ -3 + \gamma \max_{a'} q_*(s_h,a') \bigr] (1-\beta) + \bigl[ r_{search} + \gamma \max_{a'} q_*(s_l,a') \bigr] \beta \\
&= \beta r_{search} - 3(1-\beta) + \gamma \bigl[ (1-\beta) \max_{a'} q_*(s_h,a') + \beta \max_{a'} q_*(s_l,a') \bigr] \qquad (9) \\
q_*(s_l, a_w) &= \bigl[ R_{s_l, s_h}^{a_w} + \gamma \max_{a'} q_*(s_h,a') \bigr] \cdot 0 + \bigl[ r_{wait} + \gamma \max_{a'} q_*(s_l,a') \bigr] \cdot 1 \\
&= r_{wait} + \gamma \max_{a'} q_*(s_l,a') \qquad (10) \\
q_*(s_l, a_r) &= \bigl[ 0 + \gamma \max_{a'} q_*(s_h,a') \bigr] \cdot 1 + \bigl[ R_{s_l, s_l}^{a_r} + \gamma \max_{a'} q_*(s_l,a') \bigr] \cdot 0 \\
&= \gamma \max_{a'} q_*(s_h,a') \qquad (11)
\end{aligned}
$$
In the ‘high’ state, $a'$ can only be ‘search’ or ‘wait’, while in the ‘low’ state $a'$ can be ‘search’, ‘wait’ or ‘recharge’. Writing out each maximum explicitly, equations (7) to (11) become:
$$
\begin{aligned}
q_*(s_h, a_s) &= r_{search} + \gamma \Bigl\{ \alpha \max \bigl[ q_*(s_h,a_s), q_*(s_h,a_w) \bigr] + (1-\alpha) \max \bigl[ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr] \Bigr\} \qquad (12) \\
q_*(s_h, a_w) &= r_{wait} + \gamma \max \bigl[ q_*(s_h,a_s), q_*(s_h,a_w) \bigr] \qquad (13) \\
q_*(s_l, a_s) &= \beta r_{search} - 3(1-\beta) + \gamma \Bigl\{ (1-\beta) \max \bigl[ q_*(s_h,a_s), q_*(s_h,a_w) \bigr] + \beta \max \bigl[ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr] \Bigr\} \qquad (14) \\
q_*(s_l, a_w) &= r_{wait} + \gamma \max \bigl[ q_*(s_l,a_s), q_*(s_l,a_w), q_*(s_l,a_r) \bigr] \qquad (15) \\
q_*(s_l, a_r) &= \gamma \max \bigl[ q_*(s_h,a_s), q_*(s_h,a_w) \bigr] \qquad (16)
\end{aligned}
$$
Equations (12) to (16) are the Bellman optimality equations for $q_*$ for the recycling robot, and they can be solved in a similar way to Exercise 3.22.
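One way to obtain numbers from this system is fixed-point iteration: start from zero and repeatedly apply the right-hand sides until the values stop changing, which converges for $\gamma < 1$. The sketch below is only an illustration under assumed parameter values ($\alpha=0.8$, $\beta=0.6$, $\gamma=0.9$, $r_{search}=2$, $r_{wait}=1$); the function name `solve_recycling_robot_q` is hypothetical, and this may differ from the method used in Exercise 3.22. The ‘search from low’ update is written in its expanded form, with each reward weighted by its own transition probability.

```python
def solve_recycling_robot_q(alpha, beta, gamma, r_search, r_wait,
                            tol=1e-10, max_iters=100_000):
    """Iterate the five Bellman optimality equations until they stabilise."""
    # Keys: (state, action) with h/l = high/low and s/w/r = search/wait/recharge.
    q = {("h", "s"): 0.0, ("h", "w"): 0.0,
         ("l", "s"): 0.0, ("l", "w"): 0.0, ("l", "r"): 0.0}
    for _ in range(max_iters):
        max_h = max(q[("h", "s")], q[("h", "w")])
        max_l = max(q[("l", "s")], q[("l", "w")], q[("l", "r")])
        new_q = {
            ("h", "s"): r_search + gamma * (alpha * max_h + (1 - alpha) * max_l),
            ("h", "w"): r_wait + gamma * max_h,
            # Rescue (reward -3, go to high) with prob 1-beta; stay low with prob beta.
            ("l", "s"): (1 - beta) * (-3 + gamma * max_h)
                        + beta * (r_search + gamma * max_l),
            ("l", "w"): r_wait + gamma * max_l,
            ("l", "r"): gamma * max_h,
        }
        delta = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if delta < tol:
            break
    return q


# Assumed, illustrative parameter values.
q_star = solve_recycling_robot_q(alpha=0.8, beta=0.6, gamma=0.9,
                                 r_search=2.0, r_wait=1.0)
for pair, value in q_star.items():
    print(pair, round(value, 4))
```

With these particular values, searching is the greedy action in the ‘high’ state and recharging is the greedy action in the ‘low’ state; other parameter choices can change that.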



