Reinforcement Learning Exercise 3.22

This post works through a continuing Markov decision process (MDP) in which the only decision is made in the top state, where two actions, left and right, are available, giving exactly two deterministic policies. After deriving an expression for $q_*(s,a)$ and the corresponding Bellman optimality equations, it examines the optimal policy for $\gamma = 0$, $\gamma = 0.5$, and $\gamma = 0.9$: $\pi_{left}$ is optimal for $\gamma = 0$, both policies are optimal for $\gamma = 0.5$, and $\pi_{right}$ is optimal for $\gamma = 0.9$.


Exercise 3.22 Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{left}$ and $\pi_{right}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
[Figure: the MDP from Exercise 3.22]
Before solving this problem, we have to derive an expression for $q_*(s,a)$ in terms of $R_{s,s'}^a$ and $P_{s,s'}^a$.
First,
$$
\begin{aligned}
q_*(s,a) &= \mathbb E\bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\big|\, S_t=s, A_t=a\bigr] \\
&= \sum_{s',r} p(s',r \mid s,a)\bigl[ r + \gamma \max_{a'} q_*(s',a') \bigr] \\
&= \sum_{s',r} r\,p(s',r \mid s,a) + \sum_{s',r} p(s',r \mid s,a)\,\gamma \max_{a'} q_*(s',a') \\
&= \sum_{r} r\,p(r \mid s,a) + \sum_{s'} p(s' \mid s,a)\,\gamma \max_{a'} q_*(s',a') \\
&= \mathbb E(r \mid s,a) + \sum_{s'} p(s' \mid s,a)\,\gamma \max_{a'} q_*(s',a') \\
&= \sum_{s'} \mathbb E(r \mid s',s,a)\,p(s' \mid s,a) + \sum_{s'} p(s' \mid s,a)\,\gamma \max_{a'} q_*(s',a') \\
&= \sum_{s'} \bigl[ \mathbb E(r \mid s',s,a) + \gamma \max_{a'} q_*(s',a') \bigr] p(s' \mid s,a)
\end{aligned}
$$
Denoting $\mathbb E(r \mid s',s,a) = R_{s,s'}^a$ and $p(s' \mid s,a) = P_{s,s'}^a$, we get the expression we wanted:
$$
q_*(s,a) = \sum_{s'} \bigl[ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr] P_{s,s'}^a \tag{1}
$$
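
To make equation (1) concrete, below is a minimal Python sketch of Q-value iteration for a small finite MDP. The helper name `q_value_iteration` and the arrays `P` and `R` (indexed as `[s, a, s']`, playing the roles of $P_{s,s'}^a$ and $R_{s,s'}^a$) are illustrative choices, not part of the exercise:

```python
import numpy as np

def q_value_iteration(P, R, gamma, iters=1000):
    """Repeatedly apply equation (1) until the action values stabilize.

    P[s, a, s2] -- transition probability P_{s,s2}^a
    R[s, a, s2] -- expected reward R_{s,s2}^a on the transition s -> s2 under a
    """
    n_states, n_actions, _ = P.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = q.max(axis=1)  # max_{a'} q(s', a') for every state s'
        # equation (1): q(s,a) = sum_{s'} [R_{s,s'}^a + gamma * max_{a'} q(s',a')] * P_{s,s'}^a
        q = np.einsum("saz,saz->sa", P, R + gamma * v[None, None, :])
    return q
```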
Next, we name the three states in the circles $s_A$, $s_B$, and $s_C$, and denote the left action by $a_l$ and the right action by $a_r$.
[Figure: the same MDP with the states labeled $s_A$ (top), $s_B$ (left), and $s_C$ (right)]
According to equation (1), we can write the Bellman optimality equations for $q_*$ at the three states.
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \Bigl[ R_{s_A,s_B}^{a_l} + \gamma \max_{a'} q_*(s_B, a') \Bigr] P_{s_A,s_B}^{a_l} + \Bigl[ R_{s_A,s_C}^{a_l} + \gamma \max_{a'} q_*(s_C, a') \Bigr] P_{s_A,s_C}^{a_l} \\
&= \bigl[ R_{s_A,s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A,s_B}^{a_l} + \bigl[ R_{s_A,s_C}^{a_l} + \gamma q_*(s_C, a) \bigr] P_{s_A,s_C}^{a_l} \\
q_{*,\pi_{right}}(s_A, a_r) &= \Bigl[ R_{s_A,s_B}^{a_r} + \gamma \max_{a'} q_*(s_B, a') \Bigr] P_{s_A,s_B}^{a_r} + \Bigl[ R_{s_A,s_C}^{a_r} + \gamma \max_{a'} q_*(s_C, a') \Bigr] P_{s_A,s_C}^{a_r} \\
&= \bigl[ R_{s_A,s_B}^{a_r} + \gamma q_*(s_B, a) \bigr] P_{s_A,s_B}^{a_r} + \bigl[ R_{s_A,s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A,s_C}^{a_r} \\
q_*(s_B, a) &= \Bigl[ R_{s_B,s_A}^{a} + \gamma \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] P_{s_B,s_A}^{a} \\
q_*(s_C, a) &= \Bigl[ R_{s_C,s_A}^{a} + \gamma \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] P_{s_C,s_A}^{a}
\end{aligned}
$$

The second lines of the first two equations use the fact that $s_B$ and $s_C$ each have only one available action $a$, so the maxima reduce to $q_*(s_B,a)$ and $q_*(s_C,a)$. All transitions in this MDP are deterministic; in particular the left action never leads to $s_C$ and the right action never leads to $s_B$, so $P_{s_A,s_B}^{a_r} = 0$ and $P_{s_A,s_C}^{a_l} = 0$, while all the remaining transition probabilities equal 1. Therefore

$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ R_{s_A,s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A,s_B}^{a_l} \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ R_{s_A,s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A,s_C}^{a_r}
\end{aligned}
$$
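
As a numerical cross-check of the case analysis below, here is a small standalone sketch that iterates these equations directly. The reward labels (left: +1 into $s_B$, then 0 back to $s_A$; right: 0 into $s_C$, then +2 back to $s_A$) are read off the exercise figure:

```python
# Fixed-point iteration of the Bellman optimality equations above.
# s_A is the top state, s_B the state reached by "left", s_C by "right".
for gamma in (0.0, 0.5, 0.9):
    q_left = q_right = 0.0          # q_*(s_A, a_l), q_*(s_A, a_r)
    for _ in range(10_000):
        v_a = max(q_left, q_right)  # max_a q_*(s_A, a)
        q_b = 0.0 + gamma * v_a     # q_*(s_B, a): reward 0, back to s_A
        q_c = 2.0 + gamma * v_a     # q_*(s_C, a): reward 2, back to s_A
        q_left = 1.0 + gamma * q_b  # q_*(s_A, a_l): reward 1, then s_B
        q_right = 0.0 + gamma * q_c # q_*(s_A, a_r): reward 0, then s_C
    print(f"gamma={gamma}: q_left={q_left:.4f}, q_right={q_right:.4f}")
```

Its output agrees with the cases worked out below: the left action wins at $\gamma = 0$, the two actions tie at $\gamma = 0.5$, and the right action wins at $\gamma = 0.9$.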
Now, let us discuss the cases for different values of $\gamma$.
For $\gamma = 0$:
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0 \cdot q_*(s_B, a) \bigr] \cdot 1 = 1 \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0 \cdot q_*(s_C, a) \bigr] \cdot 1 = 0
\end{aligned}
$$
So $\pi_{left}$ is the optimal policy when $\gamma = 0$.

For $\gamma = 0.5$:
$$
\begin{aligned}
q_*(s_B, a) &= \Bigl[ 0 + 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_*(s_C, a) &= \Bigl[ 2 + 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 2 + 0.5 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.5\, q_*(s_B, a) \bigr] \cdot 1 = 1 + 0.5\, q_*(s_B, a) \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.5\, q_*(s_C, a) \bigr] \cdot 1 = 0.5\, q_*(s_C, a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.5\, q_{*,\pi_{left}}(s_A, a_l) \\
q_*(s_C, a) &= 2 + 0.5\, q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5\, q_{*,\pi_{left}}(s_A, a_l) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3} \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \bigl[ 2 + 0.5\, q_{*,\pi_{left}}(s_A, a_l) \bigr] = 0.5 \Bigl( 2 + \frac{2}{3} \Bigr) \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{4}{3}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r) = \frac{4}{3}$, so the assumption holds, but only with equality. To confirm, check the opposite assumption as well.
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.5\, q_{*,\pi_{right}}(s_A, a_r) \\
q_*(s_C, a) &= 2 + 0.5\, q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \bigl[ 2 + 0.5\, q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{4}{3} \\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5\, q_{*,\pi_{right}}(s_A, a_r) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r)$ again, so this assumption also holds. The two actions have the same optimal value in $s_A$, so both $\pi_{left}$ and $\pi_{right}$ are optimal policies for $\gamma = 0.5$.
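
As a quick consistency check, substituting $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r) = \frac{4}{3}$ back into the four equations gives

$$
q_*(s_B, a) = 0.5 \cdot \tfrac{4}{3} = \tfrac{2}{3}, \qquad
q_*(s_C, a) = 2 + 0.5 \cdot \tfrac{4}{3} = \tfrac{8}{3}, \qquad
1 + 0.5 \cdot \tfrac{2}{3} = \tfrac{4}{3}, \qquad
0.5 \cdot \tfrac{8}{3} = \tfrac{4}{3},
$$

so $\frac{4}{3}$ is indeed the common optimal action value at $s_A$.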

For $\gamma = 0.9$:
$$
\begin{aligned}
q_*(s_B, a) &= \Bigl[ 0 + 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_*(s_C, a) &= \Bigl[ 2 + 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \Bigr] \cdot 1 \\
&= 2 + 0.9 \max\bigl( q_{*,\pi_{left}}(s_A, a_l),\, q_{*,\pi_{right}}(s_A, a_r) \bigr) \\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.9\, q_*(s_B, a) \bigr] \cdot 1 = 1 + 0.9\, q_*(s_B, a) \\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.9\, q_*(s_C, a) \bigr] \cdot 1 = 0.9\, q_*(s_C, a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.9\, q_{*,\pi_{left}}(s_A, a_l) \\
q_*(s_C, a) &= 2 + 0.9\, q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9\, q_{*,\pi_{left}}(s_A, a_l) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{100}{19} = \frac{500}{95} \\
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \Bigl[ 2 + 0.9 \cdot \frac{100}{19} \Bigr] = 0.9 \cdot \frac{128}{19} \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{576}{95}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = \frac{500}{95} < q_{*,\pi_{right}}(s_A, a_r) = \frac{576}{95}$, which contradicts the assumption, so the assumption fails.
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.9\, q_{*,\pi_{right}}(s_A, a_r) \\
q_*(s_C, a) &= 2 + 0.9\, q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \bigl[ 2 + 0.9\, q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{180}{19} \\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9\, q_{*,\pi_{right}}(s_A, a_r) \\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{1648}{190}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = \frac{1648}{190} < q_{*,\pi_{right}}(s_A, a_r) = \frac{180}{19} = \frac{1800}{190}$, which is consistent with the assumption. So $\pi_{right}$ is the optimal policy for $\gamma = 0.9$.
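
More generally, substituting $q_*(s_B,a) = \gamma \max_a q_*(s_A,a)$ and $q_*(s_C,a) = 2 + \gamma \max_a q_*(s_A,a)$ into the two action values at $s_A$ gives

$$
q_{*,\pi_{left}}(s_A, a_l) = 1 + \gamma^2 \max_a q_*(s_A, a), \qquad
q_{*,\pi_{right}}(s_A, a_r) = 2\gamma + \gamma^2 \max_a q_*(s_A, a),
$$

so the comparison reduces to $1$ versus $2\gamma$: $\pi_{left}$ is optimal for $\gamma < 0.5$, $\pi_{right}$ is optimal for $\gamma > 0.5$, and the two policies tie exactly at $\gamma = 0.5$, consistent with the three cases above.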
