Reinforcement Learning Exercise 4.5

Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.

Here, we can use the result of exercise 3.17:
$$
Q_\pi(s,a) = \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\, \pi(s',a') \Bigr] P_{ss'}^a
$$
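As a quick sanity check, a single expected update of this equation can be written directly in code. The following is a minimal sketch, assuming (hypothetically) that the dynamics are given as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (expected rewards), and that the policy is stored as probabilities `pi[s, a]`:

```python
import numpy as np

def q_backup(Q, pi, P, R, gamma, s, a):
    """One expected update of Q_pi(s, a):
    sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * sum_{a'} pi[s', a'] * Q[s', a'])."""
    # Expected action value of each successor state under the policy:
    # sum_{a'} pi(s', a') * Q(s', a'), computed for all s' at once.
    expected_next_q = (pi * Q).sum(axis=1)            # shape: (num_states,)
    # Weight reward plus discounted successor value by the transition probabilities.
    return np.dot(P[s, a], R[s, a] + gamma * expected_next_q)
```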
Then the algorithm analogous to that on page 80 can be written as follows:
$$
\begin{aligned}
&\text{1. Initialization} \\
&\qquad Q_\pi(s,a) \in \mathbb{R} \text{ and } \pi(s) \in \mathcal{A}(s) \text{ arbitrarily for all } s \in \mathcal{S},\ a \in \mathcal{A}(s) \\
&\text{2. Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad\qquad \Delta \leftarrow 0 \\
&\qquad\qquad \text{Loop for each } (s,a) \text{ pair:} \\
&\qquad\qquad\qquad q \leftarrow Q_\pi(s,a) \\
&\qquad\qquad\qquad Q_\pi(s,a) \leftarrow \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\, \pi(s',a') \Bigr] P_{ss'}^a \\
&\qquad\qquad\qquad \Delta \leftarrow \max(\Delta, |q - Q_\pi(s,a)|) \\
&\qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3. Policy Improvement} \\
&\qquad policy\text{-}stable \leftarrow true \\
&\qquad \text{For each } s \in \mathcal{S}: \\
&\qquad\qquad old\text{-}action \leftarrow \pi(s) \\
&\qquad\qquad \pi(s) \leftarrow \operatorname{argmax}_a Q_\pi(s,a) \\
&\qquad\qquad \text{If } old\text{-}action \neq \pi(s) \text{, then } policy\text{-}stable \leftarrow false \\
&\qquad \text{If } policy\text{-}stable \text{, then stop and return } Q_\pi \approx q_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
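For concreteness, here is a minimal Python sketch of the above algorithm, again assuming hypothetical arrays `P[s, a, s']` and `R[s, a, s']` for the dynamics. The policy is kept deterministic, as in the book's algorithm, so the inner sum $\sum_{a'} \pi(s',a') Q_\pi(s',a')$ reduces to $Q_\pi(s', \pi(s'))$:

```python
import numpy as np

def policy_iteration_q(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration on action values.
    P[s, a, s'] : transition probabilities, R[s, a, s'] : expected rewards."""
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))      # 1. Initialization
    pi = np.zeros(num_states, dtype=int)         # deterministic policy pi(s)

    while True:
        # 2. Policy evaluation: sweep all (s, a) pairs until the largest change < theta.
        while True:
            delta = 0.0
            for s in range(num_states):
                for a in range(num_actions):
                    q_old = Q[s, a]
                    # Q(s,a) <- sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * Q(s', pi(s')))
                    next_q = Q[np.arange(num_states), pi]
                    Q[s, a] = np.dot(P[s, a], R[s, a] + gamma * next_q)
                    delta = max(delta, abs(q_old - Q[s, a]))
            if delta < theta:
                break

        # 3. Policy improvement: act greedily with respect to the current Q.
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):            # policy stable: Q ~ q*, pi ~ pi*
            return Q, pi
        pi = new_pi
```

As in the state-value version, a policy that is greedy with respect to its own action values can no longer be improved, so the outer loop terminates with approximations of $q_*$ and $\pi_*$.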
