Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.
Here, we can use the result of exercise 3.17:
$$
Q_\pi(s,a) = \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\, \pi(s',a') \Bigr] P_{ss'}^a
$$
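As a small sketch of how this equation becomes an expected update, here is the backup for a single $(s,a)$ pair in Python. The array names `P[s, a, s']`, `R[s, a, s']`, and `pi[s, a]` are illustrative assumptions (a tabular MDP), not notation from the book.

```python
import numpy as np

# Assumed tabular arrays (illustrative, not from the book):
#   P[s, a, s1]  transition probability P^a_{ss'}
#   R[s, a, s1]  expected reward R^a_{ss'}
#   pi[s, a]     probability the policy selects a in s
#   Q            array of shape (n_states, n_actions)

def q_backup(Q, P, R, pi, gamma, s, a):
    """One expected update of Q_pi(s, a) from the equation above."""
    # E_{a' ~ pi}[Q(s', a')] for every successor state s'
    expected_next_q = (pi * Q).sum(axis=1)
    # sum over s' of P^a_{ss'} * (R^a_{ss'} + gamma * expected_next_q[s'])
    return np.sum(P[s, a] * (R[s, a] + gamma * expected_next_q))
```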
Then the algorithm analogous to that on page 80 looks like this:
$$
\begin{aligned}
&\text{1 Initialization} \\
&\qquad Q_\pi(s, a) \in \mathbb R \text{ and } \pi(s) \in \mathcal A(s) \text{ arbitrarily for all } s \in \mathcal S \text{ and } a \in \mathcal A(s) \\
&\text{2 Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad \qquad \Delta \leftarrow 0 \\
&\qquad \qquad \text{Loop for each }(s,a)\text{ pair, } s \in \mathcal S \text{ and } a \in \mathcal A(s): \\
&\qquad \qquad \qquad q \leftarrow Q_\pi(s,a) \\
&\qquad \qquad \qquad Q_\pi(s,a) \leftarrow \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\, \pi(s',a') \Bigr] P_{ss'}^a \\
&\qquad \qquad \qquad \Delta \leftarrow \max (\Delta , |q-Q_\pi(s,a)|) \\
&\qquad \text{until }\Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3 Policy Improvement} \\
&\qquad \textit{policy-stable} \leftarrow \textit{true} \\
&\qquad \text{For each } s \in \mathcal S: \\
&\qquad \qquad \textit{old-action} \leftarrow \pi(s) \\
&\qquad \qquad \pi(s) \leftarrow \operatorname{argmax}_{a} Q_\pi(s,a) \\
&\qquad \qquad \text{If } \textit{old-action} \neq \pi(s) \text{, then } \textit{policy-stable} \leftarrow \textit{false} \\
&\qquad \text{If } \textit{policy-stable} \text{, then stop and return } Q_\pi \approx q_* \text{ and }\pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
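For a concrete illustration, here is a minimal tabular sketch of the whole procedure in Python. It assumes the same illustrative arrays `P[s, a, s']` and `R[s, a, s']` as above and a deterministic policy $\pi(s)$, so the inner sum $\sum_{a'} Q_\pi(s',a')\,\pi(s',a')$ collapses to $Q_\pi(s', \pi(s'))$. This is a sketch under those assumptions, not the book's reference implementation.

```python
import numpy as np

def policy_iteration_q(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration on action values (Q), for a tabular MDP given by
    P[s, a, s'] (transition probabilities) and R[s, a, s'] (expected rewards).
    These array names are illustrative assumptions."""
    n_states, n_actions, _ = P.shape

    # 1. Initialization: Q arbitrary, pi an arbitrary deterministic policy
    Q = np.zeros((n_states, n_actions))
    pi = np.zeros(n_states, dtype=int)

    while True:
        # 2. Policy evaluation: sweep all (s, a) pairs until Delta < theta
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    q_old = Q[s, a]
                    # Q(s,a) <- sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * Q(s', pi(s')) ]
                    Q[s, a] = np.sum(
                        P[s, a] * (R[s, a] + gamma * Q[np.arange(n_states), pi])
                    )
                    delta = max(delta, abs(q_old - Q[s, a]))
            if delta < theta:
                break

        # 3. Policy improvement: act greedily with respect to Q
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = np.argmax(Q[s])
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return Q, pi  # Q approximates q_*, pi approximates pi_*
```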