Exercise 4.5 How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$. Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.
Here, we can use the result of exercise 3.17:
$$
Q_\pi(s,a) = \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\, \pi(s',a') \Bigr] P_{ss'}^a
$$
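As a small sketch of how this equation becomes an expected update, here is the backup for a single $(s,a)$ pair in Python. The array names `P[s, a, s']`, `R[s, a, s']`, and `pi[s, a]` are illustrative assumptions (a tabular MDP), not notation from the book.

```python
import numpy as np

# Assumed tabular arrays (illustrative, not from the book):
#   P[s, a, s1]  transition probability P^a_{ss'}
#   R[s, a, s1]  expected reward R^a_{ss'}
#   pi[s, a]     probability the policy selects a in s
#   Q            array of shape (n_states, n_actions)

def q_backup(Q, P, R, pi, gamma, s, a):
    """One expected update of Q_pi(s, a) from the equation above."""
    # E_{a' ~ pi}[Q(s', a')] for every successor state s'
    expected_next_q = (pi * Q).sum(axis=1)
    # sum over s' of P^a_{ss'} * (R^a_{ss'} + gamma * expected_next_q[s'])
    return np.sum(P[s, a] * (R[s, a] + gamma * expected_next_q))
```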
Then the algorithm analogous to that on page 80 looks like this:
$$
\begin{aligned}
&\text{1 Initialization} \\
&\qquad Q_\pi(s, a) \in \mathbb R \text{ and } \pi(s) \in \mathcal A(s) \text{ arbitrarily for all } s \in \mathcal S \text{ and } a \in \mathcal A(s) \\
&\text{2 Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad \qquad \Delta \leftarrow 0 \\
&\qquad \qquad \text{Loop for each }(s,a)\text{ pair, } s \in \mathcal S \text{ and } a \in \mathcal A(s): \\
&\qquad \qquad \qquad q \leftarrow Q_\pi(s,a) \\
&\qquad \qquad \qquad Q_\pi(s,a) \leftarrow \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a')\, \pi(s',a') \Bigr] P_{ss'}^a \\
&\qquad \qquad \qquad \Delta \leftarrow \max (\Delta , |q-Q_\pi(s,a)|) \\
&\qquad \text{until }\Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3 Policy Improvement} \\
&\qquad \textit{policy-stable} \leftarrow \textit{true} \\
&\qquad \text{For each } s \in \mathcal S: \\
&\qquad \qquad \textit{old-action} \leftarrow \pi(s) \\
&\qquad \qquad \pi(s) \leftarrow \operatorname{argmax}_{a} Q_\pi(s,a) \\
&\qquad \qquad \text{If } \textit{old-action} \neq \pi(s) \text{, then } \textit{policy-stable} \leftarrow \textit{false} \\
&\qquad \text{If } \textit{policy-stable} \text{, then stop and return } Q_\pi \approx q_* \text{ and }\pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
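For a concrete illustration, here is a minimal tabular sketch of the whole procedure in Python. It assumes the same illustrative arrays `P[s, a, s']` and `R[s, a, s']` as above and a deterministic policy $\pi(s)$, so the inner sum $\sum_{a'} Q_\pi(s',a')\,\pi(s',a')$ collapses to $Q_\pi(s', \pi(s'))$. This is a sketch under those assumptions, not the book's reference implementation.

```python
import numpy as np

def policy_iteration_q(P, R, gamma=0.9, theta=1e-8):
    """Policy iteration on action values (Q), for a tabular MDP given by
    P[s, a, s'] (transition probabilities) and R[s, a, s'] (expected rewards).
    These array names are illustrative assumptions."""
    n_states, n_actions, _ = P.shape

    # 1. Initialization: Q arbitrary, pi an arbitrary deterministic policy
    Q = np.zeros((n_states, n_actions))
    pi = np.zeros(n_states, dtype=int)

    while True:
        # 2. Policy evaluation: sweep all (s, a) pairs until Delta < theta
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    q_old = Q[s, a]
                    # Q(s,a) <- sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * Q(s', pi(s')) ]
                    Q[s, a] = np.sum(
                        P[s, a] * (R[s, a] + gamma * Q[np.arange(n_states), pi])
                    )
                    delta = max(delta, abs(q_old - Q[s, a]))
            if delta < theta:
                break

        # 3. Policy improvement: act greedily with respect to Q
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = np.argmax(Q[s])
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return Q, pi  # Q approximates q_*, pi approximates pi_*
```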