Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|\mathcal{A}(s)|$. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for $v_*$ on page 80.
The policy iteration algorithm on page 80 (Section 4.3) assumes the policy is deterministic. For a stochastic policy, we can modify the algorithm as follows:
$$
\begin{aligned}
&\text{1 Initialization} \\
&\qquad V(s) \in \mathbb{R} \text{ and } \pi(s) \in \mathcal{A}(s) \text{ arbitrarily for all } s \in \mathcal{S} \\
&\text{2 Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad\qquad \Delta \leftarrow 0 \\
&\qquad\qquad \text{Loop for each } s \in \mathcal{S}: \\
&\qquad\qquad\qquad v \leftarrow V(s) \\
&\qquad\qquad\qquad V(s) \leftarrow \sum_{a}\pi(a \mid s)\sum_{s',r} p(s',r \mid s,a)\Bigl[r + \gamma V(s')\Bigr] \\
&\qquad\qquad\qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\qquad\qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3 Policy Improvement} \\
&\qquad \textit{policy-stable} \leftarrow \textit{true} \\
&\qquad \text{For each } s \in \mathcal{S}: \\
&\qquad\qquad \textit{old-action} \leftarrow \pi(s) \\
&\qquad\qquad \pi(s) \leftarrow \text{argmax}_a \sum_{s',r} p(s',r \mid s,a)\Bigl[r + \gamma V(s')\Bigr] \\
&\qquad\qquad \text{If } \textit{old-action} \neq \pi(s) \text{, then } \textit{policy-stable} \leftarrow \textit{false} \\
&\qquad \text{If } \textit{policy-stable}\text{, then stop and return } V \approx v_* \text{ and } \pi \approx \pi_*\text{; else go to 2.}
\end{aligned}
$$
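Below is a minimal Python sketch of this stochastic-policy version, assuming the model is available as a tabular dictionary `p[(s, a)]` of `(probability, next_state, reward)` triples and `actions[s]` lists the actions available in state `s`; the names `policy_iteration_stochastic`, `states`, `actions`, `gamma`, and `theta` are illustrative rather than taken from the book. Step 3 is still written with a deterministic greedy assignment; the $\epsilon$-soft substitution derived below would replace it with an assignment of probabilities.

```python
def policy_iteration_stochastic(states, actions, p, gamma=0.9, theta=1e-6):
    """Policy iteration with the policy stored as action probabilities pi[s][a].

    `p[(s, a)]` is assumed to be a list of (probability, next_state, reward)
    triples, i.e. a tabular model of p(s', r | s, a).
    """
    # 1 Initialization: arbitrary values, arbitrary stochastic policy
    #   (here: uniform over the actions available in each state).
    V = {s: 0.0 for s in states}
    pi = {s: {a: 1.0 / len(actions[s]) for a in actions[s]} for s in states}

    def q(s, a):
        # One-step lookahead: sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])

    while True:
        # 2 Policy Evaluation:
        #   V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = sum(pi[s][a] * q(s, a) for a in actions[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # 3 Policy Improvement: make the policy greedy with respect to V.
        #   (The epsilon-soft version assigns probabilities here instead;
        #    see the substitution derived below.)
        policy_stable = True
        for s in states:
            old_action = max(pi[s], key=pi[s].get)  # action currently given the highest probability
            new_action = max(actions[s], key=lambda a: q(s, a))
            pi[s] = {a: (1.0 if a == new_action else 0.0) for a in actions[s]}
            if old_action != new_action:
                policy_stable = False

        if policy_stable:
            return V, pi
```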
Because we are restricted to $\epsilon$-soft policies, every action in state $s$ must be selected with probability at least $\epsilon/|\mathcal{A}(s)|$. If each of the $|\mathcal{A}(s)| - 1$ non-greedy actions is given exactly this minimum probability, the total probability of not selecting the greedy action $a = \pi(s)$ is $\frac{\epsilon}{|\mathcal{A}(s)|}\cdot (|\mathcal{A}(s)| - 1)$. So,
$$
\begin{aligned}
\pi(a \mid s) &= 1 - \frac{\epsilon}{|\mathcal{A}(s)|}\cdot (|\mathcal{A}(s)| - 1) \\
&= 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}
\end{aligned}
$$
Substituting these probabilities into the evaluation step of the algorithm above, $\pi(a \mid s) = 1 - \epsilon + \epsilon/|\mathcal{A}(s)|$ when $a = \pi(s)$ and $\pi(a \mid s) = \epsilon/|\mathcal{A}(s)|$ otherwise, gives the final result: $\pi(s)$ now denotes the action that receives the large probability rather than the only action ever taken.
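As a quick sanity check of the substitution, here is a small hypothetical helper (not part of the book's algorithm) that builds the $\epsilon$-soft action probabilities for a given greedy action; with $\epsilon = 0.1$ and $|\mathcal{A}(s)| = 4$ the greedy action gets $1 - 0.1 + 0.025 = 0.925$ and the other three get $0.025$ each, summing to 1.

```python
def epsilon_soft_probs(n_actions, greedy_index, epsilon):
    """Action probabilities of an epsilon-greedy (hence epsilon-soft) policy:
    the greedy action gets 1 - epsilon + epsilon/n, every other action epsilon/n."""
    probs = [epsilon / n_actions] * n_actions
    probs[greedy_index] = 1 - epsilon + epsilon / n_actions
    return probs

# Example with epsilon = 0.1 and |A(s)| = 4:
print(epsilon_soft_probs(4, greedy_index=0, epsilon=0.1))
# -> [0.925, 0.025, 0.025, 0.025]   (greedy: 1 - 0.1 + 0.025 = 0.925; sums to 1)
```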