Reinforcement Learning Exercise 4.6

Exercise 4.6 Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|\mathcal A(s)|$. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for $v_*$ on page 80.

The algorithm on page 80 (Section 4.3, Policy Iteration) assumes a deterministic policy. For the stochastic, $\epsilon$-soft case we can modify it as follows, writing $\pi(s)$ for the action that the $\epsilon$-soft policy $\pi$ favors in state $s$:

$$
\begin{aligned}
&\text{1 Initialization} \\
&\qquad V(s) \in \mathbb R \text{ and } \pi(s) \in \mathcal A(s) \text{ arbitrarily for all } s \in \mathcal S \text{ (so the initial policy is an arbitrary } \epsilon\text{-soft policy)} \\
&\text{2 Policy Evaluation} \\
&\qquad \text{Loop:} \\
&\qquad\qquad \Delta \leftarrow 0 \\
&\qquad\qquad \text{Loop for each } s \in \mathcal S: \\
&\qquad\qquad\qquad v \leftarrow V(s) \\
&\qquad\qquad\qquad V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma V(s')\bigr] \\
&\qquad\qquad\qquad \Delta \leftarrow \max(\Delta, |v - V(s)|) \\
&\qquad\qquad \text{until } \Delta < \theta \text{ (a small positive number determining the accuracy of estimation)} \\
&\text{3 Policy Improvement} \\
&\qquad policy\text{-}stable \leftarrow true \\
&\qquad \text{For each } s \in \mathcal S: \\
&\qquad\qquad old\text{-}action \leftarrow \pi(s) \\
&\qquad\qquad \pi(s) \leftarrow \operatorname{argmax}_a \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma V(s')\bigr] \\
&\qquad\qquad \text{If } old\text{-}action \neq \pi(s) \text{, then } policy\text{-}stable \leftarrow false \\
&\qquad \text{If } policy\text{-}stable \text{, then stop and return } V \approx v_* \text{ and } \pi \approx \pi_* \text{; else go to 2.}
\end{aligned}
$$
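To make the evaluation sweep in step 2 concrete, here is a minimal Python sketch (not from the book) of iterative policy evaluation for a general stochastic policy. The function name `evaluate_policy`, the policy array `pi[s, a]`, and the transition-table format `P[s][a] = [(prob, next_state, reward), ...]` are assumptions made for this sketch:

```python
import numpy as np

def evaluate_policy(pi, P, gamma=0.9, theta=1e-8):
    """Step 2 (policy evaluation) for a stochastic policy.

    pi[s, a]  : probability of selecting action a in state s, i.e. pi(a|s)
    P[s][a]   : list of (prob, next_state, reward) triples giving p(s', r | s, a)
    Returns V with V[s] approximating v_pi(s).
    """
    n_states, n_actions = pi.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * V(s')]
            V[s] = sum(
                pi[s, a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
```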

Because only $\epsilon$-soft policies are allowed, every action must receive probability at least $\epsilon/|\mathcal A(s)|$. To keep the improved policy as close to greedy as possible, each of the $|\mathcal A(s)| - 1$ non-greedy actions is given exactly this minimum, so the total probability not assigned to the greedy action $\pi(s)$ is $\frac{\epsilon}{|\mathcal A(s)|}\cdot(|\mathcal A(s)| - 1)$. So, for the greedy action $a = \pi(s)$,
$$
\begin{aligned}
\pi(a \mid s) &= 1 - \frac{\epsilon}{|\mathcal A(s)|}\cdot\bigl(|\mathcal A(s)| - 1\bigr) \\
&= 1 - \epsilon + \frac{\epsilon}{|\mathcal A(s)|},
\end{aligned}
$$
while every other action keeps probability $\epsilon/|\mathcal A(s)|$. Substituting these probabilities $\pi(a \mid s)$ into the policy-evaluation update above gives the final result.
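As a sketch of how the improvement step (step 3) changes under the $\epsilon$-soft restriction, the snippet below builds the new policy by giving the greedy action probability $1 - \epsilon + \epsilon/|\mathcal A(s)|$ and every other action $\epsilon/|\mathcal A(s)|$. The name `improve_policy_eps_soft` and the `P[s][a] = [(prob, next_state, reward), ...]` format are the same assumptions as in the evaluation sketch above:

```python
import numpy as np

def improve_policy_eps_soft(V, P, n_actions, epsilon, gamma=0.9):
    """Step 3 (policy improvement) restricted to epsilon-soft policies.

    V[s]    : current value estimates
    P[s][a] : list of (prob, next_state, reward) triples giving p(s', r | s, a)
    Returns (pi, greedy): the new epsilon-soft policy pi[s, a] and the action
    greedy[s] that it favors with probability 1 - eps + eps/|A(s)|.
    """
    n_states = len(P)
    # every action gets at least eps/|A(s)| ...
    pi = np.full((n_states, n_actions), epsilon / n_actions)
    greedy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        greedy[s] = int(np.argmax(q))
        # ... and the greedy action gets the remaining 1 - eps,
        # for a total of 1 - eps + eps/|A(s)|
        pi[s, greedy[s]] += 1.0 - epsilon
    return pi, greedy
```

Policy iteration then alternates `evaluate_policy` and `improve_policy_eps_soft`, stopping when the favored action `greedy[s]` no longer changes in any state.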
