The book does not explain formula (5.2) clearly, and the second and third lines of the formula on page 101 confused me. Here I work through both lines step by step.
First,
expand $q_\pi(s, \pi'(s))$ as an expectation over the actions chosen by $\pi'$:

$$
q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s) q_\pi(s,a)
$$

Because $\pi'$ is the $\epsilon$-greedy policy with respect to $q_\pi$, with greedy action $A^* = \arg\max_a q_\pi(s,a)$, we have

$$
\pi'(a \mid s) =
\begin{cases}
1 - \epsilon + \epsilon / | \mathcal A(s) | & \text{if } a = A^* \\
\epsilon / | \mathcal A(s) | & \text{if } a \neq A^*
\end{cases}
$$

$$
\begin{aligned}
\therefore q_\pi(s, \pi'(s)) &= \sum_{a \neq A^*} \frac{\epsilon}{| \mathcal A(s) |} q_\pi(s,a) + \Bigl( 1 - \epsilon + \frac{\epsilon}{| \mathcal A(s) |} \Bigr) q_\pi(s, A^*) \\
&= \frac{\epsilon}{| \mathcal A(s) |} \sum_{a \neq A^*} q_\pi(s,a) + \frac{\epsilon}{| \mathcal A(s) |} q_\pi(s, A^*) + (1-\epsilon) q_\pi(s, A^*) \\
&= \frac{\epsilon}{| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a)
\end{aligned}
$$

This is the second line of formula (5.2).
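The identity in the second line can be checked numerically. The snippet below is only a sanity check, not part of the book: the action values `q`, the number of actions, and the value of `epsilon` are made-up numbers for a single hypothetical state.

```python
import random

random.seed(0)
n_actions = 5    # |A(s)|, an arbitrary choice for illustration
epsilon = 0.1    # exploration rate, also arbitrary

# Hypothetical action values q_pi(s, a) for one fixed state s.
q = [random.gauss(0.0, 1.0) for _ in range(n_actions)]
greedy = max(range(n_actions), key=lambda a: q[a])  # index of A*

# epsilon-greedy policy pi'(a|s): epsilon/|A(s)| everywhere,
# plus the remaining 1 - epsilon on the greedy action.
pi_prime = [epsilon / n_actions] * n_actions
pi_prime[greedy] += 1.0 - epsilon

# First line of (5.2): sum_a pi'(a|s) q_pi(s, a)
lhs = sum(p * qa for p, qa in zip(pi_prime, q))
# Second line of (5.2): eps/|A(s)| * sum_a q + (1 - eps) * max_a q
rhs = epsilon / n_actions * sum(q) + (1.0 - epsilon) * max(q)

print(lhs, rhs)
```

The two printed numbers should agree up to floating-point rounding, whatever values are chosen for `q`.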
Next, consider the quantity $x$ defined by

$$
x = \sum_a \Bigl[ \pi(a \mid s) - \frac{\epsilon}{| \mathcal A(s) |} \Bigr] q_\pi(s,a)
$$
Because $\pi$ is $\epsilon$-soft, $\pi(a \mid s) \geq \epsilon / | \mathcal A(s) |$ for every action $a$, so every coefficient in brackets is nonnegative. (Only when $\pi$ is itself $\epsilon$-greedy does $\pi(a \mid s) = \epsilon / | \mathcal A(s) |$ hold for all $a \neq A^*$; in that special case $x = (1-\epsilon) q_\pi(s, A^*)$ exactly.) In general, factor out $(1-\epsilon)$:

$$
x = (1-\epsilon) \sum_a \frac{ \pi(a \mid s) - \frac{\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon } q_\pi(s,a)
$$

The weights $\bigl( \pi(a \mid s) - \epsilon / | \mathcal A(s) | \bigr) / (1-\epsilon)$ are nonnegative and sum to 1, since $\sum_a \pi(a \mid s) = 1$ and $\sum_a \epsilon / | \mathcal A(s) | = \epsilon$. The sum is therefore a weighted average of the values $q_\pi(s,a)$, and a weighted average cannot exceed the largest value averaged:

$$
x \leq (1-\epsilon) \max_a q_\pi(s,a)
$$
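The bound on $x$ for an arbitrary $\epsilon$-soft policy $\pi$ (one with $\pi(a \mid s) \geq \epsilon / | \mathcal A(s) |$ for all $a$) can also be checked numerically. This is a rough sketch with made-up numbers: the policy, the action values (deliberately allowed to be negative), and `epsilon` are all assumptions for illustration.

```python
import random

random.seed(1)
n_actions = 4    # |A(s)|, arbitrary
epsilon = 0.2    # arbitrary

# Hypothetical action values q_pi(s, a); negative values are allowed.
q = [random.uniform(-5.0, 5.0) for _ in range(n_actions)]

# Build an epsilon-soft policy: a uniform floor of epsilon/|A(s)|
# plus (1 - epsilon) times an arbitrary probability distribution.
raw = [random.random() for _ in range(n_actions)]
total = sum(raw)
base = [r / total for r in raw]
pi = [epsilon / n_actions + (1.0 - epsilon) * b for b in base]

# Weights (pi(a|s) - eps/|A(s)|) / (1 - eps) from the rewritten form of x.
weights = [(p - epsilon / n_actions) / (1.0 - epsilon) for p in pi]

# x = sum_a [pi(a|s) - eps/|A(s)|] q_pi(s, a)
x = sum((p - epsilon / n_actions) * qa for p, qa in zip(pi, q))
bound = (1.0 - epsilon) * max(q)

print(sum(weights), x, bound)
```

The weights should sum to 1 (up to rounding) and `x` should not exceed `bound`, including when some or all of the action values are negative.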
$$
\begin{aligned}
\therefore q_\pi(s, \pi'(s)) &= \frac{\epsilon}{| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \\
&\geq \frac{\epsilon}{| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac{ \pi(a \mid s) - \frac{\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon } q_\pi(s,a)
\end{aligned}
$$
This is the third line of formula (5.2), which should now be clear.
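For completeness, here is a sketch of how the right-hand side of the third line simplifies, assuming the usual identity $v_\pi(s) = \sum_a \pi(a \mid s) q_\pi(s,a)$. The factor $(1-\epsilon)$ cancels and the two $\epsilon$ terms cancel each other:

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &\geq \frac{\epsilon}{| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac{ \pi(a \mid s) - \frac{\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon } q_\pi(s,a) \\
&= \frac{\epsilon}{| \mathcal A(s) |} \sum_{a} q_\pi(s,a) - \frac{\epsilon}{| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + \sum_a \pi(a \mid s) q_\pi(s,a) \\
&= v_\pi(s)
\end{aligned}
$$

So $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for every state $s$, which is the condition the policy improvement theorem requires.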
Reinforcement Learning--Explanation to Formula (5.2)