Reinforcement Learning: Explanation of Formula (5.2)

Latest recommended article published on 2022-10-30 12:00:28

License: CC 4.0 BY-SA

The book does not explain formula (5.2) clearly, and the second and third lines of the formula on page 101 confused me. So here I work through them step by step.
First, let $A^* = \arg\max_a q_\pi(s,a)$ be the greedy action in state $s$, and let $\pi'$ be the $\epsilon$-greedy policy with respect to $q_\pi$:
$\pi'(a \mid s) = \begin{cases} 1 - \epsilon + \epsilon / | \mathcal A(s) | & \text{if } a = A^* \\ \epsilon / | \mathcal A(s) | & \text{if } a \neq A^* \end{cases}$
Then
$\begin{aligned} q_\pi(s, \pi'(s)) &= \sum_a \pi'(a \mid s) q_\pi(s,a) \\ &= \sum_{a \neq A^*} \frac {\epsilon} {| \mathcal A(s) |} q_\pi(s,a) + \Bigl( 1 - \epsilon + \frac{\epsilon}{| \mathcal A(s) |} \Bigr) q_\pi(s, A^*) \\ &= \frac {\epsilon} {| \mathcal A(s) |} \sum_{a \neq A^*} q_\pi(s,a) + \frac {\epsilon} {| \mathcal A(s) |} q_\pi(s, A^*) + (1-\epsilon) q_\pi(s, A^*) \\ &= \frac {\epsilon} {| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \qquad \text{this is the second line of formula (5.2)} \end{aligned}$
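The identity above can be checked numerically for a single state. The following is a minimal sketch, assuming the action values of that state are given as a plain list; the helper names (`epsilon_greedy_probs`, `lhs_expectation`, `second_line`) are made up for this illustration and are not from the book.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi'(a|s): the epsilon-greedy probabilities w.r.t. q_values."""
    n = len(q_values)
    # A*: index of the greedy action (first maximiser if there are ties).
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # every action gets epsilon / |A(s)|
    probs[greedy] += 1.0 - epsilon     # the greedy action gets the rest
    return probs

def lhs_expectation(q_values, epsilon):
    """sum_a pi'(a|s) q(s,a), the first line of the derivation."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return sum(p * q for p, q in zip(probs, q_values))

def second_line(q_values, epsilon):
    """epsilon/|A(s)| * sum_a q(s,a) + (1 - epsilon) * max_a q(s,a)."""
    n = len(q_values)
    return epsilon / n * sum(q_values) + (1.0 - epsilon) * max(q_values)

# Hypothetical action values for one state, chosen at random.
random.seed(0)
q = [random.uniform(-1.0, 1.0) for _ in range(4)]
eps = 0.1
print(lhs_expectation(q, eps), second_line(q, eps))
```

Both printed numbers should agree up to floating-point rounding, whatever values are placed in `q`.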
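Next, consider the value
$x = \sum_a \Bigl [ \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} \Bigr ] q_\pi(s,a) = (1-\epsilon) \sum_a \frac { \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon} q_\pi(s,a)$
Here $\pi$ is the original $\epsilon$-soft policy, so $\pi(a \mid s) \geq \epsilon / | \mathcal A(s) |$ for every action $a$. The weights in the second sum are therefore nonnegative, and they sum to 1:
$\sum_a \frac { \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon} = \frac { \sum_a \pi(a \mid s) - \epsilon }{ 1 - \epsilon } = \frac {1 - \epsilon}{1 - \epsilon} = 1$
A weighted average with nonnegative weights that sum to 1 cannot exceed the largest value being averaged, so
$x \leq (1-\epsilon) \max_a q_\pi(s,a)$
In the special case where $\pi$ is itself $\epsilon$-greedy with respect to $q_\pi$, we have $\pi(a \mid s) = \epsilon / | \mathcal A(s) |$ for every $a \neq A^*$ (with $A^*$ the greedy action), only the $a = A^*$ term survives, and the bound holds with equality:
$\begin{aligned} x &= \Bigl [ \pi(A^* \mid s) - \frac {\epsilon}{| \mathcal A(s) |} \Bigr ] q_\pi(s, A^*) \\ &= \Bigl [ 1 - \epsilon + \frac {\epsilon}{| \mathcal A(s) |} - \frac {\epsilon}{| \mathcal A(s) |} \Bigr ] q_\pi(s, A^*) \\ &= ( 1 - \epsilon) q_\pi(s, A^*) \\ &= (1-\epsilon) \max_a q_\pi(s,a) \end{aligned}$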
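The key fact behind the bound is that, for any $\epsilon$-soft policy, the normalised weights $[\pi(a \mid s) - \epsilon/|\mathcal A(s)|]/(1-\epsilon)$ are nonnegative and sum to 1. Below is a small sketch that checks this for a randomly generated $\epsilon$-soft policy. The sampling scheme and the function names are assumptions made for illustration only, and the action values are negative on purpose so that the check does not rely on $q_\pi$ being nonnegative.

```python
import random

def random_epsilon_soft_policy(n_actions, epsilon, rng):
    """Sample probabilities with pi(a|s) >= epsilon / n_actions for every a."""
    raw = [rng.random() for _ in range(n_actions)]
    total = sum(raw)
    # Uniform floor of epsilon/n plus an arbitrary distribution of mass 1 - epsilon.
    return [epsilon / n_actions + (1.0 - epsilon) * r / total for r in raw]

def convex_weights(policy, epsilon):
    """Return [pi(a|s) - epsilon/|A(s)|] / (1 - epsilon) for every action."""
    n = len(policy)
    return [(p - epsilon / n) / (1.0 - epsilon) for p in policy]

rng = random.Random(1)
eps = 0.2
pi = random_epsilon_soft_policy(5, eps, rng)
w = convex_weights(pi, eps)

# Hypothetical action values, all negative.
q = [-3.0, -1.0, -2.5, -4.0, -1.5]
x = sum((p - eps / len(pi)) * qa for p, qa in zip(pi, q))

print("sum of weights:", sum(w))
print("smallest weight:", min(w))
print("x:", x, " bound (1 - eps) * max q:", (1.0 - eps) * max(q))
```

The weights should sum to 1 up to rounding, none should be negative, and `x` should not exceed `(1 - eps) * max(q)`.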
$\begin{aligned} \therefore q_\pi(s, \pi'(s)) &= \frac {\epsilon} {| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \max_a q_\pi(s,a) \\ & \geq \frac {\epsilon} {| \mathcal A(s) |} \sum_{a} q_\pi(s,a) + (1-\epsilon) \sum_a \frac { \pi(a \mid s) - \frac {\epsilon}{| \mathcal A(s) |} }{ 1 - \epsilon}q_\pi(s,a) \end{aligned}$
This is the third line of formula (5.2).
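As a sanity check on the whole chain, note that the right-hand side of the inequality collapses to $\sum_a \pi(a \mid s) q_\pi(s,a) = v_\pi(s)$ once the $\epsilon/|\mathcal A(s)|$ terms cancel, so the $\epsilon$-greedy value should never fall below $v_\pi(s)$. The sketch below tests this on many random single-state examples. The function name `improvement_gap`, the sampling ranges, and the way the $\epsilon$-soft policies are generated are all illustrative assumptions.

```python
import random

def improvement_gap(q_values, policy, epsilon):
    """Return q_pi(s, pi'(s)) - v_pi(s) for one state."""
    n = len(q_values)
    # Second line of (5.2): value of acting epsilon-greedily w.r.t. q_pi.
    greedy_value = epsilon / n * sum(q_values) + (1.0 - epsilon) * max(q_values)
    # v_pi(s) = sum_a pi(a|s) q_pi(s,a): value of following pi itself.
    v_pi = sum(p * q for p, q in zip(policy, q_values))
    return greedy_value - v_pi

rng = random.Random(42)
eps = 0.3
gaps = []
for _ in range(1000):
    n = rng.randint(2, 6)
    q = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    # Random epsilon-soft policy: every action has probability >= eps / n.
    pi = [eps / n + (1.0 - eps) * r / total for r in raw]
    gaps.append(improvement_gap(q, pi, eps))

print("smallest gap over all trials:", min(gaps))
```

The smallest gap should be nonnegative up to floating-point rounding, which matches the inequality in formula (5.2).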