Reinforcement Learning Exercise 4.10

License: CC 4.0 BY-SA

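This post derives the iterative update for action values in reinforcement learning. Starting from the result of Exercise 3.17, it explains how the value iteration update (4.10) for state values leads to the corresponding update for the action-value function $q_{k+1}(s,a)$.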

Exercise 4.10 What is the analog of the value iteration update (4.10) for action values, $q_{k+1}(s, a)$ ?
Use the result of Exercise 3.17:
$Q_\pi(s,a) = \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \bigl[ \sum_{a'} Q_\pi(s',a') \pi(s',a') \bigr] P_{ss'}^a$
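Value iteration acts greedily instead of averaging over a fixed policy, so the expectation over next actions, $\sum_{a'} Q_\pi(s',a') \pi(s',a')$, is replaced by $\max_{a'} q_k(s',a')$. The maximum is taken over the next action $a'$ inside the sum, not over $a$, because $a$ is an argument of $q_{k+1}(s,a)$. This gives the iteration for $q_{k+1}(s,a)$ that is analogous to the value iteration update (4.10):
$q_{k+1}(s, a) = \sum_{s'} P_{ss'}^a \bigl[ R_{ss'}^a + \gamma \max_{a'} q_k(s',a') \bigr]$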
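
As a numerical illustration, here is a minimal Python sketch of value iteration on action values, where each $q_{k+1}(s,a)$ is backed up from the maximum of $q_k(s',\cdot)$ over next actions. The two-state MDP `TOY_MDP`, the function name `q_value_iteration`, and the parameters `tol` and `max_iters` are assumptions made for this example and do not come from the exercise.

```python
def q_value_iteration(P, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iterate q_{k+1}(s,a) = sum_{s'} P[s'|s,a] * (R + gamma * max_{a'} q_k(s',a')).

    P[s][a] is a list of (probability, next_state, reward) triples.
    """
    n_states = len(P)
    # q_0 is initialised to zero for every (state, action) pair.
    q = [[0.0 for _ in P[s]] for s in range(n_states)]
    for _ in range(max_iters):
        q_next = [
            [
                sum(p * (r + gamma * max(q[s2])) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            ]
            for s in range(n_states)
        ]
        # Largest change across all (s, a) pairs in this sweep.
        delta = max(
            abs(q_next[s][a] - q[s][a])
            for s in range(n_states)
            for a in range(len(P[s]))
        )
        q = q_next
        if delta < tol:
            break
    return q


# Hypothetical deterministic MDP with two states and two actions.
TOY_MDP = [
    # state 0: action 0 stays in 0 (reward 0), action 1 moves to 1 (reward 1)
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    # state 1: action 0 moves to 0 (reward 0), action 1 stays in 1 (reward 2)
    [[(1.0, 0, 0.0)], [(1.0, 1, 2.0)]],
]

q_star = q_value_iteration(TOY_MDP, gamma=0.5)
# Greedy action in each state, read directly from the converged q values.
greedy = [max(range(len(row)), key=row.__getitem__) for row in q_star]
```

Because the backup stores a value for every $(s,a)$ pair, the greedy policy can be read off with an argmax over each row of `q_star`, without a further one-step lookahead through the model.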