Reinforcement Learning Exercise 4.3

License: CC 4.0 BY-SA


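This post works through the iterative update formula for the Q-function in reinforcement learning. Starting from the definition of the state-action value function $Q_\pi$ under a policy $\pi$, it derives the successive-approximation formula in which the Q estimate from the previous iteration is fed back into the right-hand side. This gives a mathematical basis for reasoning about the convergence and stability of reinforcement learning algorithms.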

Exercise 4.3 What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_\pi$ and its successive approximation by a sequence of functions $q_0, q_1, q_2, \dots$?

According to the result of Exercise 3.17, we have:
$Q_\pi(s,a) = \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \bigl[ \sum_{a'} Q_\pi(s',a') \pi(s',a') \bigr] P_{ss'}^a$
Let $Q_k^\pi$ be the previous estimate of $Q_\pi$ and substitute it into the right-hand side of the equation. The next iterate $Q_{k+1}^\pi$ is then:
$Q_{k+1}^\pi(s,a) = \sum_{s'} R_{ss'}^a P_{ss'}^a + \gamma \sum_{s'} \bigl[ \sum_{a'} Q_k^\pi(s',a') \pi(s',a') \bigr] P_{ss'}^a$
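
The update above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not part of the original exercise: the function name `evaluate_q`, the nested-list layout (`P[s][a][s2]` for $P_{ss'}^a$, `R[s][a][s2]` for $R_{ss'}^a$, `pi[s][a]` for $\pi(s,a)$), the all-zero starting estimate $Q_0$, the stopping tolerance, and the two-state example MDP are all hypothetical choices made for illustration.

```python
def evaluate_q(P, R, pi, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Repeatedly apply the Q_{k+1} update for a fixed policy pi.

    Assumed (hypothetical) layout:
      P[s][a][s2]  - probability of moving from s to s2 under action a
      R[s][a][s2]  - expected reward for that transition
      pi[s][a]     - probability that the policy picks a in state s
    """
    n_states = len(P)
    n_actions = len(P[0])
    # Q_0: an arbitrary starting estimate; zeros are used here.
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iters):
        # Inner bracket of the update: sum_{a'} Q_k(s', a') * pi(s', a')
        v = [
            sum(pi[s][a] * q[s][a] for a in range(n_actions))
            for s in range(n_states)
        ]
        # Q_{k+1}(s, a) = sum_{s'} P * (R + gamma * bracket)
        q_next = [
            [
                sum(
                    P[s][a][s2] * (R[s][a][s2] + gamma * v[s2])
                    for s2 in range(n_states)
                )
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        # Largest change between consecutive estimates
        delta = max(
            abs(q_next[s][a] - q[s][a])
            for s in range(n_states)
            for a in range(n_actions)
        )
        q = q_next
        if delta < tol:
            break
    return q


# A made-up two-state, two-action MDP evaluated under a uniform random policy.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 0.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]
q_est = evaluate_q(P, R, pi, gamma=0.9)
```

Folding the reward term and the discounted term into a single sum over $s'$ is algebraically the same as the two separate sums in the formula, since both are weighted by $P_{ss'}^a$. For $\gamma < 1$ the loop stops once consecutive estimates differ by less than `tol`.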