Exercise 5.5 Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probability $1-p$. Let the reward be $+1$ on all transitions, and let $\gamma = 1$. Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state?
For the first-visit estimator, only the return following the first visit to the state is used. The episode first visits the nonterminal state at $t = 0$, and the return from that point is the full episode return of 10. So:
$$
\begin{aligned}
V(S_{nonterminal}) &= G_0 \\
&= \sum_{t=0}^{9} R_{t+1} \\
&= 10
\end{aligned}
$$

For the every-visit estimator, the returns following every visit are averaged. The nonterminal state is visited at $t = 0, 1, \dots, 9$, and the return following the visit at time $t$ is $G_t = 10 - t$, i.e. $10, 9, \dots, 1$. So:

$$
\begin{aligned}
V(S_{nonterminal}) &= \frac{1}{10} \sum_{t=0}^{9} G_t \\
&= \frac{10 + 9 + \dots + 1}{10} \\
&= \frac{55}{10} = 5.5
\end{aligned}
$$
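As a sanity check, here is a minimal Python sketch (not part of the original exercise) that computes both estimates from the observed episode; the `rewards` list and variable names are illustrative assumptions.

```python
# A sketch of first-visit vs. every-visit Monte Carlo estimation for the
# observed episode: 10 steps, reward +1 per step, gamma = 1. The single
# nonterminal state is visited at t = 0, 1, ..., 9.

rewards = [1] * 10  # assumed encoding: one +1 reward per step; episode return is 10
gamma = 1.0

# Compute the return following each time step t by working backward:
# G_t = R_{t+1} + gamma * G_{t+1}.
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()  # returns[t] == G_t; here [10, 9, ..., 1]

# First-visit: only the return from the first visit (t = 0) is used.
first_visit = returns[0]

# Every-visit: average the returns following all 10 visits.
every_visit = sum(returns) / len(returns)

print(first_visit, every_visit)  # -> 10.0 5.5
```

The gap between the two estimates illustrates the point of the exercise: both converge to the true value as more episodes are observed, but on a single episode the every-visit estimate averages over correlated returns from the same trajectory.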