Policy Gradient Methods in Reinforcement Learning
1. Introduction to Temporal Difference and Policy Gradient Concepts
In reinforcement learning, understanding the concepts of temporal difference and policy gradient methods is crucial.
1.1 Temporal Difference Learning
When $\lambda = 1$, the approach is equivalent to using Monte Carlo evaluations to compute the ground truth: new error information fully corrects past mistakes without discounting, yielding an unbiased estimate. Here, $\lambda$ controls how far back new errors propagate (the decay of the eligibility traces), while $\gamma$ discounts future rewards inside the TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. In other words, $\lambda$ is algorithm-specific, and $\gamma$ is environment-specific.
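To see the equivalence, recall the $\lambda$-return for an episode ending at time $T$, written here in standard TD($\lambda$) notation (not taken from the original text), which mixes the $n$-step returns $G_t^{(n)}$ with geometrically decaying weights:

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t.$$

At $\lambda = 1$ the weighted sum vanishes and only the final term survives, leaving exactly the full Monte Carlo return $G_t$.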
Using $\lambda = 1$ or Monte Carlo returns comes at a cost: the unbiased estimate has high variance, whereas smaller values of $\lambda$ bootstrap on learned value estimates, trading variance for bias.
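As a concrete illustration, below is a minimal sketch of tabular TD($\lambda$) policy evaluation with accumulating eligibility traces. The function name, episode format, and default hyperparameters are illustrative assumptions, not from the original article. Setting `lam=1.0` makes each episode's accumulated updates approximate a full Monte Carlo correction, while `gamma` enters only through the TD error.

```python
import numpy as np

def td_lambda_evaluate(episodes, n_states, alpha=0.1, gamma=0.99, lam=1.0):
    """Tabular TD(lambda) policy evaluation with accumulating eligibility traces.

    `episodes` is a list of trajectories, each a list of
    (state, reward, next_state, done) tuples collected under the policy
    being evaluated. With lam=1.0, the per-episode updates approximate
    the unbiased (high-variance) Monte Carlo correction.
    """
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)  # eligibility traces, reset each episode
        for state, reward, next_state, done in episode:
            # TD error: gamma discounts the environment's future rewards
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]
            # lambda controls how far back the new error corrects old states
            e *= gamma * lam
            e[state] += 1.0
            V += alpha * delta * e
    return V
```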