Reinforcement Learning Overview
Articles in this column
[Chapter 6] Reinforcement Learning (4) Policy Search
In the previous sections, we tried to learn the utility function or, more usually, the action-value function, and greedily select the action with the highest Q-value: $\pi(s) = \arg\max_a Q(s, a)$. This means that once… (Original · 2021-05-30 11:50:30 · 299 views · 0 comments)
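To make the greedy rule $\pi(s) = \arg\max_a Q(s, a)$ concrete, here is a minimal Python sketch; the Q-table shape and the NumPy representation are illustrative assumptions, not taken from the article:

```python
import numpy as np

# Hypothetical tabular Q-function: one row per state, one column per action.
num_states, num_actions = 5, 3
Q = np.zeros((num_states, num_actions))

def greedy_policy(Q, s):
    """pi(s) = argmax_a Q(s, a): pick the action with the highest Q-value."""
    return int(np.argmax(Q[s]))
```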
[Chapter 5] Reinforcement Learning (3) Function Approximation
Function Approximation: While we are learning the Q-functions, how do we represent or record the Q-values? For a discrete and finite state space and action space, we can use a big table of size $|S| \times |A|$ to represent the Q-values for… (Original · 2021-05-30 10:04:46 · 243 views · 0 comments)
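As a rough illustration of the two representations this excerpt contrasts, the sketch below shows a $|S| \times |A|$ table alongside a linear approximator $\hat{Q}(s, a) = w \cdot \phi(s, a)$; the linear form and the feature map `phi` are common textbook choices assumed here, not necessarily the article's:

```python
import numpy as np

# Tabular representation: one cell per (state, action) pair,
# so memory grows as |S| x |A|. Sizes here are placeholders.
S, A = 10, 4
Q_table = np.zeros((S, A))

# When the state space is huge or continuous, a parametric
# approximation Q(s, a) ~ w . phi(s, a) replaces the table.
w = np.zeros(8)                      # learned weight vector (assumed 8 features)
def q_hat(phi, s, a):
    return w @ phi(s, a)             # phi: user-supplied feature map -> R^8
```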
[Chapter 4] Reinforcement Learning (2) Model-Free Method
Model-Free RL Method: In a model-based method, we first need to model the environment by learning/estimating the transition and reward functions. In a model-free method, however, we consider learning the value/utility functions $V(s)$ or $U(s)$, or ac… (Original · 2021-05-29 14:42:02 · 341 views · 0 comments)
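Q-learning is one standard model-free method that fits this description; the sketch below (with made-up step size and discount factor) updates Q from a single sampled transition, with no transition or reward model anywhere:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step from the sampled experience (s, a, r, s')."""
    td_target = r + gamma * max(Q[s_next])      # bootstrapped one-step target
    Q[s][a] += alpha * (td_target - Q[s][a])    # move Q(s, a) toward the target
```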
[Chapter 3] Reinforcement Learning (1) Model-Based Method
Reinforcement Learning: First, we assume that all environments in the following materials are modeled by Markov decision processes. As we know, an MDP model can be represented by a tuple $(S, A, T, R)$; the rewards are returned… (Original · 2021-05-29 12:54:12 · 327 views · 0 comments)
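A minimal sketch of the model-learning step the model-based method relies on: estimate $T(s, a, s')$ and $R(s, a)$ by counting observed transitions and averaging observed rewards (maximum-likelihood estimates; the data structures are assumptions for illustration):

```python
from collections import defaultdict

counts = defaultdict(int)        # counts[(s, a, s_next)] = N(s, a, s')
visits = defaultdict(int)        # visits[(s, a)]         = N(s, a)
reward_sum = defaultdict(float)  # reward_sum[(s, a)]     = total reward seen

def record(s, a, r, s_next):
    """Log one observed transition."""
    counts[(s, a, s_next)] += 1
    visits[(s, a)] += 1
    reward_sum[(s, a)] += r

def T_hat(s, a, s_next):
    """Empirical transition probability T(s, a, s')."""
    return counts[(s, a, s_next)] / visits[(s, a)]

def R_hat(s, a):
    """Average observed reward R(s, a)."""
    return reward_sum[(s, a)] / visits[(s, a)]
```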
[Chapter 2] Value Iteration and Policy Iteration
We now know that the most important step in computing an optimal policy is computing the value function. But how? (The following contents are all based on infinite-horizon problems.) The solutions to this problem can be roughly divided into two categories: Va… (Original · 2021-05-28 23:03:22 · 512 views · 1 comment)
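For the first of those two categories, value iteration, here is a minimal sketch that repeats the Bellman optimality backup $V(s) \leftarrow \max_a \big[ R(s, a) + \gamma \sum_{s'} T(s, a, s') V(s') \big]$ until convergence; the dict-based MDP encoding and the tolerance are assumptions for illustration:

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    """S: iterable of states; A: iterable of actions;
    T[s][a][s2]: transition probability; R[s][a]: reward."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = max(R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in S)
                        for a in A)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:       # stop once no state value moved by more than eps
            return V
```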
[Chapter 1] Markov Decision Process and Value Function
Markov Decision Process: One of the most important problems in decision making is making sequential decisions, which is also what the agent's utility depends on. At each time step, the agent selects an action to interact with the environment and makes it tran… (Original · 2021-05-28 14:32:56 · 357 views · 0 comments)
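The agent's utility over such a sequence of decisions is commonly taken to be the discounted sum of rewards, $U = \sum_t \gamma^t r_t$; a one-line sketch (the discount factor $\gamma$ is an assumed illustration, not a value from the article):

```python
def discounted_return(rewards, gamma=0.9):
    """U = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))
```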