Policy Improvement and Policy Iteration


From the last post, we know how to evaluate a policy. But that is not enough, because the purpose of policy evaluation is to improve policies so that we can finally obtain the optimal policy. So in this post, we will discuss how to improve a given policy, and how to get from a given policy to the optimal policy.

 

Firstly, once a policy has been evaluated, the Action-Value function is known for every state. That is, at a certain state s, we know which action gives the system the largest expected return.
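As a sketch in the usual MDP notation (the reward function R, transition probabilities P and discount factor γ are the standard symbols, not defined in this post), the one-step lookahead that turns state values into action values is:

$$ q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') $$

So once v_π is known from policy evaluation, the action value of every action in every state follows directly.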

In the puzzle wandering example, we evaluated the random policy. The resulting State-Value function can then be used for policy improvement. After one step of calculation (k = 1), we can already conclude that at the circled location, moving left is better than picking a direction at random, because the left side has higher value.

After three steps (k = 3), we have a much clearer picture of the map, and we can replace the random policy with a new, better one.

 

The way to improve the current policy is to greedily pick an action for every state. It is worth noting that greedy picking does not mean the algorithm considers only one step (too greedy to look ahead). Instead, when k = 3 the evaluated values already summarize three steps of lookahead, so the greedy choice selects the best action with respect to those k steps.
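In the same notation, a sketch of the greedy improvement step is simply:

$$ \pi'(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} \; q_\pi(s, a) $$

The new policy acts deterministically, taking the action with the highest action value in each state.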

 

The Policy Iteration Algorithm keeps alternating evaluation and improvement steps until the policy becomes stable:
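One way to sketch this loop is as an alternating chain of evaluation (E) and improvement (I) steps, ending at the optimal policy and value function:

$$ \pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_* $$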

 

This process means that, in every state, the improved policy picks the single action whose Action-Value is the best return available under the current policy:
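A sketch of this condition in the usual notation, together with the guarantee that the greedy choice is never worse than the old policy:

$$ q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \;\ge\; q_\pi(s, \pi(s)) = v_\pi(s) $$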

 

The algorithm is:
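A minimal Python sketch of tabular policy iteration, assuming a hypothetical transition model P[s][a] that returns a list of (probability, next_state, reward) tuples and a discount factor gamma; these names are placeholders for illustration, not from the original post:

```python
# Minimal sketch of tabular policy iteration.
# Assumptions: states and actions are integers 0..N-1,
# P[s][a] is a list of (prob, next_state, reward) tuples.

def policy_evaluation(policy, P, num_states, gamma=0.9, theta=1e-6):
    """Iteratively evaluate a deterministic policy until values converge."""
    V = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            a = policy[s]
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(V, P, num_states, num_actions, gamma=0.9):
    """Greedily pick the best one-step-lookahead action in every state."""
    policy = []
    for s in range(num_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(num_actions)]
        policy.append(max(range(num_actions), key=lambda a: q[a]))
    return policy

def policy_iteration(P, num_states, num_actions, gamma=0.9):
    """Alternate evaluation and improvement until the policy is stable."""
    policy = [0] * num_states          # start from an arbitrary policy
    while True:
        V = policy_evaluation(policy, P, num_states, gamma)
        new_policy = policy_improvement(V, P, num_states, num_actions, gamma)
        if new_policy == policy:       # stable: no state changes its action
            return policy, V
        policy = new_policy
```

When the improvement step leaves every state's action unchanged, the greedy condition above holds with equality, which is exactly the Bellman optimality condition, so the returned policy is optimal.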

Reposted from: https://www.cnblogs.com/rhyswang/p/11174493.html
