Value Iteration
Background
In policy iteration, the policy is improved only after policy evaluation has converged. In fact, the value function does not have to converge: even without the last several sweeps of policy evaluation, policy improvement produces the same policy. Therefore we do not actually need to run policy evaluation to convergence.
Definition
We can run a few sweeps of policy evaluation, switch to policy improvement, and then return to policy evaluation, repeating until the policy converges to the optimal policy.
In the extreme case, we can perform policy improvement right after a single update of a single state. Truncating evaluation to one sweep and folding the greedy improvement into the same backup is value iteration.
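Written in the standard MDP notation (p is the transition model, γ the discount rate), this combined evaluation-plus-improvement backup is

$$
v_{k+1}(s) \;=\; \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr]
$$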
Pseudo code
while judge == True:
    judge = False
    for s in S:
        v_old = v(s)
        v(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v(s') ]
        if | v(s) - v_old | > error:
            judge = True
# optimal policy
π(s) = argmax_a Σ_{s', r} p(s', r | s, a) [ r + γ v(s') ]
Gambler's Problem
Case
A gambler bets on the outcomes of a sequence of coin flips. On heads he wins as many dollars as he staked; on tails he loses his stake. The game ends when the gambler reaches his goal of $100 or runs out of money. The coin comes up heads with probability p_h = 0.4.
Case analysis
S: the gambler's capital, s ∈ {0, 1, …, 100}, with terminal values v(0) = 0 and v(100) = 1 held constant
A: the stake, a ∈ {0, 1, …, min(s, 100 − s)}
R: no intermediate reward; the only reward is for reaching the goal, encoded here by fixing v(100) = 1
S′: s + a on a win or s − a on a loss, where the probability of winning (heads) is p_h = 0.4
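With these definitions the value iteration backup specializes to the form that max_v_a computes in the code below:

$$
v(s) \;=\; \max_{0 \le a \le \min(s,\,100-s)} \bigl[\, p_h\, v(s+a) \;+\; (1 - p_h)\, v(s-a) \,\bigr]
$$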
Code
### settings
import math
import numpy
# visualization
import matplotlib.pyplot as plt

# global settings
MAX_MONEY = 100
MIN_MONEY = 0
P_H = 0.4      # probability that the coin comes up heads
gamma = 1      # episodic task, no discounting
# accuracy threshold for the value function
error = 0.001
### functions
# maximal expected return over the legal stakes in state s
def max_v_a(v, s):
    MIN_ACTION = 0
    MAX_ACTION = min(s, MAX_MONEY - s)
    v_a = numpy.zeros(MAX_ACTION + 1 - MIN_ACTION, dtype=float)
    for a in range(MIN_ACTION, MAX_ACTION + 1):
        # win the stake with probability P_H, lose it otherwise
        v_a[a - MIN_ACTION] = (P_H * (0 + gamma * v[s + a]) +
                               (1 - P_H) * (0 + gamma * v[s - a]))
    return max(v_a)

# stake that maximizes the expected return in state s
def argmax_a(v, s):
    MIN_ACTION = 0
    MAX_ACTION = min(s, MAX_MONEY - s)
    v_a = numpy.zeros(MAX_ACTION + 1 - MIN_ACTION, dtype=float)
    for a in range(MIN_ACTION, MAX_ACTION + 1):
        v_a[a - MIN_ACTION] = (P_H * (0 + gamma * v[s + a]) +
                               (1 - P_H) * (0 + gamma * v[s - a]))
    return numpy.argmax(v_a)
# visualization
def visualization(v_set, policy):
    fig, axes = plt.subplots(2, 1)
    for i in range(0, len(v_set)):
        axes[0].plot(v_set[i], linewidth=3)
        # plt.pause(0.5)
    axes[0].set_title('value function')
    axes[1].plot(range(1, len(policy) + 1), policy)
    axes[1].set_title('policy')
    plt.show()
### main programming
# policy for the non-terminal states 1..99
policy = numpy.zeros(MAX_MONEY - 1)
# value function over states 0..100
v = numpy.zeros(MAX_MONEY + 1, dtype=float)
# value-function estimate after every sweep
v_set = []
# initialization of the terminal states
v[MAX_MONEY] = 1
v[MIN_MONEY] = 0
judge = True
# value iteration
while judge:
    judge = False
    for s in range(MIN_MONEY + 1, MAX_MONEY):
        v_old = v[s]
        v[s] = max_v_a(v, s)
        if math.fabs(v[s] - v_old) > error:
            judge = True
    v_set.append(v.copy())
# optimal (greedy) policy
for s in range(MIN_MONEY + 1, MAX_MONEY):
    policy[s - 1] = argmax_a(v, s)
# visualization
visualization(v_set, policy)
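The two helpers max_v_a and argmax_a recompute the same expected returns in a loop. As an optional refactor that is not part of the original post, they could share a single vectorized NumPy routine. A minimal sketch, assuming the globals MAX_MONEY, P_H, gamma and the value array v defined above, with action_values as a hypothetical helper name:

import numpy

# expected return of every legal stake a in state s, as one vector
def action_values(v, s):
    a = numpy.arange(0, min(s, MAX_MONEY - s) + 1)   # legal stakes 0..min(s, 100-s)
    return P_H * gamma * v[s + a] + (1 - P_H) * gamma * v[s - a]

# equivalent to the loop-based helpers:
#   max_v_a(v, s)  -> action_values(v, s).max()
#   argmax_a(v, s) -> action_values(v, s).argmax()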
Result
For p_h = 0.4, the resulting figure shows the value-function estimate after each sweep (upper panel, "value function") and the final greedy policy, i.e. the stake as a function of capital (lower panel, "policy").
This article introduced the value iteration algorithm from dynamic programming and analyzed it through the gambler's problem, in which a gambler with limited capital tries to reach a target amount by betting on coin flips. Value iteration finds the optimal policy, i.e. the stake with the maximal expected return at each step. The Python implementation above visualizes the value function and the resulting policy.