用蒙特卡罗方法来实现策略评估,即给定一个策略,找出价值函数。
重点是:
for _ in range(100):
# generate an episode using pi
states, rewards = play_game(grid, policy)
G = 0
T = len(states)
用蒙特卡罗方法来实现策略评估,即给定一个策略,找出价值函数。
重点是:
for _ in range(100):
# generate an episode using pi
states, rewards = play_game(grid, policy)
G = 0
T = len(states)