Q-learning is "bolder" than Sarsa: being off-policy, it explores more aggressively, heading straight toward the maximum reward without fear of falling into pits, while Sarsa plays it safe during trial and error, gradually learning to step on fewer mines.
Q-learning first updates the Q-table using the maximum Q-value available from the new state, and only afterwards chooses the next action; the update does not depend on the actual next action a_.
Sarsa chooses the next action a_ before updating the Q-table, and uses that chosen action's Q-value in the update.
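The contrast above can be sketched with the two update rules side by side. This is a minimal illustration on a tiny hypothetical Q-table (the alpha, gamma, and table values here are made up for demonstration, not taken from the maze code below):

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.9  # illustrative learning rate and discount factor

def q_learning_update(Q, s, a, r, s_):
    # Off-policy: bootstrap from the greedy (max) action in s_,
    # regardless of which action will actually be taken next.
    target = r + GAMMA * Q[s_].max()
    Q[s, a] += ALPHA * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_, a_):
    # On-policy: the next action a_ was already chosen before this
    # update, and the update bootstraps from that actual action.
    target = r + GAMMA * Q[s_, a_]
    Q[s, a] += ALPHA * (target - Q[s, a])

# Hypothetical 3-state, 2-action table where state 1 has one good and
# one bad action: Q-learning looks at the best one, Sarsa at the one
# the (exploring) policy actually picked.
Q1 = np.zeros((3, 2)); Q1[1] = [0.5, -1.0]
Q2 = Q1.copy()
q_learning_update(Q1, s=0, a=0, r=0.0, s_=1)           # uses max(0.5, -1.0) = 0.5
sarsa_update(Q2, s=0, a=0, r=0.0, s_=1, a_=1)          # uses Q[1, 1] = -1.0
print(round(Q1[0, 0], 5))   # Q-learning pushed toward the optimistic estimate
print(round(Q2[0, 0], 5))   # Sarsa pulled down by the risky action it took
```

Because Q-learning's target ignores the exploratory action actually taken, it keeps chasing the highest estimate even near a "pit", while Sarsa's target reflects the risk of its own exploration.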
1. maze_env.py
import numpy as np
import time
import sys

if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk

UNIT = 40    # pixels per grid cell
MAZE_H = 4   # grid height
MAZE_W = 4   # grid width


class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))
        self._build_maze()

    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                                height=MAZE_H * UNIT,
                                width=MAZE_W * UNIT)