Welcome to part 2 of the reinforcement learning tutorial series, covering Q-Learning. We've built our Q-table, which contains every possible discrete state. Next, we need a way to update the Q-values (one value per possible action, per unique state), which brings us to the Q-learning update rule:
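Q_{new}(s_t, a_t) = (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)
where α is the learning rate (LEARNING_RATE), γ is the discount (DISCOUNT), r_t is the reward we just received, and the max term is the best Q value available from the state we land in (the max_future_q you'll see in the code below).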
If you're like me, a math formula like that makes your head spin. Here's the formula in code:
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
That's a bit easier for me to digest! The only pieces here that we haven't covered yet are:
DISCOUNT
max_future_q
The DISCOUNT is a measure of how much we care about future reward over immediate reward. Typically, this value is fairly high, between 0 and 1. We want it high because the whole point of Q-learning is to learn a chain of events that ends with a positive outcome, so it's only natural to put greater weight on long-term gains than on short-term ones.
The max_future_q is grabbed after we've already performed our action, and then we update our previous value based partially on the best possible Q value from the next step. Over time, once we've reached the objective, this "reward" value slowly propagates backwards, one step at a time, episode by episode. It's a super basic concept, but the way it works is pretty neat!
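To make that concrete, here's a single update worked through with made-up numbers (the values are purely hypothetical, just to see the arithmetic; the constants match the ones we define below):
LEARNING_RATE = 0.1
DISCOUNT = 0.95

current_q = -1.2      # hypothetical Q value for the state/action pair we just took
max_future_q = 0.0    # best Q value in the state we landed in (say it's right at the goal)
reward = -1           # MountainCar hands out -1 per step

new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
print(new_q)  # about -1.18: the good Q value one step ahead nudged this one upward from -1.2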
Okay, so now we know everything we need to know. There really isn't much "algorithm" code to write; we mostly just need to write the surrounding logic.
We'll start with the following script:
import gym
import numpy as np
env = gym.make("MountainCar-v0")
env.reset()
# Make the state values discrete
DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE
# Randomly initialize the Q-table
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
done = False
while not done:
    action = 2
    new_state, reward, done, _ = env.step(action)
    print(reward, new_state)
    env.render()
#new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
Now, let's add some more constants:
# Q-Learning settings
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
The LEARNING_RATE is between 0 and 1, as is the DISCOUNT. EPISODES is how many iterations of the game we'd like to run.
Next, we need a quick helper function to convert our environment's "state," which currently holds continuous values (and would therefore make our Q-table absolutely enormous and take forever to learn), into a "discrete" state:
def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(int))
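If you want a quick sanity check of what this gives you, toss a couple of prints under the function (the exact output will vary a bit, since the reset position is random):
print(discrete_os_win_size)             # e.g. [0.09  0.007] - the width of each bucket
print(get_discrete_state(env.reset()))  # e.g. (7, 10) - an index into the 20x20x3 q_table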
The full code up to this point:
import gym
import numpy as np
env = gym.make("MountainCar-v0")
env.reset()
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(int))  # we use this tuple to look up the 3 Q values for the available actions in the q-table
done = False
while not done:
    action = 2
    new_state, reward, done, _ = env.step(action)
    print(reward, new_state)
    env.render()
#new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
Now we're ready to actually run this environment. First, let's move env.reset() to just before the while loop, grabbing the initial discrete state at the same time:
discrete_state = get_discrete_state(env.reset())
done = False
while not done:
    ...
Next, we replace action = 2 with:
# Grab the action with the highest Q value for this state
action = np.argmax(q_table[discrete_state])
Then, we want to grab the new discrete state:
new_discrete_state = get_discrete_state(new_state)
Now, we want to update the Q value. Note that we're updating the Q value for the action we already took, from the state we were in when we took it:
# If simulation did not end yet after last step - update Q table
if not done:
    # Maximum possible Q value in next step (for new state)
    max_future_q = np.max(q_table[new_discrete_state])
    # Current Q value (for current state and performed action)
    current_q = q_table[discrete_state + (action,)]
    # And here's our equation for a new Q value for current state and action
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    # Update Q table with new Q value
    q_table[discrete_state + (action,)] = new_q
If the simulation did end, we want to check whether that's because we reached the goal:
# Simulation ended (for any reason) - if goal position is achieved - update Q value with reward directly
elif new_state[0] >= env.goal_position:
    #q_table[discrete_state + (action,)] = reward
    q_table[discrete_state + (action,)] = 0
Now, we need to update the discrete_state variable:
discrete_state = new_discrete_state
Finally, we'll wrap up our code with:
env.close()
The full code up to this point:
import gym
import numpy as np
env = gym.make("MountainCar-v0")
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(int))  # we use this tuple to look up the 3 Q values for the available actions in the q-table
discrete_state = get_discrete_state(env.reset())
done = False
while not done:
    action = np.argmax(q_table[discrete_state])
    new_state, reward, done, _ = env.step(action)
    new_discrete_state = get_discrete_state(new_state)
    env.render()
    #new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    # If simulation did not end yet after last step - update Q table
    if not done:
        # Maximum possible Q value in next step (for new state)
        max_future_q = np.max(q_table[new_discrete_state])
        # Current Q value (for current state and performed action)
        current_q = q_table[discrete_state + (action,)]
        # And here's our equation for a new Q value for current state and action
        new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        # Update Q table with new Q value
        q_table[discrete_state + (action,)] = new_q
    # Simulation ended (for any reason) - if goal position is achieved - update Q value with reward directly
    elif new_state[0] >= env.goal_position:
        #q_table[discrete_state + (action,)] = reward
        q_table[discrete_state + (action,)] = 0
    discrete_state = new_discrete_state
env.close()
Alright, you should see one episode play out. Nothing too special. We're going to need a lot of episodes to train, though, so let's add a loop over episodes:
for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False
    while not done:
        action = np.argmax(q_table[discrete_state])
        new_state, reward, done, _ = env.step(action)
        ...
Next, we want to stop rendering every episode. We need to run thousands of iterations, and rendering the environment makes that take far longer. Instead, we can check in on the environment every n episodes, once the model has had a chance to learn something. For example, we add a constant like:
SHOW_EVERY = 1000
Then, in the loop code:
for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False
    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False
    while not done:
        action = np.argmax(q_table[discrete_state])
        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)
        if render:
            env.render()
Okay, great, but it doesn't look like the model is learning anything. Why not?
The Q-table was initialized randomly, and the agent gets a -1 reward for every step it takes. The only time the agent gets any other reward, well, nothing (a 0), is if it reaches the goal. We need the agent to actually reach the goal some of the time. Once it has reached it even once, it becomes more likely to reach it again, because the reward propagates backwards. As it becomes more likely to hit the goal, it hits it again and again... and eventually it just gets it. But how do we get there that first time?
ε!
Or, in plain terms: random moves.
While an agent learns an environment, it shifts from "exploration" to "exploitation." Right now, our model is greedy and always exploits the maximum Q value... but those Q values are worthless at the moment. We need the agent to explore!
For that, we're going to add the following values:
# Exploration settings
epsilon = 1 # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES//2
epsilon_decay_value = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)
Now, we want to decay this value every episode until we're done decaying it. We'll do this at the end of each episode (so basically at the very bottom):
...
# Decaying is being done every episode if episode number is within decaying range
if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
    epsilon -= epsilon_decay_value
env.close()
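If you're curious what this schedule actually does to epsilon over time, here's a tiny standalone sketch that just mirrors the constants above (the numbers are approximate):
EPISODES = 25000
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = 1 / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

epsilon = 1
for episode in range(EPISODES):
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value
    if episode in (0, 6250, 12500, 24999):
        print(episode, round(epsilon, 2))
# epsilon is ~1.0 at the start, ~0.5 a quarter of the way in, and ~0 (it dips a hair
# below zero) from the halfway point on, so the second half of training is pure exploitation.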
Now we just need to make use of epsilon. We'll use np.random.random() to pick a random number between 0 and 1. If that number is greater than the epsilon value, we go with the max Q value as before. Otherwise, we make a random move:
while not done:
    if np.random.random() > epsilon:
        # Get action from Q table
        action = np.argmax(q_table[discrete_state])
    else:
        # Get random action
        action = np.random.randint(0, env.action_space.n)
The full code up to this point:
# objective is to get the cart to the flag.
# for now, let's just move randomly:
import gym
import numpy as np
env = gym.make("MountainCar-v0")
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 25000
SHOW_EVERY = 3000
DISCRETE_OS_SIZE = [20, 20]
discrete_os_win_size = (env.observation_space.high - env.observation_space.low)/DISCRETE_OS_SIZE
# Exploration settings
epsilon = 1 # not a constant, going to be decayed
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES//2
epsilon_decay_value = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
def get_discrete_state(state):
    discrete_state = (state - env.observation_space.low)/discrete_os_win_size
    return tuple(discrete_state.astype(int))  # we use this tuple to look up the 3 Q values for the available actions in the q-table
for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False
    if episode % SHOW_EVERY == 0:
        render = True
        print(episode)
    else:
        render = False
    while not done:
        if np.random.random() > epsilon:
            # Get action from Q table
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action
            action = np.random.randint(0, env.action_space.n)
        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)
        if render:
            env.render()
        #new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        # If simulation did not end yet after last step - update Q table
        if not done:
            # Maximum possible Q value in next step (for new state)
            max_future_q = np.max(q_table[new_discrete_state])
            # Current Q value (for current state and performed action)
            current_q = q_table[discrete_state + (action,)]
            # And here's our equation for a new Q value for current state and action
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q
        # Simulation ended (for any reason) - if goal position is achieved - update Q value with reward directly
        elif new_state[0] >= env.goal_position:
            #q_table[discrete_state + (action,)] = reward
            q_table[discrete_state + (action,)] = 0
        discrete_state = new_discrete_state
    # Decaying is being done every episode if episode number is within decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value
env.close()
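One optional tweak while you let this train: if you'd like to see when the agent first makes it to the flag, you could drop a print into the goal branch (purely a convenience, not required):
        elif new_state[0] >= env.goal_position:
            # announce successes so you can see when (and how often) the agent reaches the flag
            print(f"We made it on episode {episode}")
            q_table[discrete_state + (action,)] = 0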