[Reinforcement Learning] Q-learning

This post implements the Q-learning algorithm in Python to solve the Taxi environment, walks through training the agent with an epsilon-greedy policy, and plots how the total reward per episode changes over 800 episodes of training. (The code targets the old gym API and the 'Taxi-v1' environment id; recent gym releases ship 'Taxi-v3' instead.)

import random
import matplotlib.pyplot as plt  # pylab is deprecated; use pyplot directly
# %matplotlib inline  # uncomment when running in a Jupyter notebook
import gym

# 'Taxi-v1' is the environment id from older gym releases; on current versions use 'Taxi-v3'
env = gym.make('Taxi-v1')
env.render()
print(env.observation_space.n)  # number of discrete states
print(env.action_space.n)       # number of discrete actions

Output:

500
6
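These numbers pin down the problem size: the 500 states come from 25 taxi positions (a 5 × 5 grid) × 5 passenger locations (the four depots plus "in the taxi") × 4 destinations, and the 6 actions are move south, move north, move east, move west, pickup, and dropoff. To inspect what a state id means, the underlying TaxiEnv exposes a decode helper; a quick sketch, assuming access through env.unwrapped (the state id 97 below is an arbitrary example, not from the original):

state_id = 97  # arbitrary example state, for illustration only
# decode yields (taxi_row, taxi_col, passenger_location, destination)
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state_id)
print(taxi_row, taxi_col, passenger_loc, destination)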
# Initialize the Q table with one entry per (state, action) pair, all zeros
q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        q[(s, a)] = 0.0
        
def update_q_table(prev_state, action, reward, nextstate, alpha, gamma):
    # Greedy estimate of the next state's value: the best Q value over all actions
    qa = max([q[(nextstate, a)] for a in range(env.action_space.n)])
    # Move Q(prev_state, action) toward the TD target reward + gamma * qa
    q[(prev_state, action)] += alpha * (reward + gamma * qa - q[(prev_state, action)])
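This function implements the standard Q-learning update rule:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a' Q(s', a') − Q(s, a) ]

where α is the learning rate and γ is the discount factor. One subtlety: as written, it bootstraps from max_a' Q(s', a') even when s' is terminal. That happens to be harmless here, because terminal states never occur as prev_state, so their Q values stay at zero, but a more defensive version would also take the done flag and drop the bootstrap term on terminal transitions.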

def epsilon_greedy_policy(state, epsilon):
    # With probability epsilon, explore: sample a random action
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    # Otherwise exploit: take the action with the highest Q value in this state
    else:
        return max(list(range(env.action_space.n)), key=lambda x: q[(state, x)])
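Here epsilon is fixed at a small constant, so the agent explores only rarely. A common variant starts with a large epsilon and decays it per episode, exploring broadly while the Q table is still poor and exploiting more as it improves. A minimal sketch of that schedule (the epsilon_start, epsilon_min, and decay names and values are illustrative, not part of the original code):

epsilon_start, epsilon_min, decay = 1.0, 0.01, 0.995
epsilon = epsilon_start
for episode in range(800):
    # ... run one episode, selecting actions with epsilon_greedy_policy(state, epsilon) ...
    epsilon = max(epsilon_min, epsilon * decay)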

alpha = 0.4      # learning rate
gamma = 0.999    # discount factor
epsilon = 0.017  # exploration rate for the epsilon-greedy policy
rewards = []     # total reward collected in each episode
for i in range(800):
    r = 0  # accumulated reward for this episode

    prev_state = env.reset()

    while True:

        # Rendering every step slows training down considerably;
        # uncomment the next line only to watch the agent act.
        # env.render()

        # In each state, we select an action with the epsilon-greedy policy
        action = epsilon_greedy_policy(prev_state, epsilon)
        
        # Then we perform the action, move to the next state, and receive the reward
        nextstate, reward, done, _ = env.step(action)
        
        # Next we update the Q value with our update_q_table function,
        # which applies the Q-learning update rule
        
        update_q_table(prev_state, action, reward, nextstate, alpha, gamma)
        
        # Finally we set the previous state to the next state
        prev_state = nextstate

        # Accumulate the reward obtained at this step
        r += reward

        # We break out of the loop when we reach the terminal state of the episode
        if done:
            break

    # print("total reward:", r)
    rewards.append(r)

env.close()
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.show()
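Once training finishes, it is worth sanity-checking the learned policy by acting purely greedily (epsilon = 0) for a few episodes. A minimal sketch under the same old-gym step/reset API used above (run it before env.close(), or create the environment again first):

# Evaluate the greedy policy for a few episodes
for _ in range(5):
    state = env.reset()
    total = 0
    while True:
        # Always pick the action with the highest Q value
        action = max(range(env.action_space.n), key=lambda x: q[(state, x)])
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    print("greedy episode reward:", total)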

Judging by the results of the 800-episode run, the agent has learned a reasonable policy: the total reward per episode goes from negative to positive. In fact, it has already stabilized after about 400 episodes.

[Figure: total reward per episode over 800 training episodes]
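The raw per-episode rewards are noisy, so the trend is easier to read after smoothing with a moving average. A short sketch (the window size of 50 is an arbitrary choice, not from the original):

import numpy as np

window = 50
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')
plt.plot(smoothed)
plt.xlabel('Episode')
plt.ylabel('Mean reward over last %d episodes' % window)
plt.show()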
