In my previous post I used the DQN agent from TensorFlow's TF-Agents library to train on the Atari PONG game, with very good results. This time I want to test the episode policy gradient (REINFORCE) model and see whether it can achieve the same result. For an introduction to the episode policy gradient algorithm, see my earlier post 强化学习笔记(5)-回合策略梯度算法_gzroy的博客-CSDN博客.
TF-Agents provides a ReinforceAgent that implements the episode policy gradient algorithm. This agent needs an actor network: given an observation of the environment, the network outputs one value (logit) per action, and applying Softmax to these values gives the policy $\pi_\theta(a|s)$, i.e. the probability of each action. The loss is defined as $L(\theta) = -\sum_t \ln \pi_\theta(a_t|s_t)\,G_t$, where $t$ is a step within the episode and $G_t$ is the discounted return from that step. Continually optimizing $\theta$ to reduce this loss increases the expected episode return and eventually yields the optimal policy.
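To make the loss concrete, here is a minimal sketch (my own illustration, not TF-Agents code) that computes the REINFORCE loss for a single episode from the network's logits, the actions actually taken, and the rewards; the names reinforce_loss, logits, actions and rewards are just placeholders for this example:
import tensorflow as tf

def reinforce_loss(logits, actions, rewards, gamma=0.99):
    # Discounted return G_t for every step, accumulated backwards through the episode.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = tf.constant(returns[::-1], dtype=tf.float32)
    # log pi(a_t|s_t) of the actions that were actually taken.
    log_probs = tf.nn.log_softmax(tf.constant(logits, dtype=tf.float32))
    log_prob_actions = tf.gather(log_probs, tf.constant(actions, dtype=tf.int32), batch_dims=1)
    # Minimizing -sum_t log pi(a_t|s_t) * G_t raises the probability of actions that led to high returns.
    return -tf.reduce_sum(log_prob_actions * returns)

# Toy episode: 3 steps, 2 possible actions.
print(reinforce_loss(logits=[[0.1, 0.2], [0.3, -0.1], [0.0, 0.0]],
                     actions=[1, 0, 1],
                     rewards=[0.0, 0.0, 1.0]))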
First, load the Atari game environment, as in the following code:
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.trajectories import trajectory, Trajectory, time_step, TimeStep
from tf_agents.specs import tensor_spec
from tqdm import trange
import tensorflow as tf
from tensorflow import keras
from tf_agents.agents import ReinforceAgent
from tf_agents.utils import common
from tf_agents.networks.actor_distribution_network import ActorDistributionNetwork
import random
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib notebook
import os
env_name = 'PongDeterministic-v4'
train_py_env = suite_gym.load(env_name, max_episode_steps=0)
eval_py_env = suite_gym.load(env_name, max_episode_steps=0)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
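As a quick check (my addition, not part of the original post), we can print the specs of the loaded environment. PongDeterministic-v4 has 6 discrete actions, and the raw observations are 210×160 RGB frames:
print(train_env.action_spec())       # scalar integer action in [0, 5]
print(train_env.observation_spec())  # raw frame: shape (210, 160, 3), dtype uint8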
Because we want to use the environment's 4 most recent frames as the network input, we need to adjust the environment's default TimeStep spec. Each raw frame is 210×160×3; we shrink it and convert it to grayscale, then stack the last 4 frames together, so the adjusted input is 110×84×4. See the following code:
input_shape = (110, 84, 4)
time_step_spec = train_env.time_step_spec()
new_observation_spec = tensor_spec.BoundedTensorSpec(input_shape, tf.uint8, 0, 255, 'observation')
new_time_step_spec = time_step.TimeStep(
    time_step_spec.step_type,
    time_step_spec.reward,
    time_step_spec.discount,
    new_observation_spec)
Next we define a ReinforceAgent. The neural network inside it has three convolutional layers that extract image features from the input observation, followed by a fully connected layer that outputs the action probabilities. (A quick sanity check of the network's output follows the code below.)
gamma = 0.99
learning_rate = 0.0005
actor_net = ActorDistributionNetwork(
    new_observation_spec,
    train_env.action_spec(),
    preprocessing_layers=tf.keras.layers.Rescaling(scale=1./127.5, offset=-1),
    conv_layer_params=[(32,8,4), (64,4,2), (64,3,1)],
    fc_layer_params=(512,))
global_step = tf.compat.v1.train.get_or_create_global_step()
agent = ReinforceAgent(
    new_time_step_spec,
    train_env.action_spec(),
    actor_network=actor_net,
    gamma=gamma,
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    train_step_counter=global_step)
agent.initialize()
action_min = agent.collect_data_spec.action.minimum
action_max = agent.collect_data_spec.action.maximum
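As a sanity check (my addition; the dummy input and the direct network call are only for illustration), we can feed a fake stacked observation through the actor network and confirm that it returns a categorical distribution over the Pong actions:
# Batch of one stacked 110x84x4 observation, same dtype as the spec.
dummy_obs = tf.zeros((1,) + input_shape, dtype=tf.uint8)
dist, _ = actor_net(dummy_obs, step_type=())
print(dist)           # a categorical action distribution
print(dist.sample())  # one sampled action, shape (1,)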
Define a helper function that takes the frame returned by the environment at each step, resizes it and converts it to grayscale, and stacks it with the previous three frames:
def get_observation(images, observation):
    # Drop the batch dimension, convert the frame to grayscale and resize it.
    image = tf.squeeze(observation)
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, [input_shape[0], input_shape[1]])
    image = tf.cast(image, tf.uint8)
    image = tf.squeeze(image)
    # At the start of an episode, fill the frame buffer with copies of the first frame.
    if len(images)==0:
        images = [image, image, image, image]
    # Keep only the latest 4 frames and stack them along the channel axis.
    images = images[1:]
    images.append(image)
    observation = tf.stack(images)
    observation = tf.transpose(observation, perm=[1,2,0])
    return images, observation
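To verify the helper (my addition; the dummy frame just mimics the shape of a raw Pong observation with a batch dimension), pass a fake frame through it and check the output shape:
dummy_frame = tf.zeros((1, 210, 160, 3), dtype=tf.uint8)  # one raw frame with a batch dimension
frames, stacked = get_observation([], dummy_frame)
print(len(frames), stacked.shape)  # 4 (110, 84, 4)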
Define a plotting helper class to display the loss values and episode rewards recorded during training:
class Chart:
    def __init__(self):
        self.fig, self.ax = plt.subplots(figsize=(8, 6))
    def plot(self, data, x_name, y_name, hue_name):
        self.ax.clear()
        sns.lineplot(data=data, x=data[x_name], y=data[y_name], hue=data[hue_name], ax=self.ax)
        self.fig.canvas.draw()
chart_reward = Chart()
chart_loss = Chart()
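For illustration (my addition, with made-up numbers), the chart can be exercised with a small DataFrame before training starts:
demo_df = pd.DataFrame({'step': [0, 100, 200], 'reward': [-21.0, -20.0, -19.0], 'type': ['train']*3})
chart_reward.plot(demo_df, 'step', 'reward', 'type')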
Because the whole training process takes a very long time, we define a checkpointer to periodically save the model parameters:
checkpoint_dir = os.path.join('./', 'checkpoint')
train_checkpointer = common.Checkpointer(
    ckpt_dir=checkpoint_dir,
    max_to_keep=1,
    agent=agent,
    policy=agent.policy,
    global_step=global_step
)
# If continuing training, restore the checkpoint and the previously recorded metrics
continue_training = False  # set to True to resume from a saved checkpoint
if continue_training:
    train_checkpointer.initialize_or_restore()
    global_step = tf.compat.v1.train.get_global_step()
    rewards_df = pd.read_csv('rewards_df.csv')
    loss_df = pd.read_csv('loss_df.csv')
else:
    rewards_df = pd.DataFrame(data=None, columns=['step','reward','type'])
    loss_df = pd.DataFrame(data=None, columns=['step','loss','type'])
Now we can start training. In each iteration the current model first plays one full episode to collect trajectory data; the collected data is wrapped into a Trajectory, and then the ReinforceAgent's train method is called on it. Every 100 iterations we log the loss and the episode reward to monitor how the training is going:
agent.train = common.function(agent.train)
num_iterations = 1000
total_loss = 0
for _ in trange(num_iterations):
    time_step = train_env.reset()
    images = []
    observations = []
    actions = []
    rewards = []
    discounts = []
    episode_reward = 0
    # Play one episode with the current policy and record the trajectory.
    while not time_step.is_last():
        images, observation = get_observation(images, time_step.observation)
        observations.append(observation)
        # Rebuild the time step with the stacked 4-frame observation before querying the policy.
        time_step = TimeStep(time_step.step_type, time_step.reward, time_step.discount, tf.expand_dims(observation, axis=0))
        action = tf.squeeze(agent.policy.action(time_step).action)
        actions.append(action)
        time_step = train_env.step(action)
        reward = tf.squeeze(time_step.reward)
        rewards.append(reward)
        discount = tf.squeeze(time_step.discount)
        discounts.append(discount)
        episode_reward += reward.numpy()
    # Pack the whole episode into a batched Trajectory and train on it.
    observation_t = tf.stack(observations)
    action_t = tf.stack(actions)
    reward_t = tf.stack(rewards)
    discount_t = tf.stack(discounts)
    traj = trajectory.from_episode(observation=observation_t, action=action_t, reward=reward_t, discount=discount_t, policy_info=reward_t)
    batch = tf.nest.map_structure(lambda t: tf.expand_dims(t, 0), traj)
    train_loss = agent.train(batch).loss
    total_loss += train_loss
    # Every 100 iterations, log the average loss and the latest episode reward,
    # refresh the charts, save a checkpoint and persist the metrics.
    if (global_step%100)==0:
        loss_df = loss_df.append({'step':global_step.numpy(), 'loss':total_loss.numpy()/100, 'type':'train'}, ignore_index=True)
        chart_loss.plot(loss_df, 'step', 'loss', 'type')
        rewards_df = rewards_df.append({'step':global_step.numpy(), 'reward':episode_reward, 'type':'train'}, ignore_index=True)
        chart_reward.plot(rewards_df, 'step', 'reward', 'type')
        train_checkpointer.save(global_step)
        rewards_df.to_csv('rewards_df.csv', index=False)
        loss_df.to_csv('loss_df.csv', index=False)
        total_loss = 0  # reset the running loss so the next log is again a 100-iteration average
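After (or during) training, the eval_env created at the beginning can be used to see how well the learned policy plays. Below is a sketch (my addition, reusing the frame-stacking helper and the same call pattern as the training loop) of running a single evaluation episode:
def evaluate_episode():
    # Play one episode on the evaluation environment with the current policy.
    time_step = eval_env.reset()
    images = []
    episode_reward = 0.0
    while not time_step.is_last():
        images, observation = get_observation(images, time_step.observation)
        eval_step = TimeStep(time_step.step_type, time_step.reward, time_step.discount,
                             tf.expand_dims(observation, axis=0))
        action = tf.squeeze(agent.policy.action(eval_step).action)
        time_step = eval_env.step(action)
        episode_reward += tf.squeeze(time_step.reward).numpy()
    return episode_reward

print('Evaluation episode reward:', evaluate_episode())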