In my previous post I used the DQN agent from TensorFlow's TF-Agents library to train on the Atari PONG game, with very good results. This time I want to test the episode policy gradient (REINFORCE) model and see whether it can achieve the same result. For an introduction to the episode policy gradient algorithm, see my earlier post 强化学习笔记(5)-回合策略梯度算法_gzroy的博客-CSDN博客.
TF-Agents provides a ReinforceAgent that implements the episode policy gradient algorithm. This agent needs an actor network: given an observation of the environment, the network outputs one value (logit) per action, and applying Softmax to these values gives the policy $\pi_\theta(a|s)$, i.e. the probability of each action. The loss is defined as $L(\theta) = -\sum_t \ln \pi_\theta(a_t|s_t)\,G_t$, where $t$ is a step within the episode and $G_t$ is the discounted return from that step. Continually optimizing $\theta$ to reduce this loss increases the expected episode return and eventually yields the optimal policy.
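To make the loss concrete, here is a minimal sketch (my own illustration, not TF-Agents code) that computes the REINFORCE loss for a single episode from the network's logits, the actions actually taken, and the rewards; the names reinforce_loss, logits, actions and rewards are just placeholders for this example:
import tensorflow as tf

def reinforce_loss(logits, actions, rewards, gamma=0.99):
    # Discounted return G_t for every step, accumulated backwards through the episode.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = tf.constant(returns[::-1], dtype=tf.float32)
    # log pi(a_t|s_t) of the actions that were actually taken.
    log_probs = tf.nn.log_softmax(tf.constant(logits, dtype=tf.float32))
    log_prob_actions = tf.gather(log_probs, tf.constant(actions, dtype=tf.int32), batch_dims=1)
    # Minimizing -sum_t log pi(a_t|s_t) * G_t raises the probability of actions that led to high returns.
    return -tf.reduce_sum(log_prob_actions * returns)

# Toy episode: 3 steps, 2 possible actions.
print(reinforce_loss(logits=[[0.1, 0.2], [0.3, -0.1], [0.0, 0.0]],
                     actions=[1, 0, 1],
                     rewards=[0.0, 0.0, 1.0]))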
First, load the Atari game environment, as in the following code:
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.trajectories import trajectory, Trajectory, time_step, TimeStep
from tf_agents.specs import tensor_spec
from tqdm import trange
import tensorflow as tf
from tensorflow import keras
from tf_agents.agents import ReinforceAgent
from tf_agents.utils import common
from tf_agents.networks.actor_distribution_network import ActorDistributionNetwork
import random
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib notebook
import os
env_name = 'PongDeterministic-v4'
train_py_env = suite_gym.load(env_name, max_episode_steps=0)
eval_py_env = suite_gym.load(env_name, max_episode_steps=0)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
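As a quick check (my addition, not part of the original post), we can print the specs of the loaded environment. PongDeterministic-v4 has 6 discrete actions, and the raw observations are 210×160 RGB frames:
print(train_env.action_spec())       # scalar integer action in [0, 5]
print(train_env.observation_spec())  # raw frame: shape (210, 160, 3), dtype uint8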
Because we want to use the environment's 4 most recent frames as the network input, we need to adjust the environment's default TimeStep spec. Each raw frame is 210×160×3; we shrink it and convert it to grayscale, then stack the last 4 frames together, so the adjusted input is 110×84×4. See the following code:
input_shape = (110, 84, 4)
time_step_spec = train_env.time_step_spec()
new_observation_spec = tensor_spec.BoundedTensorSpec(input_shape, tf.uint8, 0, 255, 'observation')
new_time_step_spec = time_step.TimeStep(
    time_step_spec.step_type,
    time_step_spec.reward,
    time_step_spec.discount,
    new_observation_spec)
Next we define a ReinforceAgent. The neural network inside it has three convolutional layers that extract image features from the input observation, followed by a fully connected layer that outputs the action probabilities. (A quick sanity check of the network's output follows the code below.)
gamma = 0.99
learning_rate = 0.0005
actor_net = ActorDistributionNetwork(
    new_observation_spec,
    train_env.action_spec(),
    preprocessing_layers=tf.keras.layers.Rescaling(scale=1./127.5, offset=-1),
    conv_layer_params=[(32,8,4), (64,4,2), (64,3,1)],
    fc_layer_params=(512,))
global_step = tf.compat.v1.train.get_or_create_global_step()
agent = ReinforceAgent(
    new_time_step_spec,
    train_env.action_spec(),
    actor_network=actor_net,
    gamma=gamma,
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
    train_step_counter=global_step)
agent.initialize()
action_min = agent.collect_data_spec.action.minimum
action_max = agent.collect_data_spec.action.maximum
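As a sanity check (my addition; the dummy input and the direct network call are only for illustration), we can feed a fake stacked observation through the actor network and confirm that it returns a categorical distribution over the Pong actions:
# Batch of one stacked 110x84x4 observation, same dtype as the spec.
dummy_obs = tf.zeros((1,) + input_shape, dtype=tf.uint8)
dist, _ = actor_net(dummy_obs, step_type=())
print(dist)           # a categorical action distribution
print(dist.sample())  # one sampled action, shape (1,)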
Define a helper function that takes the frame returned by the environment at each step, resizes it and converts it to grayscale, and stacks it with the previous three frames:
def get_observation(images, observation):
    # Drop the batch dimension, convert the frame to grayscale and resize it.
    image = tf.squeeze(observation)
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, [input_shape[0], input_shape[1]])
    image = tf.cast(image, tf.uint8)
    image = tf.squeeze(image)
    # At the start of an episode, fill the frame buffer with copies of the first frame.
    if len(images)==0:
        images = [image, image, image, image]
    # Keep only the latest 4 frames and stack them along the channel axis.
    images = images[1:]
    images.append(image)
    observation = tf.stack(images)
    observation = tf.transpose(observation, perm=[1,2,0])
    return images, observation
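To verify the helper (my addition; the dummy frame just mimics the shape of a raw Pong observation with a batch dimension), pass a fake frame through it and check the output shape:
dummy_frame = tf.zeros((1, 210, 160, 3), dtype=tf.uint8)  # one raw frame with a batch dimension
frames, stacked = get_observation([], dummy_frame)
print(len(frames), stacked.shape)  # 4 (110, 84, 4)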
Define a plotting helper class to display the loss values and episode rewards recorded during training:
class Chart:
    def __init__(self):
        self.fig, self.ax = plt.subplots(figsize=(8, 6))
    def plot(self, data, x_name, y_name, hue_name):
        self.ax.clear()
        sns.lineplot(data=data, x=data[x_name], y=data[y_name], hue=data[hue_name], ax=self.ax)
        self.fig.canvas.draw()
chart_reward = Chart()
chart_loss = Chart()
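For illustration (my addition, with made-up numbers), the chart can be exercised with a small DataFrame before training starts:
demo_df = pd.DataFrame({'step': [0, 100, 200], 'reward': [-21.0, -20.0, -19.0], 'type': ['train']*3})
chart_reward.plot(demo_df, 'step', 'reward', 'type')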
Because the whole training process takes a very long time, we define a checkpointer to periodically save the model parameters:
checkpoint_dir = os.path.join('./', 'checkpoint')
train_checkpointer = common.Checkpointer(
    ckpt_dir=checkpoint_dir,
    max_to_keep=1,
    agent=agent,
    policy=agent.policy,
    global_step=global_step
)
# If continuing training, restore the checkpoint and the previously recorded metrics
continue_training = False  # set to True to resume from a saved checkpoint
if continue_training:
    train_checkpointer.initialize_or_restore()
    global_step = tf.compat.v1.train.get_global_step()
    rewards_df = pd.read_csv('rewards_df.csv')
    loss_df = pd.read_csv('loss_df.csv')
else:
    rewards_df = pd.DataFrame(data=None, columns=['step','reward','type'])
    loss_df = pd.DataFrame(data=None, columns=['step','loss','type'])
Now we can start training. In each iteration the current model first plays one full episode to collect trajectory data; the collected data is wrapped into a Trajectory, and then the ReinforceAgent's train method is called on it. Every 100 iterations we log the loss and the episode reward to monitor how the training is going:
agent.train = common.function(agent.train)
num_iterations = 1000
total_loss = 0
for _ in trange(num_iterations):
    time_step = train_env.reset()
    images = []
    observations = []
    actions = []
    rewards = []
    discounts = []
    episode_reward = 0
    # Play one episode with the current policy and record the trajectory.
    while not time_step.is_last():
        images, observation = get_observation(images, time_step.observation)
        observations.append(observation)
        # Rebuild the time step with the stacked 4-frame observation before querying the policy.
        time_step = TimeStep(time_step.step_type, time_step.reward, time_step.discount, tf.expand_dims(observation, axis=0))
        action = tf.squeeze(agent.policy.action(time_step).action)
        actions.append(action)
        time_step = train_env.step(action)
        reward = tf.squeeze(time_step.reward)
        rewards.append(reward)
        discount = tf.squeeze(time_step.discount)
        discounts.append(discount)
        episode_reward += reward.numpy()
    # Pack the whole episode into a batched Trajectory and train on it.
    observation_t = tf.stack(observations)
    action_t = tf.stack(actions)
    reward_t = tf.stack(rewards)
    discount_t = tf.stack(discounts)
    traj = trajectory.from_episode(observation=observation_t, action=action_t, reward=reward_t, discount=discount_t, policy_info=reward_t)
    batch = tf.nest.map_structure(lambda t: tf.expand_dims(t, 0), traj)
    train_loss = agent.train(batch).loss
    total_loss += train_loss
    # Every 100 iterations, log the average loss and the latest episode reward,
    # refresh the charts, save a checkpoint and persist the metrics.
    if (global_step%100)==0:
        loss_df = loss_df.append({'step':global_step.numpy(), 'loss':total_loss.numpy()/100, 'type':'train'}, ignore_index=True)
        chart_loss.plot(loss_df, 'step', 'loss', 'type')
        rewards_df = rewards_df.append({'step':global_step.numpy(), 'reward':episode_reward, 'type':'train'}, ignore_index=True)
        chart_reward.plot(rewards_df, 'step', 'reward', 'type')
        train_checkpointer.save(global_step)
        rewards_df.to_csv('rewards_df.csv', index=False)
        loss_df.to_csv('loss_df.csv', index=False)
        total_loss = 0  # reset the running loss so the next log is again a 100-iteration average
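After (or during) training, the eval_env created at the beginning can be used to see how well the learned policy plays. Below is a sketch (my addition, reusing the frame-stacking helper and the same call pattern as the training loop) of running a single evaluation episode:
def evaluate_episode():
    # Play one episode on the evaluation environment with the current policy.
    time_step = eval_env.reset()
    images = []
    episode_reward = 0.0
    while not time_step.is_last():
        images, observation = get_observation(images, time_step.observation)
        eval_step = TimeStep(time_step.step_type, time_step.reward, time_step.discount,
                             tf.expand_dims(observation, axis=0))
        action = tf.squeeze(agent.policy.action(eval_step).action)
        time_step = eval_env.step(action)
        episode_reward += tf.squeeze(time_step.reward).numpy()
    return episode_reward

print('Evaluation episode reward:', evaluate_episode())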