Understand the classic reinforcement-learning environment CartPole_v1, as a foundation for building your own experimental interaction environment.
1. Environment Introduction
1.1 Environment Overview
CartPole is the classic cart-pole game. The environment contains a cart with a pole standing upright on it, and each reset starts from a slightly different initial state. The cart must move left and right to keep the pole upright; the episode continues only while both of the following hold:
- the pole's tilt angle stays within [-12°, 12°];
- the cart's position stays within [-2.4, 2.4] units of length.
1.2 Environment Parameters
[State]
As a classic RL environment, CartPole_v1 has a state consisting of 4 parameters:
| Observation | Meaning | Min | Max |
|---|---|---|---|
| Cart Position | cart position | -4.8 | 4.8 |
| Cart Velocity | cart velocity | -Inf | Inf |
| Pole Angle | pole angle | -24° | 24° |
| Pole Angular Velocity | pole angular velocity | -Inf | Inf |

(The observation bounds for cart position and pole angle are set to twice the corresponding termination thresholds.)
[Action]
In CartPole_v1 the action space is just {0, 1}, encoding the direction of the applied force; this force feeds into both the pole's angular acceleration and the cart's acceleration.

| Action value | Meaning |
|---|---|
| 0 | Negative force (push cart to the left) |
| 1 | Positive force (push cart to the right) |
[Reward]
A reward of +1 is granted for every frame survived. A frame is one interaction with the environment (one env.step(action) call) in which neither the cart position nor the pole angle exceeds its threshold; the maximum cumulative reward is 500.
[Termination]
- Termination: the pole angle exceeds ±12°
- Termination: the cart position exceeds ±2.4 (the center of the cart reaches the edge of the display)
- Truncation: the episode length exceeds 500 (200 for v0)
CartPole_v1 is therefore an environment with a continuous state vector and a discrete action variable.
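To make the spaces and the reward loop concrete, here is a minimal random-rollout sketch. It instantiates the environment class directly (gym wrapper versions differ in whether step() returns 4 or 5 values, so this targets the 5-tuple API of the source listed in Section 2.2) and applies the 500-step truncation by hand:

```python
# A minimal sketch: random rollout against the raw CartPole environment class.
from gym.envs.classic_control.cartpole import CartPoleEnv

env = CartPoleEnv()
print(env.observation_space)  # Box(4,): the four state parameters above
print(env.action_space)       # Discrete(2)

state = env.reset()
total_reward, terminated, steps = 0.0, False, 0
while not terminated and steps < 500:    # 500-step truncation, done by hand
    action = env.action_space.sample()   # random action in {0, 1}
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1
print("episode return:", total_reward)
```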
2. Environment Source Code
2.1 Where the Source Lives
(1) Find the cartpole.py file under your gym installation path:
~envs\<your-conda-env-name>\Lib\site-packages\gym\envs\classic_control\cartpole.py
(2) The openai/gym GitHub repository:
https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
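Alternatively, Python can report the installed location directly (a small sketch, assuming gym is importable):

```python
# Print the path of the installed cartpole.py instead of searching manually.
from gym.envs.classic_control import cartpole

print(cartpole.__file__)
```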
2.2 Full Source Code
"""
Classic cart-pole system implemented by Rich Sutton et al.
Copied from http://incompleteideas.net/sutton/book/code/pole.c
permalink: https://perma.cc/C9ZM-652R
"""
import math
from typing import Optional, Union
import numpy as np
import gym
from gym import logger, spaces
from gym.envs.classic_control import utils
from gym.error import DependencyNotInstalled
from gym.utils.renderer import Renderer
class CartPoleEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):
metadata = {
"render_modes": ["human", "rgb_array", "single_rgb_array"],
"render_fps": 50,
}
def __init__(self, render_mode: Optional[str] = None):
self.gravity = 9.8
self.masscart = 1.0
self.masspole = 0.1
self.total_mass = self.masspole + self.masscart
self.length = 0.5 # actually half the pole's length
self.polemass_length = self.masspole * self.length
self.force_mag = 10.0
self.tau = 0.02 # seconds between state updates
self.kinematics_integrator = "euler"
# Angle at which to fail the episode
self.theta_threshold_radians = 12 * 2 * math.pi / 360
self.x_threshold = 2.4
# Angle limit set to 2 * theta_threshold_radians so failing observation
# is still within bounds.
high = np.array(
[
self.x_threshold * 2,
np.finfo(np.float32).max,
self.theta_threshold_radians * 2,
np.finfo(np.float32).max,
],
dtype=np.float32,
)
self.action_space = spaces.Discrete(2)
self.observation_space = spaces.Box(-high, high, dtype=np.float32)
self.render_mode = render_mode
self.renderer = Renderer(self.render_mode, self._render)
self.screen_width = 600
self.screen_height = 400
self.screen = None
self.clock = None
self.isopen = True
self.state = None
self.steps_beyond_terminated = None
def step(self, action):
err_msg = f"{action!r} ({type(action)}) invalid"
assert self.action_space.contains(action), err_msg
assert self.state is not None, "Call reset before using step method."
x, x_dot, theta, theta_dot = self.state
force = self.force_mag if action == 1 else -self.force_mag
costheta = math.cos(theta)
sintheta = math.sin(theta)
# For the interested reader:
# https://coneural.org/florian/papers/05_cart_pole.pdf
temp = (
force + self.polemass_length * theta_dot**2 * sintheta
) / self.total_mass
thetaacc = (self.gravity * sintheta - costheta * temp) / (
self.length * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
)
xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass
if self.kinematics_integrator == "euler":
x = x + self.tau * x_dot
x_dot = x_dot + self.tau * xacc
theta = theta + self.tau * theta_dot
theta_dot = theta_dot + self.tau * thetaacc
else: # semi-implicit euler
x_dot = x_dot + self.tau * xacc
x = x + self.tau * x_dot
theta_dot = theta_dot + self.tau * thetaacc
theta = theta + self.tau * theta_dot
self.state = (x, x_dot, theta, theta_dot)
terminated = bool(
x < -self.x_threshold
or x > self.x_threshold
or theta < -self.theta_threshold_radians
or theta > self.theta_threshold_radians
)
if not terminated:
reward = 1.0
elif self.steps_beyond_terminated is None:
# Pole just fell!
self.steps_beyond_terminated = 0
reward = 1.0
else:
if self.steps_beyond_terminated == 0:
logger.warn(
"You are calling 'step()' even though this "
"environment has already returned terminated = True. You "
"should always call 'reset()' once you receive 'terminated = "
"True' -- any further steps are undefined behavior."
)
self.steps_beyond_terminated += 1
reward = 0.0
self.renderer.render_step()
return np.array(self.state, dtype=np.float32), reward, terminated, False, {}
def reset(
self,
*,
seed: Optional[int] = None,
return_info: bool = False,
options: Optional[dict] = None,
):
super().reset(seed=seed)
# Note that if you use custom reset bounds, it may lead to out-of-bound
# state/observations.
low, high = utils.maybe_parse_reset_bounds(
options, -0.05, 0.05 # default low
) # default high
self.state = self.np_random.uniform(low=low, high=high, size=(4,))
self.steps_beyond_terminated = None
self.renderer.reset()
self.renderer.render_step()
if not return_info:
return np.array(self.state, dtype=np.float32)
else:
return np.array(self.state, dtype=np.float32), {}
def render(self, mode="human"):
if self.render_mode is not None:
return self.renderer.get_renders()
else:
return self._render(mode)
def _render(self, mode="human"):
assert mode in self.metadata["render_modes"]
try:
import pygame
from pygame import gfxdraw
except ImportError:
raise DependencyNotInstalled(
"pygame is not installed, run `pip install gym[classic_control]`"
)
if self.screen is None:
pygame.init()
if mode == "human":
pygame.display.init()
self.screen = pygame.display.set_mode(
(self.screen_width, self.screen_height)
)
else: # mode in {"rgb_array", "single_rgb_array"}
self.screen = pygame.Surface((self.screen_width, self.screen_height))
if self.clock is None:
self.clock = pygame.time.Clock()
world_width = self.x_threshold * 2
scale = self.screen_width / world_width
polewidth = 10.0
polelen = scale * (2 * self.length)
cartwidth = 50.0
cartheight = 30.0
if self.state is None:
return None
x = self.state
self.surf = pygame.Surface((self.screen_width, self.screen_height))
self.surf.fill((255, 255, 255))
l, r, t, b = -cartwidth / 2, cartwidth / 2, cartheight / 2, -cartheight / 2
axleoffset = cartheight / 4.0
cartx = x[0] * scale + self.screen_width / 2.0 # MIDDLE OF CART
carty = 100 # TOP OF CART
cart_coords = [(l, b), (l, t), (r, t), (r, b)]
cart_coords = [(c[0] + cartx, c[1] + carty) for c in cart_coords]
gfxdraw.aapolygon(self.surf, cart_coords, (0, 0, 0))
gfxdraw.filled_polygon(self.surf, cart_coords, (0, 0, 0))
l, r, t, b = (
-polewidth / 2,
polewidth / 2,
polelen - polewidth / 2,
-polewidth / 2,
)
pole_coords = []
for coord in [(l, b), (l, t), (r, t), (r, b)]:
coord = pygame.math.Vector2(coord).rotate_rad(-x[2])
coord = (coord[0] + cartx, coord[1] + carty + axleoffset)
pole_coords.append(coord)
gfxdraw.aapolygon(self.surf, pole_coords, (202, 152, 101))
gfxdraw.filled_polygon(self.surf, pole_coords, (202, 152, 101))
gfxdraw.aacircle(
self.surf,
int(cartx),
int(carty + axleoffset),
int(polewidth / 2),
(129, 132, 203),
)
gfxdraw.filled_circle(
self.surf,
int(cartx),
int(carty + axleoffset),
int(polewidth / 2),
(129, 132, 203),
)
gfxdraw.hline(self.surf, 0, self.screen_width, carty, (0, 0, 0))
self.surf = pygame.transform.flip(self.surf, False, True)
self.screen.blit(self.surf, (0, 0))
if mode == "human":
pygame.event.pump()
self.clock.tick(self.metadata["render_fps"])
pygame.display.flip()
elif mode in {"rgb_array", "single_rgb_array"}:
return np.transpose(
np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
)
def close(self):
if self.screen is not None:
import pygame
pygame.display.quit()
pygame.quit()
self.isopen = False
2.3 Source Walkthrough
Since the goal is to build a custom environment, the rendering code is of no interest here; the focus is on the functions that take part in environment interaction.
(1) __init__()
The initializer mainly defines constants that feed the physics model and the threshold checks.
[Key code with comments]
```python
self.gravity = 9.8        # gravitational acceleration g
self.masscart = 1.0       # mass of the cart
self.masspole = 0.1       # mass of the pole
self.total_mass = self.masspole + self.masscart  # 0.1 + 1.0 = 1.1, combined mass of cart and pole
self.length = 0.5         # actually half the pole's length; used to compute the rendered pole height
self.polemass_length = self.masspole * self.length  # 0.1 * 0.5 = 0.05, used in the acceleration terms
self.force_mag = 10.0     # magnitude of the force applied to the cart on each action
self.tau = 0.02           # seconds between state updates
self.kinematics_integrator = "euler"  # kinematics integrator used to compute the next state

# Angle at which to fail the episode
self.theta_threshold_radians = 12 * 2 * math.pi / 360  # 12 degrees converted to radians (~0.209)
self.x_threshold = 2.4    # position threshold for failing the episode
```
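As a quick sanity check (a sketch, assuming gym is installed), the thresholds can be read back from a live environment:

```python
import math

import gym

env = gym.make("CartPole-v1")
print(math.degrees(env.unwrapped.theta_threshold_radians))  # ~12.0 degrees
print(env.unwrapped.x_threshold)                            # 2.4
```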
(2) reset()
[Key code]
```python
low, high = utils.maybe_parse_reset_bounds(
    options, -0.05, 0.05  # default low
)  # default high
self.state = self.np_random.uniform(low=low, high=high, size=(4,))
self.steps_beyond_terminated = None
```
- Sets the default lower and upper bounds for all 4 state parameters to (-0.05, 0.05);
- Draws the 4 state parameters uniformly at random within those bounds;
- Resets the steps_beyond_terminated attribute to None. This attribute tracks how many steps have been taken after the environment terminated, as a safety measure against continuing to step an already-terminated environment.
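Since reset() obtains its bounds from options via maybe_parse_reset_bounds, a wider initial-state range can be requested at reset time. A small sketch, assuming this gym version accepts "low"/"high" keys in options:

```python
import gym

env = gym.make("CartPole-v1")
# Request a wider initial-state range than the default (-0.05, 0.05);
# the "low"/"high" keys are parsed by maybe_parse_reset_bounds.
state = env.reset(seed=42, options={"low": -0.1, "high": 0.1})
```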
(3) step(self, action)
① Read the current state and convert the action into a force
```python
x, x_dot, theta, theta_dot = self.state
force = self.force_mag if action == 1 else -self.force_mag
```
② Compute the pole's angular acceleration thetaacc and the cart's acceleration xacc from the current angle and angular velocity (the underlying equations are spelled out after the snippet)
```python
costheta = math.cos(theta)
sintheta = math.sin(theta)
temp = (
    force + self.polemass_length * theta_dot**2 * sintheta
) / self.total_mass
thetaacc = (self.gravity * sintheta - costheta * temp) / (
    self.length * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
)
xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass
```
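In mathematical form, this snippet implements the frictionless cart-pole dynamics from the paper referenced in the source comment (Florian, 2005). Writing $m_c$ and $m_p$ for the cart and pole masses, $l$ for half the pole length, $F$ for the applied force, and $g$ for gravity:

$$
\ddot{\theta} = \frac{g\sin\theta - \cos\theta \cdot \dfrac{F + m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p}}{l\left(\dfrac{4}{3} - \dfrac{m_p \cos^2\theta}{m_c + m_p}\right)},
\qquad
\ddot{x} = \frac{F + m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p} - \frac{m_p l \ddot{\theta} \cos\theta}{m_c + m_p}
$$

The variable temp in the code is the shared term $\dfrac{F + m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p}$.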
③ Update the four state parameters with the configured integrator (euler)
```python
if self.kinematics_integrator == "euler":
    x = x + self.tau * x_dot
    x_dot = x_dot + self.tau * xacc
    theta = theta + self.tau * theta_dot
    theta_dot = theta_dot + self.tau * thetaacc
else:  # semi-implicit euler
    x_dot = x_dot + self.tau * xacc
    x = x + self.tau * x_dot
    theta_dot = theta_dot + self.tau * thetaacc
    theta = theta + self.tau * theta_dot
self.state = (x, x_dot, theta, theta_dot)
```
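The two branches differ only in update order: explicit Euler advances the position with the old velocity, $x_{t+1} = x_t + \tau \dot{x}_t$ followed by $\dot{x}_{t+1} = \dot{x}_t + \tau \ddot{x}_t$, while semi-implicit Euler first updates the velocity and then uses the new velocity for the position, $\dot{x}_{t+1} = \dot{x}_t + \tau \ddot{x}_t$, $x_{t+1} = x_t + \tau \dot{x}_{t+1}$ (and likewise for $\theta$); the semi-implicit variant is generally the more numerically stable choice.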
④ Check whether the cart position or pole angle exceeds its threshold to decide termination, and assign the reward accordingly
```python
terminated = bool(
    x < -self.x_threshold
    or x > self.x_threshold
    or theta < -self.theta_threshold_radians
    or theta > self.theta_threshold_radians
)
if not terminated:
    reward = 1.0
elif self.steps_beyond_terminated is None:
    # Pole just fell!
    self.steps_beyond_terminated = 0
    reward = 1.0
else:
    if self.steps_beyond_terminated == 0:
        logger.warn(
            "You are calling 'step()' even though this "
            "environment has already returned terminated = True. You "
            "should always call 'reset()' once you receive 'terminated = "
            "True' -- any further steps are undefined behavior."
        )
    self.steps_beyond_terminated += 1
    reward = 0.0
```
⑤ Return the result of the single-step interaction. Note that the hard-coded False is the truncated flag: the 500-step truncation is not handled here but by the TimeLimit wrapper that gym.make adds around the environment.
```python
return np.array(self.state, dtype=np.float32), reward, terminated, False, {}
```
3. Summary
The CartPole_v1 source is fairly simple and easy to follow, but given the goal of building a custom environment, a short summary is still worthwhile. Before starting, I believe the following groundwork should be in place:
- Settle on the algorithm framework: start from example programs available online, get an algorithm (e.g., A2C or PPO) running on a simple environment such as CartPole, and use that algorithm to identify exactly which parts interact with the environment. (If you only have data and no environment interaction is possible, use offline RL instead.)
- Model the problem as an MDP: define the state space, action space, reward design, and terminal states; if the problem turns out to lack the sequential decision-making property, consider methods other than RL.
- Determine the computational model (most important): in CartPole the key interaction function is step(self, action), where the next state is derived from the current state via the euler integrator, and the crux is the model for the cart's acceleration and the pole's angular acceleration. Environment interaction essentially boils down to "update the state and return the immediate reward", so the most critical task is pinning down the computational model that updates the state in your experiment.
- Design the interaction functions: with the above in place, you can build your own experimental environment on top of a simple skeleton such as CartPole, as sketched below.
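A minimal custom-environment skeleton following the CartPole structure might look like this. Everything here (the class name, state bounds, dynamics, and reward) is a hypothetical placeholder to be replaced by your own computational model; it targets the same 5-tuple step API as the source above:

```python
import numpy as np

import gym
from gym import spaces


class MyCustomEnv(gym.Env):
    """A hypothetical skeleton mirroring CartPole's structure; the bounds,
    dynamics, thresholds, and reward below are placeholders."""

    def __init__(self):
        self.tau = 0.02  # integration step, as in CartPole
        high = np.array([1.0, 1.0], dtype=np.float32)  # placeholder bounds
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.state = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Randomize the initial state, as CartPole does.
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(2,))
        return np.array(self.state, dtype=np.float32)

    def step(self, action):
        x, x_dot = self.state
        # The computational model goes here: derive accelerations/rates
        # from the current state and action (placeholder dynamics).
        x_acc = 1.0 if action == 1 else -1.0
        x_dot = x_dot + self.tau * x_acc         # semi-implicit euler update
        x = x + self.tau * x_dot
        self.state = (x, x_dot)
        terminated = bool(abs(x) > 1.0)          # placeholder threshold
        reward = 1.0 if not terminated else 0.0  # placeholder reward
        return np.array(self.state, dtype=np.float32), reward, terminated, False, {}
```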