[Hands-on Reinforcement Learning] Extra 1: A Detailed Look at the CartPole_v1 Environment Source Code, with Advice for Building Your Own Environment

Understanding the classic reinforcement learning environment CartPole_v1 lays the groundwork for building your own experimental interaction environment.


1. Environment Introduction

1.1 Environment Overview

Cart Pole is the classic cart-and-pole game, modeled as in the figure below. A cart carries a pole standing upright on top of it, and each reset starts from a slightly different initial state. The cart must move left or right to keep the pole upright, and the episode continues only while both of the following conditions hold:

The pole's tilt angle stays within [-12°, 12°];
The cart's position stays within [-2.4, 2.4] length units.

(Figure: the cart-pole system)

1.2 Environment Parameters

【State】
As a classic RL environment, CartPole_v1 has a state consisting of 4 components:

Observation               Min      Max
Cart Position             -4.8     4.8
Cart Velocity             -Inf     Inf
Pole Angle                -24°     24°
Pole Angular Velocity     -Inf     Inf

(The observation bounds for cart position and pole angle are set to twice the corresponding termination thresholds, so a failing observation is still within bounds.)

【Action】
In CartPole_v1 the action space is simply {0, 1}, encoding the applied force. The force feeds into both the pole's angular acceleration and the cart's acceleration.

Action value   Meaning
0              Push cart to the left (negative force)
1              Push cart to the right (positive force)

【Reward】
The reward is +1 for every frame survived. A frame is one interaction with the environment (one call to env.step(action)) in which neither the cart position nor the pole angle has exceeded its threshold; the maximum cumulative reward is 500, since episodes are truncated at 500 steps.

【Termination】

  1. Terminated: the pole angle exceeds ±12°
  2. Terminated: the cart position exceeds ±2.4 (the center of the cart reaches the edge of the display)
  3. Truncated: the episode length exceeds 500 (200 for v0)

CartPole_v1 is therefore an environment with a continuous state vector and a discrete action variable.
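
A quick way to confirm these spaces is to build the environment and print them. A minimal sketch (assuming a standard gym install where the CartPole-v1 id is registered; the exact printed bounds may vary slightly by version):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space)      # Box of shape (4,): bounds roughly (±4.8, ±inf, ±0.418 rad ≈ ±24°, ±inf)
print(env.action_space)           # Discrete(2)
print(env.action_space.sample())  # a random action: 0 or 1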

2. Environment Source Code

2.1 Where to Find the Source

(1) Locate the cartpole.py file under your gym installation path:

~envs\【your conda environment name】\Lib\site-packages\gym\envs\classic_control\cartpole.py

(2) The openai/gym GitHub repository, linked here:
https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

2.2 Full Source Code

"""
Classic cart-pole system implemented by Rich Sutton et al.
Copied from http://incompleteideas.net/sutton/book/code/pole.c
permalink: https://perma.cc/C9ZM-652R
"""
import math
from typing import Optional, Union

import numpy as np

import gym
from gym import logger, spaces
from gym.envs.classic_control import utils
from gym.error import DependencyNotInstalled
from gym.utils.renderer import Renderer


class CartPoleEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):
    
    metadata = {
        "render_modes": ["human", "rgb_array", "single_rgb_array"],
        "render_fps": 50,
    }

    def __init__(self, render_mode: Optional[str] = None):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masspole + self.masscart
        self.length = 0.5  # actually half the pole's length
        self.polemass_length = self.masspole * self.length
        self.force_mag = 10.0
        self.tau = 0.02  # seconds between state updates
        self.kinematics_integrator = "euler"

        # Angle at which to fail the episode
        self.theta_threshold_radians = 12 * 2 * math.pi / 360
        self.x_threshold = 2.4

        # Angle limit set to 2 * theta_threshold_radians so failing observation
        # is still within bounds.
        high = np.array(
            [
                self.x_threshold * 2,
                np.finfo(np.float32).max,
                self.theta_threshold_radians * 2,
                np.finfo(np.float32).max,
            ],
            dtype=np.float32,
        )

        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)

        self.render_mode = render_mode
        self.renderer = Renderer(self.render_mode, self._render)

        self.screen_width = 600
        self.screen_height = 400
        self.screen = None
        self.clock = None
        self.isopen = True
        self.state = None

        self.steps_beyond_terminated = None

    def step(self, action):
        err_msg = f"{action!r} ({type(action)}) invalid"
        assert self.action_space.contains(action), err_msg
        assert self.state is not None, "Call reset before using step method."
        x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action == 1 else -self.force_mag
        costheta = math.cos(theta)
        sintheta = math.sin(theta)

        # For the interested reader:
        # https://coneural.org/florian/papers/05_cart_pole.pdf
        temp = (
            force + self.polemass_length * theta_dot**2 * sintheta
        ) / self.total_mass
        thetaacc = (self.gravity * sintheta - costheta * temp) / (
            self.length * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
        )
        xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass

        if self.kinematics_integrator == "euler":
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot

        self.state = (x, x_dot, theta, theta_dot)

        terminated = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold_radians
            or theta > self.theta_threshold_radians
        )

        if not terminated:
            reward = 1.0
        elif self.steps_beyond_terminated is None:
            # Pole just fell!
            self.steps_beyond_terminated = 0
            reward = 1.0
        else:
            if self.steps_beyond_terminated == 0:
                logger.warn(
                    "You are calling 'step()' even though this "
                    "environment has already returned terminated = True. You "
                    "should always call 'reset()' once you receive 'terminated = "
                    "True' -- any further steps are undefined behavior."
                )
            self.steps_beyond_terminated += 1
            reward = 0.0

        self.renderer.render_step()
        return np.array(self.state, dtype=np.float32), reward, terminated, False, {}

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        return_info: bool = False,
        options: Optional[dict] = None,
    ):
        super().reset(seed=seed)
        # Note that if you use custom reset bounds, it may lead to out-of-bound
        # state/observations.
        low, high = utils.maybe_parse_reset_bounds(
            options, -0.05, 0.05  # default low
        )  # default high
        self.state = self.np_random.uniform(low=low, high=high, size=(4,))
        self.steps_beyond_terminated = None
        self.renderer.reset()
        self.renderer.render_step()
        if not return_info:
            return np.array(self.state, dtype=np.float32)
        else:
            return np.array(self.state, dtype=np.float32), {}

    def render(self, mode="human"):
        if self.render_mode is not None:
            return self.renderer.get_renders()
        else:
            return self._render(mode)

    def _render(self, mode="human"):
        assert mode in self.metadata["render_modes"]
        try:
            import pygame
            from pygame import gfxdraw
        except ImportError:
            raise DependencyNotInstalled(
                "pygame is not installed, run `pip install gym[classic_control]`"
            )

        if self.screen is None:
            pygame.init()
            if mode == "human":
                pygame.display.init()
                self.screen = pygame.display.set_mode(
                    (self.screen_width, self.screen_height)
                )
            else:  # mode in {"rgb_array", "single_rgb_array"}
                self.screen = pygame.Surface((self.screen_width, self.screen_height))
        if self.clock is None:
            self.clock = pygame.time.Clock()

        world_width = self.x_threshold * 2
        scale = self.screen_width / world_width
        polewidth = 10.0
        polelen = scale * (2 * self.length)
        cartwidth = 50.0
        cartheight = 30.0

        if self.state is None:
            return None

        x = self.state

        self.surf = pygame.Surface((self.screen_width, self.screen_height))
        self.surf.fill((255, 255, 255))

        l, r, t, b = -cartwidth / 2, cartwidth / 2, cartheight / 2, -cartheight / 2
        axleoffset = cartheight / 4.0
        cartx = x[0] * scale + self.screen_width / 2.0  # MIDDLE OF CART
        carty = 100  # TOP OF CART
        cart_coords = [(l, b), (l, t), (r, t), (r, b)]
        cart_coords = [(c[0] + cartx, c[1] + carty) for c in cart_coords]
        gfxdraw.aapolygon(self.surf, cart_coords, (0, 0, 0))
        gfxdraw.filled_polygon(self.surf, cart_coords, (0, 0, 0))

        l, r, t, b = (
            -polewidth / 2,
            polewidth / 2,
            polelen - polewidth / 2,
            -polewidth / 2,
        )

        pole_coords = []
        for coord in [(l, b), (l, t), (r, t), (r, b)]:
            coord = pygame.math.Vector2(coord).rotate_rad(-x[2])
            coord = (coord[0] + cartx, coord[1] + carty + axleoffset)
            pole_coords.append(coord)
        gfxdraw.aapolygon(self.surf, pole_coords, (202, 152, 101))
        gfxdraw.filled_polygon(self.surf, pole_coords, (202, 152, 101))

        gfxdraw.aacircle(
            self.surf,
            int(cartx),
            int(carty + axleoffset),
            int(polewidth / 2),
            (129, 132, 203),
        )
        gfxdraw.filled_circle(
            self.surf,
            int(cartx),
            int(carty + axleoffset),
            int(polewidth / 2),
            (129, 132, 203),
        )

        gfxdraw.hline(self.surf, 0, self.screen_width, carty, (0, 0, 0))

        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))
        if mode == "human":
            pygame.event.pump()
            self.clock.tick(self.metadata["render_fps"])
            pygame.display.flip()

        elif mode in {"rgb_array", "single_rgb_array"}:
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
            )

    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False

2.3 Source Code Walkthrough

Since the goal here is building your own environment, the rendering code is of no concern; the focus is on the functions in the source that take part in the environment interaction.

(1) __init__()

The constructor mainly defines constants that feed the physics model and the threshold checks.
【Key code with comments】

self.gravity = 9.8  # gravitational acceleration g
self.masscart = 1.0  # mass of the cart
self.masspole = 0.1  # mass of the pole
self.total_mass = self.masspole + self.masscart  # 0.1 + 1.0 = 1.1, combined mass of cart and pole
self.length = 0.5  # actually half the pole's length; used to compute the pole height
self.polemass_length = self.masspole * self.length  # 0.1 * 0.5 = 0.05, used in the acceleration equations
self.force_mag = 10.0  # magnitude of the force applied to the cart by each action
self.tau = 0.02  # seconds between state updates
self.kinematics_integrator = "euler"  # integrator used to compute the next state

# Angle at which to fail the episode
self.theta_threshold_radians = 12 * 2 * math.pi / 360  # 12 degrees in radians
self.x_threshold = 2.4
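
As a worked check of the angle constant:

$$
12 \times \frac{2\pi}{360} = 12^\circ \times \frac{\pi}{180^\circ} \approx 0.2095\ \text{rad}
$$

so theta_threshold_radians is simply 12° expressed in radians, matching the ±12° failure condition from Section 1.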

(2) reset()

【Key code】

        low, high = utils.maybe_parse_reset_bounds(
            options, -0.05, 0.05  # default low
        )  # default high
        self.state = self.np_random.uniform(low=low, high=high, size=(4,))
        self.steps_beyond_terminated = None

Determine the bounds for the 4 state components, defaulting to (-0.05, 0.05);
Draw the 4 state components uniformly at random within those bounds;
Reset the steps_beyond_terminated attribute to None. This attribute tracks how many steps were taken after the environment terminated; it is a safety measure against continuing to step an already-terminated environment.
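
The options argument can override these defaults. A minimal sketch, assuming the "low"/"high" keys that utils.maybe_parse_reset_bounds parses in the source above, and the gym version shown (where reset returns only the observation by default):

import gym

env = gym.make("CartPole-v1")

# Default bounds: every state component drawn uniformly from (-0.05, 0.05)
obs = env.reset(seed=42)

# Custom bounds via `options`; note the source's warning that wide
# bounds can yield out-of-bound observations
obs = env.reset(options={"low": -0.1, "high": 0.1})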

(3) step(self, action)

① Read the current state and map the action to a force

        x, x_dot, theta, theta_dot = self.state
        force = self.force_mag if action == 1 else -self.force_mag

② Compute the pole's angular acceleration thetaacc and the cart's acceleration xacc from the current angle and angular velocity

        costheta = math.cos(theta)
        sintheta = math.sin(theta)

        temp = (
            force + self.polemass_length * theta_dot**2 * sintheta
        ) / self.total_mass
        thetaacc = (self.gravity * sintheta - costheta * temp) / (
            self.length * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
        )
        xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass
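
These lines implement the standard cart-pole dynamics (see the Florian paper linked in the source). Writing $F$ for the applied force, $m_c$ and $m_p$ for the cart and pole masses, and $\ell$ for the half pole length (self.length), the code computes:

$$
\begin{aligned}
\text{temp} &= \frac{F + m_p \ell\, \dot\theta^{2} \sin\theta}{m_c + m_p},\\
\ddot\theta &= \frac{g\sin\theta - \text{temp}\cos\theta}{\ell\left(\frac{4}{3} - \frac{m_p\cos^{2}\theta}{m_c + m_p}\right)},\qquad
\ddot x = \text{temp} - \frac{m_p \ell\, \ddot\theta \cos\theta}{m_c + m_p}.
\end{aligned}
$$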

③ Update the 4 state components with the configured integrator (euler)

        if self.kinematics_integrator == "euler":
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot
            
        self.state = (x, x_dot, theta, theta_dot)
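
The "euler" branch is explicit Euler: positions advance with the old velocities, then velocities advance with the new accelerations. The semi-implicit branch updates velocities first and uses the updated values for the positions, which is generally more stable. With step size $\tau = 0.02$, the explicit update is:

$$
x_{t+1} = x_t + \tau \dot x_t,\quad \dot x_{t+1} = \dot x_t + \tau \ddot x_t,\quad
\theta_{t+1} = \theta_t + \tau \dot\theta_t,\quad \dot\theta_{t+1} = \dot\theta_t + \tau \ddot\theta_t.
$$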

④ Check whether the cart position or the pole angle has exceeded its threshold, determining whether a terminal state has been reached, and assign the reward accordingly

        terminated = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold_radians
            or theta > self.theta_threshold_radians
        )

        if not terminated:
            reward = 1.0
        elif self.steps_beyond_terminated is None:
            # Pole just fell!
            self.steps_beyond_terminated = 0
            reward = 1.0
        else:
            if self.steps_beyond_terminated == 0:
                logger.warn(
                    "You are calling 'step()' even though this "
                    "environment has already returned terminated = True. You "
                    "should always call 'reset()' once you receive 'terminated = "
                    "True' -- any further steps are undefined behavior."
                )
            self.steps_beyond_terminated += 1
            reward = 0.0

⑤ Return the result of the single interaction step

return np.array(self.state, dtype=np.float32), reward, terminated, False, {}
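
Putting the pieces together, the caller unpacks this 5-tuple on every step. A minimal random-policy rollout, as a sketch (instantiating CartPoleEnv directly, so no TimeLimit wrapper is applied and truncated stays False; through gym.make, the wrapper would truncate the episode at 500 steps):

from gym.envs.classic_control.cartpole import CartPoleEnv

env = CartPoleEnv()
obs = env.reset(seed=0)  # this version of reset returns only the observation

total_reward = 0.0
terminated = False
while not terminated:
    action = env.action_space.sample()   # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

print(total_reward)  # episode return = number of frames survived
env.close()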

3. Summary

The CartPole_v1 source is fairly simple and easy to follow. Still, with the goal of building your own environment in mind, a short summary is worthwhile; before starting, I suggest the following preparation:

  1. Pin down the algorithm framework: start from an example program found online, get an algorithm (e.g. A2C or PPO) running on a simple environment such as CartPole, then use that algorithm to identify which parts interact with the environment. (If you only have data and no environment to interact with, use offline RL instead.)
  2. Model the problem as an MDP: cast the experimental process as a Markov decision process, defining the state space, action space, reward design, and terminal states. If the problem turns out not to have the sequential-decision property, consider methods other than RL.
  3. Determine the computation model (most important): in CartPole, the key interaction function is env.step(action), which updates the next state from the current state via the euler integrator; the crux is the model that computes the cart's acceleration and the pole's angular acceleration. Environment interaction is essentially "update the state and collect the immediate reward", so the most important task is pinning down the computational model your experiment uses to update the state.
  4. Design the interaction functions: with the above in place, you can build your own experimental environment on a simple "environment skeleton" such as CartPole, as sketched below.
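
As a concrete starting point, below is a minimal custom-environment skeleton that mirrors the structure of CartPoleEnv. Everything in it (the class name, the 2-dimensional state, the placeholder dynamics, the thresholds) is a hypothetical example to be replaced by your own computation model from step 3:

import numpy as np
import gym
from gym import spaces


class MyCustomEnv(gym.Env):
    """Minimal skeleton mirroring CartPoleEnv (hypothetical example)."""

    def __init__(self):
        self.tau = 0.02  # seconds between state updates
        high = np.array([1.0, 1.0], dtype=np.float32)  # example state bounds
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.state = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(2,))
        return np.array(self.state, dtype=np.float32)

    def step(self, action):
        assert self.action_space.contains(action)
        pos, vel = self.state
        # >>> Your computation model goes here: how the action updates the state
        acc = 1.0 if action == 1 else -1.0   # placeholder dynamics
        pos = pos + self.tau * vel           # explicit Euler, as in CartPole
        vel = vel + self.tau * acc
        self.state = (pos, vel)

        terminated = bool(abs(pos) > 1.0)    # your termination condition
        reward = 0.0 if terminated else 1.0  # your reward design
        return np.array(self.state, dtype=np.float32), reward, terminated, False, {}

Filling in the placeholder dynamics is exactly the "computation model" of step 3; once that is correct, any algorithm that runs on CartPole can be pointed at this class.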