Understand the classic reinforcement-learning environment CartPole_v1, as a foundation for building your own experimental interaction environment.
1. Environment Introduction
1.1 Environment Overview
CartPole is the classic cart-pole game. The environment contains a cart with a pole standing upright on it, and each reset starts from a slightly different initial state. The cart must move left and right to keep the pole upright; the episode continues only while both of the following hold:
- the pole's tilt angle stays within [-12°, 12°];
- the cart's position stays within [-2.4, 2.4] units of length.
1.2 Environment Parameters
[State]
As a classic RL environment, CartPole_v1 has a state consisting of 4 parameters:
| Observation | Meaning | Min | Max |
|---|---|---|---|
| Cart Position | cart position | -4.8 | 4.8 |
| Cart Velocity | cart velocity | -Inf | Inf |
| Pole Angle | pole angle | -24° | 24° |
| Pole Angular Velocity | pole angular velocity | -Inf | Inf |

(The observation bounds for cart position and pole angle are set to twice the corresponding termination thresholds.)
[Action]
In CartPole_v1 the action space is just {0, 1}, encoding the direction of the applied force; this force feeds into both the pole's angular acceleration and the cart's acceleration.

| Action value | Meaning |
|---|---|
| 0 | Negative force (push cart to the left) |
| 1 | Positive force (push cart to the right) |
[Reward]
A reward of +1 is granted for every frame survived. A frame is one interaction with the environment (one env.step(action) call) in which neither the cart position nor the pole angle exceeds its threshold; the maximum cumulative reward is 500.
[Termination]
- Termination: the pole angle exceeds ±12°
- Termination: the cart position exceeds ±2.4 (the center of the cart reaches the edge of the display)
- Truncation: the episode length exceeds 500 (200 for v0)
CartPole_v1 is therefore an environment with a continuous state vector and a discrete action variable.
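To make the spaces and the reward loop concrete, here is a minimal random-rollout sketch. It instantiates the environment class directly (gym wrapper versions differ in whether step() returns 4 or 5 values, so this targets the 5-tuple API of the source listed in Section 2.2) and applies the 500-step truncation by hand:

```python
# A minimal sketch: random rollout against the raw CartPole environment class.
from gym.envs.classic_control.cartpole import CartPoleEnv

env = CartPoleEnv()
print(env.observation_space)  # Box(4,): the four state parameters above
print(env.action_space)       # Discrete(2)

state = env.reset()
total_reward, terminated, steps = 0.0, False, 0
while not terminated and steps < 500:    # 500-step truncation, done by hand
    action = env.action_space.sample()   # random action in {0, 1}
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    steps += 1
print("episode return:", total_reward)
```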
2. Environment Source Code
2.1 Where the Source Lives
(1) Find the cartpole.py file under your gym installation path:
~envs\<your-conda-env-name>\Lib\site-packages\gym\envs\classic_control\cartpole.py
(2) The openai/gym GitHub repository:
https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
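Alternatively, Python can report the installed location directly (a small sketch, assuming gym is importable):

```python
# Print the path of the installed cartpole.py instead of searching manually.
from gym.envs.classic_control import cartpole

print(cartpole.__file__)
```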
2.2 Full Source Code
"""
Classic cart-pole system implemented by Rich Sutton et al.
Copied from http://incompleteideas.net/sutton/book/code/pole.c
permalink: https://perma.cc/C9ZM-652R
"""
import math
from typing import Optional, Union
import numpy as np
import gym
from gym import logger, spaces
from gym.envs.classic_control import utils
from gym.error import DependencyNotInstalled
from gym.utils.renderer import Renderer
class CartPoleEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):
metadata = {
"render_modes": ["human", "rgb_array", "single_rgb_array"],
"render_fps": 50,
}
def __init__(self, render_mode: Optional[str] = None):
self.gravity = 9.8
self.masscart = 1.0
self.masspole = 0.1
self.total_mass = self.masspole + self.masscart
self.length = 0.5 # actually half the pole's length
self.polemass_length = self.masspole * self.length
self.force_mag = 10.0
self.tau = 0.02 # seconds between state updates
self.kinematics_integrator = "euler"
# Angle at which to fail the episode
self.theta_threshold_radians = 12 * 2 * math.pi / 360
self.x_threshold = 2.4
# Angle limit set to 2 * theta_threshold_radians so failing observation
# is still within bounds.
high = np.array(
[
self.x_threshold * 2,
np.finfo(np.float32).max,
self.theta_threshold_radians * 2,
np.finfo(np.float32).max,
],
dtype=np.float32,
)
self.action_space = spaces.Discrete(2)
self.observation_space = spaces.Box(-high, high, dtype=np.float32)
self.render_mode = render_mode
self.renderer = Renderer(self.render_mode, self._render)
self.screen_width = 600
self.screen_height = 400
self.screen = None
self.clock = None
self.isopen = True
self.state = None
self.steps_beyond_terminated = None
def step(self, action):
err_msg = f"{action!r} ({type(action)}) invalid"
assert self.action_space.contains(action), err_msg
assert self.state is not None, "Call reset before using step method."
x, x_dot, theta, theta_dot = self.state
force = self.force_mag if action == 1 else -self.force_mag
costheta = math.cos(theta)
sintheta = math.sin(theta)
# For the interested reader:
# https://coneural.org/florian/papers/05_cart_pole.pdf
temp = (
force + self.polemass_length * theta_dot**2 * sintheta
) / self.total_mass
thetaacc = (self.gravity * sintheta - costheta * temp) / (
self.length * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
)
xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass
if self.kinematics_integrator == "euler":
x = x + self.tau * x_dot
x_dot = x_dot + self.tau * xacc
theta = theta + self.tau * theta_dot
theta_dot = theta_dot + self.tau * thetaacc
else: # semi-implicit euler
x_dot = x_dot + self.tau * xacc
x = x + self.tau * x_dot
theta_dot = theta_dot + self.tau * thetaacc
theta = theta + self.tau * theta_dot
self.state = (x, x_dot, theta, theta_dot)
terminated = bool(
x < -self.x_threshold
or x > self.x_threshold
or theta < -self.theta_threshold_radians
or theta > self.theta_threshold_radians
)
if not terminated:
reward = 1.0
elif self.steps_beyond_terminated is None:
# Pole just fell!
self.steps_beyond_terminated = 0
reward = 1.0
else:
if self.steps_beyond_terminated == 0:
logger.warn(
"You are calling 'step()' even though this "
"environment has already returned terminated = True. You "
"should always call 'reset()' once you receive 'terminated = "
"True' -- any further steps are undefined behavior."
)
self.steps_beyond_terminated += 1
reward = 0.0
self.renderer.render_step()
return np.array(self.state, dtype=np.float32), reward, terminated, False, {}
def reset(
self,
*,
seed: Optional[int] = None,
return_info: bool = False,
options: Optional[dict] = None,
):
super().reset(seed=seed)
# Note that if you use custom reset bounds, it may lead to out-of-bound
# state/observations.
low, high = utils.maybe_parse_reset_bounds(
options, -0.05, 0.05 # default low
) # default high
self.state = self.np_random.uniform(low=low, high=high, size=(4,))
self.steps_beyond_terminated = None
self.renderer.reset()
self.renderer.render_step()
if not return_info:
return np.array(self.state, dtype=np.float32)
else:
return np.array(self.state, dtype=np.float32), {}
def render(self, mode="human"):
if self.render_mode is not None:
return self.renderer.get_renders()
else:
return self._render(mode)
def _render(self, mode="human"):
assert mode in self.metadata["render_modes"]
try:
import pygame
from pygame import gfxdraw
except ImportError:
raise DependencyNotInstalled(
"pygame is not installed, run `pip install gym[classic_control]`"
)
if self.screen is None:
pygame.init()
if mode == "human":
pygame.display.init()
self.screen = pygame.display.set_mode(
(self.screen_width, self.screen_height)
)
else: # mode in {"rgb_array", "single_rgb_array"}
self.screen = pygame.Surface((self.screen_width, self.screen_height))
if self.clock is None:
self.clock = pygame.time.Clock()
world_width = self.x_threshold * 2
scale = self.screen_width / world_width
polewidth = 10.0
polelen = scale * (2 * self.length)
cartwidth = 50.0
cartheight = 30.0
if self.state is None:
return None
x = self.state
self.surf = pygame.Surface((self.screen_width, self.screen_height))
self.surf.fill((255, 255, 255))
l, r, t, b = -cartwidth / 2, cartwidth / 2, cartheight / 2, -cartheight / 2
axleoffset = cartheight / 4.0
cartx = x[0] * scale + self.screen_width / 2.0 # MIDDLE OF CART
carty = 100 # TOP OF CART
cart_coords = [(l, b), (l, t), (r, t), (r, b)]
cart_coords = [(c[0] + cartx, c[1] + carty) for c in cart_coords]
gfxdraw.aapolygon(self.surf, cart_coords, (0, 0, 0))
gfxdraw.filled_polygon(self.surf, cart_coords, (0, 0, 0))
l, r, t, b = (
-polewidth / 2,
polewidth / 2,
polelen - polewidth / 2,
-polewidth / 2,
)
pole_coords = []
for coord in [(l, b), (l, t), (r, t), (r, b)]:
coord = pygame.math.Vector2(coord).rotate_rad(-x[2])
coord = (coord[0] + cartx, coord[1] + carty + axleoffset)
pole_coords.append(coord)
gfxdraw.aapolygon(self.surf, pole_coords, (202, 152, 101))
gfxdraw.filled_polygon(self.surf, pole_coords, (202, 152, 101))
gfxdraw.aacircle(
self.surf,
int(cartx),
int(carty + axleoffset),
int(polewidth / 2),
(129, 132, 203),
)
gfxdraw.filled_circle(
self.surf,
int(cartx),
int(carty + axleoffset),
int(polewidth / 2),
(129, 132, 203),
)
gfxdraw.hline(self.surf, 0, self.screen_width, carty, (0, 0, 0))
self.surf = pygame.transform.flip(self.surf, False, True)
self.screen.blit(self.surf, (0, 0))
if mode == "human":
pygame.event.pump()
self.clock.tick(self.metadata["render_fps"])
pygame.display.flip()
elif mode in {"rgb_array", "single_rgb_array"}:
return np.transpose(
np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
)
def close(self):
if self.screen is not None:
import pygame
pygame.display.quit()
pygame.quit()
self.isopen = False
2.3 Source Walkthrough
Since the goal is to build a custom environment, the rendering code is of no interest here; the focus is on the functions that take part in environment interaction.
(1) __init__()
The initializer mainly defines constants that feed the physics model and the threshold checks.
[Key code with comments]
```python
self.gravity = 9.8        # gravitational acceleration g
self.masscart = 1.0       # mass of the cart
self.masspole = 0.1       # mass of the pole
self.total_mass = self.masspole + self.masscart  # 0.1 + 1.0 = 1.1, combined mass of cart and pole
self.length = 0.5         # actually half the pole's length; used to compute the rendered pole height
self.polemass_length = self.masspole * self.length  # 0.1 * 0.5 = 0.05, used in the acceleration terms
self.force_mag = 10.0     # magnitude of the force applied to the cart on each action
self.tau = 0.02           # seconds between state updates
self.kinematics_integrator = "euler"  # kinematics integrator used to compute the next state

# Angle at which to fail the episode
self.theta_threshold_radians = 12 * 2 * math.pi / 360  # 12 degrees converted to radians (~0.209)
self.x_threshold = 2.4    # position threshold for failing the episode
```
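As a quick sanity check (a sketch, assuming gym is installed), the thresholds can be read back from a live environment:

```python
import math

import gym

env = gym.make("CartPole-v1")
print(math.degrees(env.unwrapped.theta_threshold_radians))  # ~12.0 degrees
print(env.unwrapped.x_threshold)                            # 2.4
```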
(2) reset()
[Key code]
```python
low, high = utils.maybe_parse_reset_bounds(
    options, -0.05, 0.05  # default low
)  # default high
self.state = self.np_random.uniform(low=low, high=high, size=(4,))
self.steps_beyond_terminated = None
```
- Sets the default lower and upper bounds for all 4 state parameters to (-0.05, 0.05);
- Draws the 4 state parameters uniformly at random within those bounds;
- Resets the steps_beyond_terminated attribute to None. This attribute tracks how many steps have been taken after the environment terminated, as a safety measure against continuing to step an already-terminated environment.
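Since reset() obtains its bounds from options via maybe_parse_reset_bounds, a wider initial-state range can be requested at reset time. A small sketch, assuming this gym version accepts "low"/"high" keys in options:

```python
import gym

env = gym.make("CartPole-v1")
# Request a wider initial-state range than the default (-0.05, 0.05);
# the "low"/"high" keys are parsed by maybe_parse_reset_bounds.
state = env.reset(seed=42, options={"low": -0.1, "high": 0.1})
```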
(3) step(self, action)
① Read the current state and convert the action into a force
```python
x, x_dot, theta, theta_dot = self.state
force = self.force_mag if action == 1 else -self.force_mag
```
② Compute the pole's angular acceleration thetaacc and the cart's acceleration xacc from the current angle and angular velocity (the underlying equations are spelled out after the snippet)
```python
costheta = math.cos(theta)
sintheta = math.sin(theta)
temp = (
    force + self.polemass_length * theta_dot**2 * sintheta
) / self.total_mass
thetaacc = (self.gravity * sintheta - costheta * temp) / (
    self.length * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
)
xacc = temp - self.polemass_length * thetaacc * costheta / self.total_mass
```
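In mathematical form, this snippet implements the frictionless cart-pole dynamics from the paper referenced in the source comment (Florian, 2005). Writing $m_c$ and $m_p$ for the cart and pole masses, $l$ for half the pole length, $F$ for the applied force, and $g$ for gravity:

$$
\ddot{\theta} = \frac{g\sin\theta - \cos\theta \cdot \dfrac{F + m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p}}{l\left(\dfrac{4}{3} - \dfrac{m_p \cos^2\theta}{m_c + m_p}\right)},
\qquad
\ddot{x} = \frac{F + m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p} - \frac{m_p l \ddot{\theta} \cos\theta}{m_c + m_p}
$$

The variable temp in the code is the shared term $\dfrac{F + m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p}$.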
③ Update the four state parameters with the configured integrator (euler)
```python
if self.kinematics_integrator == "euler":
    x = x + self.tau * x_dot
    x_dot = x_dot + self.tau * xacc
    theta = theta + self.tau * theta_dot
    theta_dot = theta_dot + self.tau * thetaacc
else:  # semi-implicit euler
    x_dot = x_dot + self.tau * xacc
    x = x + self.tau * x_dot
    theta_dot = theta_dot + self.tau * thetaacc
    theta = theta + self.tau * theta_dot
self.state = (x, x_dot, theta, theta_dot)
```
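The two branches differ only in update order: explicit Euler advances the position with the old velocity, $x_{t+1} = x_t + \tau \dot{x}_t$ followed by $\dot{x}_{t+1} = \dot{x}_t + \tau \ddot{x}_t$, while semi-implicit Euler first updates the velocity and then uses the new velocity for the position, $\dot{x}_{t+1} = \dot{x}_t + \tau \ddot{x}_t$, $x_{t+1} = x_t + \tau \dot{x}_{t+1}$ (and likewise for $\theta$); the semi-implicit variant is generally the more numerically stable choice.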
④ Check whether the cart position or pole angle exceeds its threshold to decide termination, and assign the reward accordingly
```python
terminated = bool(
    x < -self.x_threshold
    or x > self.x_threshold
    or theta < -self.theta_threshold_radians
    or theta > self.theta_threshold_radians
)
if not terminated:
    reward = 1.0
elif self.steps_beyond_terminated is None:
    # Pole just fell!
    self.steps_beyond_terminated = 0
    reward = 1.0
else:
    if self.steps_beyond_terminated == 0:
        logger.warn(
            "You are calling 'step()' even though this "
            "environment has already returned terminated = True. You "
            "should always call 'reset()' once you receive 'terminated = "
            "True' -- any further steps are undefined behavior."
        )
    self.steps_beyond_terminated += 1
    reward = 0.0
```
⑤ Return the result of the single-step interaction. Note that the hard-coded False is the truncated flag: the 500-step truncation is not handled here but by the TimeLimit wrapper that gym.make adds around the environment.
```python
return np.array(self.state, dtype=np.float32), reward, terminated, False, {}
```
3. Summary
The CartPole_v1 source is fairly simple and easy to follow, but given the goal of building a custom environment, a short summary is still worthwhile. Before starting, I believe the following groundwork should be in place:
- Settle on the algorithm framework: start from example programs available online, get an algorithm (e.g., A2C or PPO) running on a simple environment such as CartPole, and use that algorithm to identify exactly which parts interact with the environment. (If you only have data and no environment interaction is possible, use offline RL instead.)
- Model the problem as an MDP: define the state space, action space, reward design, and terminal states; if the problem turns out to lack the sequential decision-making property, consider methods other than RL.
- Determine the computational model (most important): in CartPole the key interaction function is step(self, action), where the next state is derived from the current state via the euler integrator, and the crux is the model for the cart's acceleration and the pole's angular acceleration. Environment interaction essentially boils down to "update the state and return the immediate reward", so the most critical task is pinning down the computational model that updates the state in your experiment.
- Design the interaction functions: with the above in place, you can build your own experimental environment on top of a simple skeleton such as CartPole, as sketched below.
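A minimal custom-environment skeleton following the CartPole structure might look like this. Everything here (the class name, state bounds, dynamics, and reward) is a hypothetical placeholder to be replaced by your own computational model; it targets the same 5-tuple step API as the source above:

```python
import numpy as np

import gym
from gym import spaces


class MyCustomEnv(gym.Env):
    """A hypothetical skeleton mirroring CartPole's structure; the bounds,
    dynamics, thresholds, and reward below are placeholders."""

    def __init__(self):
        self.tau = 0.02  # integration step, as in CartPole
        high = np.array([1.0, 1.0], dtype=np.float32)  # placeholder bounds
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.state = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Randomize the initial state, as CartPole does.
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(2,))
        return np.array(self.state, dtype=np.float32)

    def step(self, action):
        x, x_dot = self.state
        # The computational model goes here: derive accelerations/rates
        # from the current state and action (placeholder dynamics).
        x_acc = 1.0 if action == 1 else -1.0
        x_dot = x_dot + self.tau * x_acc         # semi-implicit euler update
        x = x + self.tau * x_dot
        self.state = (x, x_dot)
        terminated = bool(abs(x) > 1.0)          # placeholder threshold
        reward = 1.0 if not terminated else 0.0  # placeholder reward
        return np.array(self.state, dtype=np.float32), reward, terminated, False, {}
```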