Classic Reinforcement Learning Algorithm Notes (21): Notes on the gym-super-mario-bros Game Environment

License: CC 4.0 BY-SA

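These notes describe how to use gym-super-mario-bros to create Super Mario game environments for reinforcement learning experiments. They cover installation, demos, single-stage environments, random stage selection, and the reward function, and they list the game state information returned in the info dictionary, including lives, score, world, and stage.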

Notes on the gym-super-mario-bros game environment

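I have recently been reading papers on intrinsic reward models, and super-mario-bros is practically a standard environment for benchmarking such algorithms. Until now I had focused mostly on Atari, so I am writing these notes for future reference.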

Introduction

Project page: https://pypi.org/project/gym-super-mario-bros/

Installation

pip install nes-py
pip install gym-super-mario-bros

Install it on Ubuntu; installation did not work for me on Windows.

Demo

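An episode appears to end under two conditions: all 3 lives are lost, or the clock runs out. In practice, you should also set a maximum episode length.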

Gym demo

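from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# Create the full game and restrict the controller to the SIMPLE_MOVEMENT action set
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)

done = True
for step in range(5000):
    # Reset at the start and whenever an episode ends
    if done:
        state = env.reset()
    # Take a random action and render the resulting frame
    state, reward, done, info = env.step(env.action_space.sample())
    env.render()

env.close()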

Command-line demo

gym_super_mario_bros -e <the environment ID to play> -m <`human` or `random`>

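The default for -e is SuperMarioBros-v0, and the default for -m is human.
In human mode, the A and D keys move left and right, O jumps, and holding O gives a higher jump.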

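Environments

The agent has 3 lives to play through the 32 stages. The emulator strips out all frames that are unrelated to the agent's actions, such as cutscenes.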

Environment	Game	ROM
SuperMarioBros-v0	SMB	standard
SuperMarioBros-v1	SMB	downsample
SuperMarioBros-v2	SMB	pixel
SuperMarioBros-v3	SMB	rectangle
SuperMarioBros2-v0	SMB2	standard
SuperMarioBros2-v1	SMB2	downsample

Individual stages

You can also train on a single stage, using an environment ID of this form:

SuperMarioBros-<world>-<stage>-v<version>

<world> is a number in {1, 2, 3, 4, 5, 6, 7, 8} indicating the world
<stage> is a number in {1, 2, 3, 4} indicating the stage within a world
<version> is a number in {0, 1, 2, 3} specifying the ROM mode to use
- 0: standard ROM
- 1: downsampled ROM
- 2: pixel ROM
- 3: rectangle ROM

For example, to play 4-2 on the downsampled ROM, you would use the environment id SuperMarioBros-4-2-v1.
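The naming scheme above can be sketched as a small helper that validates its arguments and builds the ID string. `make_env_id` is a hypothetical name used for illustration, not part of gym-super-mario-bros; the resulting string would be passed to `gym_super_mario_bros.make`.

```python
def make_env_id(world, stage, version=0):
    """Build a single-stage environment ID such as 'SuperMarioBros-4-2-v1'.

    Hypothetical helper: it only formats the string described in the text.
    """
    if world not in range(1, 9):
        raise ValueError("world must be in 1..8")
    if stage not in range(1, 5):
        raise ValueError("stage must be in 1..4")
    if version not in range(0, 4):
        raise ValueError("version must be in 0..3")
    return f"SuperMarioBros-{world}-{stage}-v{version}"


# Stage 4-2 on the downsampled ROM, matching the example in the text
print(make_env_id(4, 2, 1))
```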

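Random stage selection

This mode picks a stage at random and gives the agent a single life; after the agent dies and the environment is reset, another stage is picked at random. It only works for SMB, not for SMB2.
Example environment ID: SuperMarioBrosRandomStages-v0.
To set the seed, call env.seed(1) before calling reset.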

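Reward function

The reward function assumes that the objective of the game is to move as far right as possible (increasing the agent's x value), as fast as possible, without dying. To model this, the reward is composed of three separate variables: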

v: the difference in agent x values between states
- in this case this is instantaneous velocity for the given step
- v = x1 - x0
  - x0 is the x position before the step
  - x1 is the x position after the step
- moving right ⇔ v > 0
- moving left ⇔ v < 0
- not moving ⇔ v = 0
c: the difference in the game clock between frames
- the penalty prevents the agent from standing still
- c = c1 - c0
  - c0 is the clock reading before the step
  - c1 is the clock reading after the step
- no clock tick ⇔ c = 0
- clock tick ⇔ c < 0
d: a death penalty that penalizes the agent for dying in a state
- this penalty encourages the agent to avoid death
- alive ⇔ d = 0
- dead ⇔ d = -15

r = v + c + d
The reward is clipped into the range (-15, 15).
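The three terms combine as in the following minimal sketch. The function name and parameters are hypothetical and not part of the library; the sketch assumes the clock counts down, so that a tick makes c negative as stated above, and it treats the clip bounds as inclusive.

```python
def mario_reward(x_before, x_after, clock_before, clock_after, died):
    """Compute r = v + c + d and clip it to [-15, 15].

    Hypothetical re-implementation for illustration only.
    """
    v = x_after - x_before          # positive when moving right
    c = clock_after - clock_before  # assumption: countdown clock, so a tick gives c < 0
    d = -15 if died else 0          # death penalty
    r = v + c + d
    return max(-15, min(15, r))


print(mario_reward(40, 43, 400, 400, died=False))  # moved right, no clock tick
print(mario_reward(40, 40, 400, 399, died=False))  # stood still during a clock tick
print(mario_reward(40, 38, 400, 399, died=True))   # moved left, tick, and died
```

Under these assumptions, standing still costs reward only on steps where the clock ticks, and a death dominates everything else in that step because of the clip.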

Interpreting the info dictionary

The info dictionary returned by the step method contains the following keys:

Key	Type	Description
coins	int	The number of collected coins
flag_get	bool	True if Mario reached a flag or ax
life	int	The number of lives left, i.e., {3, 2, 1}
score	int	The cumulative in-game score
stage	int	The current stage, i.e., {1, …, 4}
status	str	Mario's status, i.e., {'small', 'tall', 'fireball'}
time	int	The time left on the clock
world	int	The current world, i.e., {1, …, 8}
x_pos	int	Mario’s x position in the stage (from the left)
y_pos	int	Mario’s y position in the stage (from the bottom)
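As a rough illustration of how these keys might be used during training, the sketch below formats a progress line and detects a lost life by comparing two consecutive info dictionaries. The helper names and the sample values are made up for the example; only the keys come from the table above.

```python
def describe_progress(info):
    """Return a short progress string such as '1-1 x=40 time=400'."""
    return f"{info['world']}-{info['stage']} x={info['x_pos']} time={info['time']}"


def lost_life(prev_info, info):
    """True if the life counter dropped between two consecutive steps."""
    return info["life"] < prev_info["life"]


# Made-up sample values using the keys from the table above
before = {"world": 1, "stage": 1, "x_pos": 40, "time": 400, "life": 3, "flag_get": False}
after = {"world": 1, "stage": 1, "x_pos": 40, "time": 398, "life": 2, "flag_get": False}

print(describe_progress(before))
print(lost_life(before, after))
```

A check like `lost_life` is one option for ending an episode on the first death instead of waiting for all lives to run out.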