Notes on the gym-super-mario-bros game environment
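I have recently been reading papers on intrinsic reward models, where super-mario-bros is practically a standard environment for benchmarking algorithms. Until now I had focused mostly on Atari, so I am writing these notes for future reference.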

Introduction
Project page: https://pypi.org/project/gym-super-mario-bros/
Installation
pip install nes-py
pip install gym-super-mario-bros
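I installed it on Ubuntu; it did not work for me on Windows.
Demo
There appear to be two conditions that end the game: all 3 lives are lost, or time runs out. In practice you should also set a maximum exploration length.
Gym demo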
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# Create the environment and restrict the action space to SIMPLE_MOVEMENT
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)

# Take random actions for 5000 steps, resetting whenever an episode ends
done = True
for step in range(5000):
    if done:
        state = env.reset()
    state, reward, done, info = env.step(env.action_space.sample())
    env.render()

env.close()
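The maximum exploration length suggested in the Demo notes can be enforced with a plain step counter. The sketch below is one possible way to do it and is not part of gym-super-mario-bros: `run_capped_episode` and `CountdownEnv` are hypothetical names, and the stub environment only imitates the `reset()`/`step()` interface (4-tuple return) used in the demo so that the example runs without the emulator.

```python
class CountdownEnv:
    """Hypothetical stand-in for a Gym env; the episode ends after `lifetime` steps."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # dummy observation

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.lifetime
        # Same 4-tuple layout as the demo: state, reward, done, info
        return self.steps, 1.0, done, {"x_pos": self.steps}


def run_capped_episode(env, policy, max_steps):
    """Run one episode, stopping at `done` or after `max_steps` steps."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    done = False
    while not done and steps < max_steps:
        state, reward, done, info = env.step(policy(state))
        total_reward += reward
        steps += 1
    # True when the step cap, not the environment, ended the episode
    truncated = not done
    return total_reward, steps, truncated


if __name__ == "__main__":
    env = CountdownEnv(lifetime=1000)
    print(run_capped_episode(env, lambda state: 0, max_steps=200))
```

With the real environment, the same function should accept the `JoypadSpace`-wrapped `env` from the demo and a policy such as `lambda state: env.action_space.sample()`.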
Command-line demo
gym_super_mario_bros -e <the environment ID to play> -m <`human` or `random`>
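-e defaults to SuperMarioBros-v0, and -m defaults to human.
In human mode, the A and D keys move left and right, O jumps, and holding O gives a higher jump.
Environments
The agent gets 3 lives to play through the 32 stages. The emulator strips out every frame that is unrelated to the agent's actions, such as cutscenes.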
| Environment | Game | ROM |
|---|---|---|
| SuperMarioBros-v0 | SMB | standard |
| SuperMarioBros-v1 | SMB | downsample |
| SuperMarioBros-v2 | SMB | pixel |
| SuperMarioBros-v3 | SMB | rectangle |
| SuperMarioBros2-v0 | SMB2 | standard |
| SuperMarioBros2-v1 | SMB2 | downsample |
Individual stages
You can also train on a single stage by writing the environment ID like this:
SuperMarioBros-<world>-<stage>-v<version>
- <world> is a number in {1, 2, 3, 4, 5, 6, 7, 8} indicating the world
- <stage> is a number in {1, 2, 3, 4} indicating the stage within a world
- <version> is a number in {0, 1, 2, 3} specifying the ROM mode to use
  - 0: standard ROM
  - 1: downsampled ROM
  - 2: pixel ROM
  - 3: rectangle ROM
For example, to play 4-2 on the downsampled ROM, you would use the environment id SuperMarioBros-4-2-v1.
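When sweeping over many stages it can be convenient to build these IDs programmatically. The helper below is a small sketch; `stage_env_id` is a hypothetical name, not a function provided by the library, and it only formats and range-checks the string according to the pattern described above.

```python
def stage_env_id(world, stage, version=0):
    """Build an ID of the form SuperMarioBros-<world>-<stage>-v<version>."""
    if world not in range(1, 9):
        raise ValueError(f"world must be in 1..8, got {world}")
    if stage not in range(1, 5):
        raise ValueError(f"stage must be in 1..4, got {stage}")
    if version not in range(0, 4):
        raise ValueError(f"version must be in 0..3, got {version}")
    return f"SuperMarioBros-{world}-{stage}-v{version}"


print(stage_env_id(4, 2, 1))  # SuperMarioBros-4-2-v1
```

The resulting string can be passed to `gym_super_mario_bros.make(...)` in the same way as `'SuperMarioBros-v0'` in the demo.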
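Random stage selection
This mode picks a stage at random and gives the agent a single life; after the agent dies and the environment is reset, another stage is picked at random. It works only for SMB, not for SMB2.
Example environment ID: SuperMarioBrosRandomStages-v0.
To set the seed, call env.seed(1) before calling reset.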
奖励函数
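Reward function
The reward function assumes that the objective of the game is to move as far right as possible (increasing the agent's x value), as fast as possible, without dying. To model this, the reward is composed of three separate variables: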
- v: the difference in agent x values between states
  - in this case this is instantaneous velocity for the given step
  - v = x1 - x0
    - x0 is the x position before the step
    - x1 is the x position after the step
  - moving right ⇔ v > 0
  - moving left ⇔ v < 0
  - not moving ⇔ v = 0
- c: the difference in the game clock between frames
  - the penalty prevents the agent from standing still
  - c = c1 - c0 (the clock counts down, so a tick makes c negative)
    - c0 is the clock reading before the step
    - c1 is the clock reading after the step
  - no clock tick ⇔ c = 0
  - clock tick ⇔ c < 0
- d: a death penalty that penalizes the agent for dying in a state
  - this penalty encourages the agent to avoid death
  - alive ⇔ d = 0
  - dead ⇔ d = -15
r = v + c + d
The reward is clipped into the range (-15, 15).
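To make the arithmetic concrete, here is a minimal sketch of the reward as described above. It is not the library's code: `mario_reward` is a hypothetical name, the clock term is computed as the reading after the step minus the reading before it (so that a tick contributes a negative value, as the list states), and the clipping bounds are assumed to be inclusive.

```python
def mario_reward(x0, x1, c0, c1, dead):
    """Reward r = v + c + d, clipped to [-15, 15] (bounds assumed inclusive)."""
    v = x1 - x0               # > 0 when moving right, < 0 when moving left
    c = c1 - c0               # clock counts down, so a tick gives c < 0
    d = -15 if dead else 0    # death penalty
    r = v + c + d
    return max(-15, min(15, r))


print(mario_reward(40, 43, 400, 400, dead=False))   # 3: moved right, no tick
print(mario_reward(40, 40, 400, 399, dead=False))   # -1: stood still, clock ticked
print(mario_reward(40, 41, 400, 400, dead=True))    # -14: 1 + 0 - 15
print(mario_reward(0, 100, 400, 400, dead=False))   # 15: clipped from 100
```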
Interpreting the info dictionary
The info dictionary returned by the step method contains the following keys:
| Key | Type | Description |
|---|---|---|
| coins | int | The number of collected coins |
| flag_get | bool | True if Mario reached a flag or ax |
| life | int | The number of lives left, i.e., {3, 2, 1} |
| score | int | The cumulative in-game score |
| stage | int | The current stage, i.e., {1, …, 4} |
| status | str | Mario's status, i.e., {'small', 'tall', 'fireball'} |
| time | int | The time left on the clock |
| world | int | The current world, i.e., {1, …, 8} |
| x_pos | int | Mario’s x position in the stage (from the left) |
| y_pos | int | Mario’s y position in the stage (from the bottom) |
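These keys are handy for logging progress independently of the reward. As an illustration, the sketch below records the farthest `x_pos` reached in each (world, stage) pair from a sequence of info dictionaries. `farthest_x_by_stage` is a hypothetical helper, and the sample dictionaries are made up and contain only the keys the helper reads.

```python
def farthest_x_by_stage(infos):
    """Return {(world, stage): max x_pos} over a sequence of info dicts."""
    best = {}
    for info in infos:
        key = (info["world"], info["stage"])
        best[key] = max(best.get(key, 0), info["x_pos"])
    return best


# Made-up sample data shaped like the info dictionary described above
sample_infos = [
    {"world": 1, "stage": 1, "x_pos": 40},
    {"world": 1, "stage": 1, "x_pos": 120},
    {"world": 1, "stage": 2, "x_pos": 35},
]
print(farthest_x_by_stage(sample_infos))  # {(1, 1): 120, (1, 2): 35}
```

In a training loop, the `info` returned by each `env.step(...)` call could be appended to a list and passed to this helper at the end of an episode.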






