[ICML 2024] Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Paper: https://arxiv.org/abs/2402.16801
Code: https://github.com/MichaelTMatthews/Craftax
Baselines: https://github.com/MichaelTMatthews/Craftax_Baselines
Summary: an upgraded version of Crafter rewritten in JAX, dramatically accelerating Crafter: 1B environment steps of interaction in about 1 hour.

Personal benchmarks:

Craftax (full version):

  • A100 40GB - 30GB memory used, 3.5h (cluster)
  • 4090 24GB - 18GB memory used, 3.3h (workstation)

Craftax-Classic:

  • A100 40GB - 30GB memory used, 1.9h (cluster)
  • 4090 24GB - 17.5GB memory used, 55min (workstation)

0 Abstract:

Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms. We identify that existing benchmarks used for research into open-ended learning fall into one of two categories. Either they are too slow for meaningful research to be performed without enormous computational resources, like Crafter, NetHack and Minecraft, or they are not complex enough to pose a significant challenge, like Minigrid and Procgen. To remedy this, we first present Craftax-Classic: a ground-up rewrite of Crafter in JAX that runs up to 250x faster than the Python-native original. A run of PPO using 1 billion environment interactions finishes in under an hour using only a single GPU and averages 90% of the optimal reward. To provide a more compelling challenge we present the main Craftax benchmark, a significant extension of the Crafter mechanics with elements inspired from NetHack. Solving Craftax requires deep exploration, long-term planning and memory, as well as continual adaptation to novel situations as more of the world is discovered. We show that existing methods including global and episodic exploration, as well as unsupervised environment design fail to make material progress on the benchmark. We believe that Craftax can for the first time allow researchers to experiment in a complex, open-ended environment with limited computational resources.


1 Introduction

Progress in reinforcement learning algorithms is driven in large part by the development and adoption of suitable benchmarks, e.g.:

  • value-based deep RL algorithms: the Arcade Learning Environment (ALE), supporting around 57 Atari 2600 games spanning many genres and play styles.
  • continuous control:Mujoco
  • multi-agent RL: StarCraft

Towards general agents: environments exhibiting open-ended dynamics (MALMO/Minecraft, NetHack, MiniHack, Crafter) need large-scale computational resources.

Craftax benchmark: a JAX-based environment exhibiting complex, open-ended dynamics and running orders of magnitude faster than comparable environments.

2 Background

Crafter

  • Crafter was proposed to evaluate “strong generalization, deep exploration, and long-term reasoning”.
  • While Crafter has become a popular benchmark, the proposed evaluation allocates algorithms only one million environment interactions, a very limiting constraint when compared to other RL benchmarks.

Open-Ended Learning

Open-Ended Learning refers to a broad category of problems and algorithms that focus on learning in a perpetual and unguided manner.

Two subfields:

  • Exploration through Intrinsic Rewards:
    • max-entropy RL or ε-greedy policies are insufficient
    • one method to overcome these difficulties is through the idea of an intrinsic reward
  • Unsupervised Environments Design:
    • an adversary proposes environment configurations (referred to as levels) for an agent to train on
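To make the ε-greedy insufficiency concrete: undirected dithering assigns vanishing probability to the long, specific action sequences that deep exploration requires. A minimal sketch in plain Python (illustrative only; the 43-action count is Craftax's, the rest is generic):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Probability that pure random dithering executes one specific 20-step
# action sequence in a 43-action space: vanishingly small.
p_sequence = (1 / 43) ** 20
print(f"{p_sequence:.3e}")
```

This is why the subfields above replace (or augment) undirected noise with intrinsic rewards or curricula over environment configurations.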

3 Craftax

3.1 Craftax-Classic: A Reimplementation of Crafter in JAX
  • symbolic observations: run around 10x faster than pixel-based
  • pixel-based observations: add an extra layer of representation learning
3.2 Craftax: An Extension of Crafter with NetHack-Like Mechanics in JAX

a more compelling challenge
Craftax adds a large and diverse range of new game mechanics.

  • Multiple Floors
    • 9 unique procedurally generated floors
    • facilitate generalisation across different procedurally generated worlds but also generalisation of the exploration strategy through time over the learning process
  • Combat
    • damage: physical; fire; ice
    • weapons: use a bow for ranged attacks; read books to learn offensive spells to cast
    • defence: armour
    • posed as an exploration problem: by design, there should not be one fixed optimal strategy
    • in-context learning
  • New Creatures
    • Crafter has 3 creature types; Craftax has 19
  • Potions and Enchantments
    • test in-context learning and memory
  • Attributes
    • is rewarded with an experience point (descend to a new floor), for:
      • dexterity
      • strength
      • intelligence
    • testing the agents long term reasoning capabilities
  • Boss Floor
  • Difficulty
    • more challenging than Crafter, not as hard as NetHack
    • played through a GUI with unlimited time to pause and think - not a real-time strategy game
    • while being ultimately achievable without domain knowledge
3.3 RL Environment Interface
  • Observation Space
    • pixel-based observation: 64x64x3 image for Craftax-Classic; 110x130x3 for Craftax
    • symbolic observation: a flat observation of size 1345 for Craftax-Classic; 8268 for Craftax
  • Action Space
    • 17 for Craftax-Classic; 43 for Craftax
    • every action can be taken at any timestep
  • Reward
    • the agent receives a reward the first time each achievement is completed in each episode
    • achievements: 22 for Craftax-Classic; 64 for Craftax, split into four categories (Basic, Intermediate, Advanced, Very Advanced) worth 1, 3, 5 and 8 reward respectively
    • health shaping: -0.1 for taking damage, +0.1 for recovering
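The reward scheme above can be sketched in a few lines of plain Python (the tier weights 1/3/5/8 and the ±0.1 health shaping follow the notes; the achievement names are hypothetical):

```python
TIER_REWARD = {"basic": 1.0, "intermediate": 3.0, "advanced": 5.0, "very_advanced": 8.0}

class EpisodeReward:
    """Accumulates Craftax-style reward: each achievement pays out only the
    first time it is completed within an episode, plus small health shaping."""

    def __init__(self):
        self.completed = set()

    def step(self, new_achievements, tiers, health_delta=0.0):
        r = 0.0
        for name in new_achievements:
            if name not in self.completed:   # first completion this episode
                self.completed.add(name)
                r += TIER_REWARD[tiers[name]]
        r += 0.1 * health_delta              # -0.1 per damage, +0.1 per recovery
        return r

tiers = {"collect_wood": "basic", "defeat_boss": "very_advanced"}
ep = EpisodeReward()
print(ep.step({"collect_wood"}, tiers))                  # 1.0
print(ep.step({"collect_wood"}, tiers))                  # 0.0 (already completed)
print(ep.step({"defeat_boss"}, tiers, health_delta=-1))  # 8.0 - 0.1 = 7.9
```

The first-time-per-episode rule is what makes the reward a measure of breadth of competence rather than repetition of one easy achievement.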
3.4 Evaluation Framework
  1. Craftax-1B Challenge

    • exploration
    • continual learning
    • long-term planning and reasoning

    At around 5 steps per second (the rate at which the authors played), 1 billion timesteps corresponds to over 6 years of continual human gameplay.
    It took one of the authors (with extensive knowledge of the game mechanics) roughly 5 hours of gameplay to first achieve a 'perfect' run in which every achievement was completed.

    Metric: Reward

  2. Craftax-1M Challenge

    • sample efficiency

    experiments can take only seconds to finish
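The "6 years of human gameplay" figure for the Craftax-1B challenge checks out from the stated numbers (a quick sanity computation):

```python
steps = 1_000_000_000        # Craftax-1B interaction budget
human_sps = 5                # steps/second, the authors' measured play rate
seconds = steps / human_sps
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years")  # ~6.3 years of continual play
```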

4 Experiments

4.1 Exploration Baselines
  • PPO
    • 4 layer MLP of width 512 for both policy and value networks.
    • fully connected networks consistently performed better
  • PPO-RNN

plus intrinsic-reward methods for additional exploration (e.g. ICM, RND, E3B).
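As a reference point for what an intrinsic-reward baseline looks like, here is a minimal count-based novelty bonus in plain Python (a generic sketch, not any of the paper's exact methods, which use learned density or embedding models rather than raw counts):

```python
import math
from collections import defaultdict

class CountBonus:
    """Intrinsic reward r_int = beta / sqrt(N(s)), where N(s) counts visits
    to a (hashable) state abstraction. Novel states pay the largest bonus."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state_key):
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])

cb = CountBonus()
print(cb.bonus("room_1"))  # 1.0    (first visit)
print(cb.bonus("room_1"))  # ~0.707 (second visit, bonus decays)
print(cb.bonus("room_2"))  # 1.0    (novel state)
```

The total reward seen by the agent is then the extrinsic reward plus this bonus, pushing the policy toward under-visited states.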

4.2 UED Baselines

Unsupervised Environment Design

  • PLR
  • ACCEL
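The UED baselines share a common core: keep a buffer of level seeds scored by learning potential (e.g. value-loss) and replay high-scoring ones. A minimal PLR-flavoured sketch in plain Python (illustrative; the real implementations add staleness weighting, and ACCEL additionally mutates levels):

```python
import random

class LevelBuffer:
    """Keep the top-k levels by score; sample replay levels rank-weighted."""

    def __init__(self, capacity=4, rng=None):
        self.capacity = capacity
        self.levels = {}             # seed -> score (e.g. value-loss)
        self.rng = rng or random.Random(0)

    def update(self, seed, score):
        self.levels[seed] = score
        if len(self.levels) > self.capacity:  # evict the lowest-scoring level
            worst = min(self.levels, key=self.levels.get)
            del self.levels[worst]

    def sample(self):
        # Rank-based weights: the highest-scoring level gets the largest weight.
        ranked = sorted(self.levels, key=self.levels.get, reverse=True)
        weights = [1.0 / (i + 1) for i in range(len(ranked))]
        return self.rng.choices(ranked, weights=weights, k=1)[0]

buf = LevelBuffer(capacity=2)
for seed, score in [(1, 0.2), (2, 0.9), (3, 0.5)]:
    buf.update(seed, score)
print(sorted(buf.levels))  # [2, 3] -- seed 1 (lowest score) was evicted
```

The adversary role from Section 2 is played here by the scoring rule: levels where the agent still has high regret or value error are proposed more often.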
4.3 Craftax-1B


  • Figure 3: the returns for the evaluated algorithms on Craftax-1B

The maximum achievable reward is 226, from 67 achievements:

  • Basic: 25
  • Intermediate: 18
  • Advanced: 15
  • Very Advanced: 9

So the maximum reward is 25×1 + 18×3 + 15×5 + 9×8 = 226.
  • Figure 4: the final achievement yields split by difficulty

  1. Interestingly, none of the tested exploration methods improved performance, and E3B in fact significantly reduced the reward.
  2. PPO-RNN makes significantly more progress than the others.
  3. Interestingly, the success rate on some simple achievements like DEFEAT_ZOMBIE decreases over time, with PPO-RNN doing the worst of all.

  • the stronger agents are trading off low-reward achievements in the overworld for high-reward ones underground.

  4. Interestingly, we also see that EAT_PLANT (notable for being perhaps the hardest achievement in the original Crafter) is entirely ignored by PPO-RNN and is actually best achieved by E3B,

  • indicating that while the intrinsic reward may not be helping with overall return, it does incentivise the agent to explore different parts of the environment.

5 Conclusion

We hope that Craftax will facilitate research into areas including exploration, continual learning, generalisation, skill acquisition and long term reasoning.
