[ICML 2024] Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Paper: https://arxiv.org/abs/2402.16801
Code: https://github.com/MichaelTMatthews/Craftax
Baselines: https://github.com/MichaelTMatthews/Craftax_Baselines
Summary: an upgraded version of Crafter rewritten in JAX, dramatically accelerating Crafter: 1B environment steps of interaction in about 1 hour.

Personal benchmarks:

Craftax (full version):

  • A100 40GB - 30GB memory used, 3.5h (cluster)
  • 4090 24GB - 18GB memory used, 3.3h (workstation)

Craftax-Classic:

  • A100 40GB - 30GB memory used, 1.9h (cluster)
  • 4090 24GB - 17.5GB memory used, 55min (workstation)

0 Abstract:

Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms. We identify that existing benchmarks used for research into open-ended learning fall into one of two categories. Either they are too slow for meaningful research to be performed without enormous computational resources, like Crafter, NetHack and Minecraft, or they are not complex enough to pose a significant challenge, like Minigrid and Procgen. To remedy this, we first present Craftax-Classic: a ground-up rewrite of Crafter in JAX that runs up to 250x faster than the Python-native original. A run of PPO using 1 billion environment interactions finishes in under an hour using only a single GPU and averages 90% of the optimal reward. To provide a more compelling challenge we present the main Craftax benchmark, a significant extension of the Crafter mechanics with elements inspired from NetHack. Solving Craftax requires deep exploration, long-term planning and memory, as well as continual adaptation to novel situations as more of the world is discovered. We show that existing methods including global and episodic exploration, as well as unsupervised environment design fail to make material progress on the benchmark. We believe that Craftax can for the first time allow researchers to experiment in a complex, open-ended environment with limited computational resources.


1 Introduction

Progress in reinforcement learning algorithms is driven in large part by the development and adoption of suitable benchmarks, e.g.:

  • value-based deep RL algorithms: the Arcade Learning Environment (ALE), supporting around 57 Atari 2600 games spanning many genres and play styles.
  • continuous control:Mujoco
  • multi-agent RL: StarCraft

Towards general agents: environments exhibiting open-ended dynamics (MALMO/Minecraft, NetHack, MiniHack, Crafter) need large-scale computational resources.

Craftax benchmark: a JAX-based environment exhibiting complex, open-ended dynamics and running orders of magnitude faster than comparable environments.

2 Background

Crafter

  • Crafter was proposed to evaluate “strong generalization, deep exploration, and long-term reasoning”.
  • While Crafter has become a popular benchmark, the proposed evaluation allocates algorithms only one million environment interactions, a very limiting constraint when compared to other RL benchmarks.

Open-Ended Learning

Open-Ended Learning refers to a broad category of problems and algorithms that focus on learning in a perpetual and unguided manner.

Two subfields:

  • Exploration through Intrinsic Rewards:
    • max-entropy RL or ε-greedy policies are insufficient
    • one method to overcome these difficulties is through the idea of an intrinsic reward
  • Unsupervised Environments Design:
    • an adversary proposes environment configurations (referred to as levels) for an agent to train on
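To make the ε-greedy insufficiency concrete: undirected dithering assigns vanishing probability to the long, specific action sequences that deep exploration requires. A minimal sketch in plain Python (illustrative only; the 43-action count is Craftax's, the rest is generic):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Probability that pure random dithering executes one specific 20-step
# action sequence in a 43-action space: vanishingly small.
p_sequence = (1 / 43) ** 20
print(f"{p_sequence:.3e}")
```

This is why the subfields above replace (or augment) undirected noise with intrinsic rewards or curricula over environment configurations.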

3 Craftax

3.1 Craftax-Classic: A Reimplementation of Crafter in JAX
  • symbolic observations: run around 10x faster than pixel-based
  • pixel-based observations: add an extra layer of representation learning
3.2 Craftax: An Extension of Crafter with NetHack-Like Mechanics in JAX

a more compelling challenge
Craftax adds a large and diverse range of new game mechanics.

  • Multiple Floors
    • 9 unique procedurally generated floors
    • facilitate generalisation across different procedurally generated worlds but also generalisation of the exploration strategy through time over the learning process
  • Combat
    • damage: physical; fire; ice
    • weapons: use a bow for ranged attacks; read books to learn offensive spells to cast
    • defence: armour
    • posed as an exploration problem: by design, there should not be one fixed optimal strategy
    • in-context learning
  • New Creatures
    • Crafter has 3 creature types; Craftax has 19
  • Potions and Enchantments
    • test in-context learning and memory
  • Attributes
    • is rewarded with an experience point (descend to a new floor), for:
      • dexterity
      • strength
      • intelligence
    • testing the agents long term reasoning capabilities
  • Boss Floor
  • Difficulty
    • more challenging than Crafter, not as hard as NetHack
    • played through a GUI with unlimited time to pause and think - not a real-time strategy game
    • while being ultimately achievable without domain knowledge
3.3 RL Environment Interface
  • Observation Space
    • pixel-based observation: 64x64x3 image for Craftax-Classic; 110x130x3 for Craftax
    • symbolic observation: a flat observation of size 1345 for Craftax-Classic; 8268 for Craftax
  • Action Space
    • 17 for Craftax-Classic; 43 for Craftax
    • every action can be taken at any timestep
  • Reward
    • the agent receives a reward the first time each achievement is completed in each episode
    • achievements: 22 for Craftax-Classic; 64 for Craftax, split into four categories (Basic, Intermediate, Advanced, Very Advanced) worth 1, 3, 5 and 8 reward respectively
    • health shaping: -0.1 for taking damage, +0.1 for recovering
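The reward scheme above can be sketched in a few lines of plain Python (the tier weights 1/3/5/8 and the ±0.1 health shaping follow the notes; the achievement names are hypothetical):

```python
TIER_REWARD = {"basic": 1.0, "intermediate": 3.0, "advanced": 5.0, "very_advanced": 8.0}

class EpisodeReward:
    """Accumulates Craftax-style reward: each achievement pays out only the
    first time it is completed within an episode, plus small health shaping."""

    def __init__(self):
        self.completed = set()

    def step(self, new_achievements, tiers, health_delta=0.0):
        r = 0.0
        for name in new_achievements:
            if name not in self.completed:   # first completion this episode
                self.completed.add(name)
                r += TIER_REWARD[tiers[name]]
        r += 0.1 * health_delta              # -0.1 per damage, +0.1 per recovery
        return r

tiers = {"collect_wood": "basic", "defeat_boss": "very_advanced"}
ep = EpisodeReward()
print(ep.step({"collect_wood"}, tiers))                  # 1.0
print(ep.step({"collect_wood"}, tiers))                  # 0.0 (already completed)
print(ep.step({"defeat_boss"}, tiers, health_delta=-1))  # 8.0 - 0.1 = 7.9
```

The first-time-per-episode rule is what makes the reward a measure of breadth of competence rather than repetition of one easy achievement.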
3.4 Evaluation Framework
  1. Craftax-1B Challenge

    • exploration
    • continual learning
    • long-term planning and reasoning

    At around 5 steps per second (the rate at which the authors played), 1 billion timesteps corresponds to over 6 years of continual human gameplay.
    It took one of the authors (with extensive knowledge of the game mechanics) roughly 5 hours of gameplay to first achieve a 'perfect' run in which every achievement was completed.

    Metric: Reward

  2. Craftax-1M Challenge

    • sample efficiency

    experiments can take only seconds to finish
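The "6 years of human gameplay" figure for the Craftax-1B challenge checks out from the stated numbers (a quick sanity computation):

```python
steps = 1_000_000_000        # Craftax-1B interaction budget
human_sps = 5                # steps/second, the authors' measured play rate
seconds = steps / human_sps
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years")  # ~6.3 years of continual play
```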

4 Experiments

4.1 Exploration Baselines
  • PPO
    • 4 layer MLP of width 512 for both policy and value networks.
    • fully connected networks consistently performed better
  • PPO-RNN

plus intrinsic-reward methods for additional exploration (e.g. ICM, RND, E3B).
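As a reference point for what an intrinsic-reward baseline looks like, here is a minimal count-based novelty bonus in plain Python (a generic sketch, not any of the paper's exact methods, which use learned density or embedding models rather than raw counts):

```python
import math
from collections import defaultdict

class CountBonus:
    """Intrinsic reward r_int = beta / sqrt(N(s)), where N(s) counts visits
    to a (hashable) state abstraction. Novel states pay the largest bonus."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state_key):
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])

cb = CountBonus()
print(cb.bonus("room_1"))  # 1.0    (first visit)
print(cb.bonus("room_1"))  # ~0.707 (second visit, bonus decays)
print(cb.bonus("room_2"))  # 1.0    (novel state)
```

The total reward seen by the agent is then the extrinsic reward plus this bonus, pushing the policy toward under-visited states.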

4.2 UED Baselines

Unsupervised Environment Design

  • PLR
  • ACCEL
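The UED baselines share a common core: keep a buffer of level seeds scored by learning potential (e.g. value-loss) and replay high-scoring ones. A minimal PLR-flavoured sketch in plain Python (illustrative; the real implementations add staleness weighting, and ACCEL additionally mutates levels):

```python
import random

class LevelBuffer:
    """Keep the top-k levels by score; sample replay levels rank-weighted."""

    def __init__(self, capacity=4, rng=None):
        self.capacity = capacity
        self.levels = {}             # seed -> score (e.g. value-loss)
        self.rng = rng or random.Random(0)

    def update(self, seed, score):
        self.levels[seed] = score
        if len(self.levels) > self.capacity:  # evict the lowest-scoring level
            worst = min(self.levels, key=self.levels.get)
            del self.levels[worst]

    def sample(self):
        # Rank-based weights: the highest-scoring level gets the largest weight.
        ranked = sorted(self.levels, key=self.levels.get, reverse=True)
        weights = [1.0 / (i + 1) for i in range(len(ranked))]
        return self.rng.choices(ranked, weights=weights, k=1)[0]

buf = LevelBuffer(capacity=2)
for seed, score in [(1, 0.2), (2, 0.9), (3, 0.5)]:
    buf.update(seed, score)
print(sorted(buf.levels))  # [2, 3] -- seed 1 (lowest score) was evicted
```

The adversary role from Section 2 is played here by the scoring rule: levels where the agent still has high regret or value error are proposed more often.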
4.3 Craftax-1B


  • Figure 3: the returns for the evaluated algorithms on Craftax-1B

The maximum achievable reward is 226, from 67 achievements:

  • Basic: 25
  • Intermediate: 18
  • Advanced: 15
  • Very Advanced: 9

So the maximum reward is 25×1 + 18×3 + 15×5 + 9×8 = 226.
  • Figure 4: the final achievement yields split by difficulty

  1. Interestingly, none of the tested exploration methods improved performance, and E3B in fact significantly reduced the reward.
  2. PPO-RNN makes significantly more progress than the others.
  3. Interestingly, the success rate on some simple achievements like DEFEAT_ZOMBIE decreases over time, with PPO-RNN doing the worst of all.

  • the stronger agents are trading off low-reward achievements in the overworld for high-reward ones underground.

  4. Interestingly, we also see that EAT_PLANT (notable for being perhaps the hardest achievement in the original Crafter) is entirely ignored by PPO-RNN and is actually best achieved by E3B,

  • indicating that while the intrinsic reward may not be helping with overall return, it does incentivise the agent to explore different parts of the environment.

5 Conclusion

We hope that Craftax will facilitate research into areas including exploration, continual learning, generalisation, skill acquisition and long term reasoning.
