On Reinforcement Learning for Full-length Game of StarCraft

This paper aims to use reinforcement learning (RL) to address the challenges of StarCraft, such as its huge state space and varying action space. It studies a hierarchical approach consisting of macro-actions and a two-layer hierarchical architecture, explains how the two policies in the hierarchy operate, describes how the macro-actions are generated, and presents a training algorithm; experiments on SC2LE achieve state-of-the-art results.


Research Topic

The goal is to use RL to address the challenges of StarCraft: huge state space, varying action space, long horizon, etc.

The paper studies a hierarchical approach that involves two levels of abstraction:

  • macro-actions
    The macro-actions are extracted from expert demonstration trajectories, which reduces the action space by an order of magnitude while remaining effective.
  • two-layer hierarchical architecture
    The architecture is modular and easy to scale.

The main contributions of this paper are as follows:

  • We investigate a hierarchical architecture which makes the large-scale SC2 problem easier to handle.
  • A simple yet effective training algorithm for this architecture is also presented.
  • We study in detail the impact of different training settings on our architecture.
  • Experimental results on SC2LE show that our method achieves state-of-the-art results.

Method

Overall Architecture

Hierarchical Architecture

Two kinds of policies (a controller and sub-policies) run on different timescales.
The controller chooses a sub-policy based on its current observation at every long time interval, and the chosen sub-policy picks a macro-action at every short time interval.

The whole process (a code sketch follows the list):

  1. At time $t_c$, the controller receives its global observation $s_{t_c}^{c}$ and chooses a sub-policy $i$ based on this state:
    $a_{t_c}^{c} = \Pi(s_{t_c}^{c}), \quad s_{t_c}^{c} \in S_c$
  2. The controller then waits for K time units while the $i$-th sub-policy makes its moves. Assume the current time is $t_i$ and the sub-policy's local observation is $s_{t_i}^{i}$; it picks the macro-action:
    $a_{t_i}^{i} = \pi_i(s_{t_i}^{i})$
  3. After the $i$-th sub-policy performs the macro-action $a_{t_i}^{i}$ in the game, it receives a reward and its next local observation, and the tuple $(s_{t_i}^{i}, a_{t_i}^{i}, r_{t_i}^{i}, s_{t_{i+1}}^{i})$ is stored for future training:
    $r_{t_i}^{i} = R_i(s_{t_i}^{i}, a_{t_i}^{i})$
  4. After K moves, control returns to the controller, which waits for its next decision point. At the same time, the controller receives the return of the chosen sub-policy $\pi_i$ and computes the reward of its action $a_{t_c}^{c}$ as follows:
    $r_{t_c}^{c} = r_{t_i}^{i} + r_{t_{i+1}}^{i} + \dots + r_{t_{i+K-1}}^{i}$
    The controller also obtains the next global state $s_{t_{c+1}}^{c}$, and the tuple is stored in its local buffer.
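
A minimal sketch of one such controller decision cycle in Python. The `env`, `controller`, and `sub_policies` objects and their methods are hypothetical, used only to illustrate how the two timescales interact (not the paper's actual code):

```python
def controller_step(env, controller, sub_policies, K):
    """One high-level decision: pick a sub-policy, then let it act for K macro-steps."""
    s_c = env.global_observation()                       # controller's global state s_c
    i = controller.select_subpolicy(s_c)                 # a_c = Pi(s_c)

    r_c = 0.0                                            # controller reward accumulates sub-rewards
    for _ in range(K):
        s_i = env.local_observation(i)                   # sub-policy's local state s_i
        a_i = sub_policies[i].select_macro_action(s_i)   # a_i = pi_i(s_i)
        r_i, s_i_next = env.execute_macro_action(a_i)    # r_i = R_i(s_i, a_i)
        sub_policies[i].buffer.append((s_i, a_i, r_i, s_i_next))
        r_c += r_i                                       # r_c = sum of the K sub-policy rewards

    s_c_next = env.global_observation()                  # next global state for the controller
    controller.buffer.append((s_c, i, r_c, s_c_next))
```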

Advantages of this hierarchical architecture:

  • Each sub-policy and the high-level controller have different state spaces.
  • The hierarchical structure also splits the tremendous action space A.
  • The hierarchical architecture effectively shortens the number of decision steps the overall policy must execute.

Generation of Macro-actions

The generation process of macro-actions is as follows (a simplified mining sketch follows the list):

  1. We collect expert trajectories, which are sequences of operations $a \in A$, from game replays.
  2. We use the PrefixSpan algorithm to mine the relationships between operations and combine related operations into action sequences $a^{seq}$ of maximum length $C$, constructing a set $A^{seq}$ defined as
    $A^{seq} = \{\, a^{seq} = (a_0, a_1, a_2, \dots, a_i) \mid a_i \in A \ \text{and}\ i \leqslant C \,\}$
  3. We sort this set by the frequency of each $a^{seq}$.
  4. We remove duplicated and meaningless ones and keep the top K. "Meaningless" refers to sequences such as continuous selections or camera movements.
  5. The reduced set is marked as the newly generated macro-action space $A^{\eta}$.
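
As a rough illustration of steps 2-4, the sketch below counts frequent contiguous operation subsequences as a simplified stand-in for the PrefixSpan mining used in the paper; the trajectory format, the `is_meaningless` filter, and `top_k` are assumptions for illustration only:

```python
from collections import Counter

def mine_macro_actions(trajectories, C, top_k, is_meaningless):
    """trajectories: expert operation sequences (each a list of action ids)."""
    counts = Counter()
    for traj in trajectories:
        # count every contiguous sub-sequence of length 2..C
        # (duplicate sequences collapse into a single counter key)
        for length in range(2, C + 1):
            for start in range(len(traj) - length + 1):
                counts[tuple(traj[start:start + length])] += 1

    # rank candidates by frequency and drop meaningless ones,
    # e.g. repeated unit selections or camera movements
    ranked = [seq for seq, _ in counts.most_common() if not is_meaningless(seq)]
    return ranked[:top_k]    # the reduced macro-action space A^eta
```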


Training Algorithm

(Figure: the paper's training algorithm)
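
Based on the buffers described above, the controller and each sub-policy are updated separately from their own stored transitions. Below is a rough sketch of such a training loop, assuming a generic on-policy update (e.g. PPO) and hypothetical helpers `run_episode` and `policy_update`; it is an illustration, not the paper's exact algorithm:

```python
def train(env, controller, sub_policies, K, num_iterations, run_episode, policy_update):
    """run_episode: plays one full game and fills all buffers (e.g. by repeatedly
    calling controller_step from the sketch above); policy_update: any on-policy
    update (e.g. PPO) applied to one policy and its buffer."""
    for _ in range(num_iterations):
        # collect hierarchical experience for this iteration
        run_episode(env, controller, sub_policies, K)

        # update the high-level controller from its (s_c, i, r_c, s_c') tuples
        policy_update(controller.policy, controller.buffer)
        controller.buffer.clear()

        # update each sub-policy from its own (s_i, a_i, r_i, s_i') tuples
        for sub in sub_policies:
            policy_update(sub.policy, sub.buffer)
            sub.buffer.clear()
```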
