On Reinforcement Learning for Full-length Game of StarCraft

This paper aims to use reinforcement learning (RL) to address the challenges of StarCraft, such as its huge state space and varying action space. It studies a hierarchical approach consisting of macro-actions and a two-layer hierarchical architecture, explains how the two policies in the hierarchy operate, describes how the macro-actions are generated, and presents a training algorithm; experiments on SC2LE achieve state-of-the-art results.


Research Topic

The goal is to use RL to address the challenges of StarCraft: huge state space, varying action space, long horizon, etc.

The paper studies a hierarchical approach that involves two levels of abstraction:

  • macro-actions
    The macro-actions are extracted from expert demonstration trajectories, which reduces the action space by an order of magnitude while remaining effective.
  • two-layer hierarchical architecture
    The architecture is modular and easy to scale.

The main contributions of this paper are as follows:

  • We investigate a hierarchical architecture which makes the large-scale SC2 problem easier to handle.
  • A simple yet effective training algorithm for this architecture is also presented.
  • We study in detail the impact of different training settings on our architecture.
  • Experimental results on SC2LE show that our method achieves state-of-the-art results.

Method

Overall Architecture

Hierarchical Architecture

Two kinds of policies (a controller and sub-policies) run on different timescales.
The controller chooses a sub-policy based on its current observation at every long time interval, and the chosen sub-policy picks a macro-action at every short time interval.

The whole process (a code sketch follows the list):

  1. At time $t_c$, the controller receives its global observation $s_{t_c}^{c}$ and chooses a sub-policy $i$ based on this state:
    $a_{t_c}^{c} = \Pi(s_{t_c}^{c}), \quad s_{t_c}^{c} \in S_c$
  2. The controller then waits for K time units while the $i$-th sub-policy makes its moves. Assume the current time is $t_i$ and the sub-policy's local observation is $s_{t_i}^{i}$; it picks the macro-action:
    $a_{t_i}^{i} = \pi_i(s_{t_i}^{i})$
  3. After the $i$-th sub-policy performs the macro-action $a_{t_i}^{i}$ in the game, it receives a reward and its next local observation, and the tuple $(s_{t_i}^{i}, a_{t_i}^{i}, r_{t_i}^{i}, s_{t_{i+1}}^{i})$ is stored for future training:
    $r_{t_i}^{i} = R_i(s_{t_i}^{i}, a_{t_i}^{i})$
  4. After K moves, control returns to the controller, which waits for its next decision point. At the same time, the controller receives the return of the chosen sub-policy $\pi_i$ and computes the reward of its action $a_{t_c}^{c}$ as follows:
    $r_{t_c}^{c} = r_{t_i}^{i} + r_{t_{i+1}}^{i} + \dots + r_{t_{i+K-1}}^{i}$
    The controller also obtains the next global state $s_{t_{c+1}}^{c}$, and the tuple is stored in its local buffer.
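
A minimal sketch of one such controller decision cycle in Python. The `env`, `controller`, and `sub_policies` objects and their methods are hypothetical, used only to illustrate how the two timescales interact (not the paper's actual code):

```python
def controller_step(env, controller, sub_policies, K):
    """One high-level decision: pick a sub-policy, then let it act for K macro-steps."""
    s_c = env.global_observation()                       # controller's global state s_c
    i = controller.select_subpolicy(s_c)                 # a_c = Pi(s_c)

    r_c = 0.0                                            # controller reward accumulates sub-rewards
    for _ in range(K):
        s_i = env.local_observation(i)                   # sub-policy's local state s_i
        a_i = sub_policies[i].select_macro_action(s_i)   # a_i = pi_i(s_i)
        r_i, s_i_next = env.execute_macro_action(a_i)    # r_i = R_i(s_i, a_i)
        sub_policies[i].buffer.append((s_i, a_i, r_i, s_i_next))
        r_c += r_i                                       # r_c = sum of the K sub-policy rewards

    s_c_next = env.global_observation()                  # next global state for the controller
    controller.buffer.append((s_c, i, r_c, s_c_next))
```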

Advantages of this hierarchical architecture:

  • Each sub-policy and the high-level controller have different state spaces.
  • The hierarchical structure also splits the tremendous action space A.
  • The hierarchical architecture effectively shortens the number of decision steps the overall policy must execute.

Generation of Macro-actions

The generation process of macro-actions is as follows (a simplified mining sketch follows the list):

  1. We collect expert trajectories, which are sequences of operations $a \in A$, from game replays.
  2. We use the PrefixSpan algorithm to mine the relationships between operations and combine related operations into action sequences $a^{seq}$ of maximum length $C$, constructing a set $A^{seq}$ defined as
    $A^{seq} = \{\, a^{seq} = (a_0, a_1, a_2, \dots, a_i) \mid a_i \in A \ \text{and}\ i \leqslant C \,\}$
  3. We sort this set by the frequency of each $a^{seq}$.
  4. We remove duplicated and meaningless ones and keep the top K. "Meaningless" refers to sequences such as continuous selections or camera movements.
  5. The reduced set is marked as the newly generated macro-action space $A^{\eta}$.
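
As a rough illustration of steps 2-4, the sketch below counts frequent contiguous operation subsequences as a simplified stand-in for the PrefixSpan mining used in the paper; the trajectory format, the `is_meaningless` filter, and `top_k` are assumptions for illustration only:

```python
from collections import Counter

def mine_macro_actions(trajectories, C, top_k, is_meaningless):
    """trajectories: expert operation sequences (each a list of action ids)."""
    counts = Counter()
    for traj in trajectories:
        # count every contiguous sub-sequence of length 2..C
        # (duplicate sequences collapse into a single counter key)
        for length in range(2, C + 1):
            for start in range(len(traj) - length + 1):
                counts[tuple(traj[start:start + length])] += 1

    # rank candidates by frequency and drop meaningless ones,
    # e.g. repeated unit selections or camera movements
    ranked = [seq for seq, _ in counts.most_common() if not is_meaningless(seq)]
    return ranked[:top_k]    # the reduced macro-action space A^eta
```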


Training Algorithm

(Figure: the paper's training algorithm)
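
Based on the buffers described above, the controller and each sub-policy are updated separately from their own stored transitions. Below is a rough sketch of such a training loop, assuming a generic on-policy update (e.g. PPO) and hypothetical helpers `run_episode` and `policy_update`; it is an illustration, not the paper's exact algorithm:

```python
def train(env, controller, sub_policies, K, num_iterations, run_episode, policy_update):
    """run_episode: plays one full game and fills all buffers (e.g. by repeatedly
    calling controller_step from the sketch above); policy_update: any on-policy
    update (e.g. PPO) applied to one policy and its buffer."""
    for _ in range(num_iterations):
        # collect hierarchical experience for this iteration
        run_episode(env, controller, sub_policies, K)

        # update the high-level controller from its (s_c, i, r_c, s_c') tuples
        policy_update(controller.policy, controller.buffer)
        controller.buffer.clear()

        # update each sub-policy from its own (s_i, a_i, r_i, s_i') tuples
        for sub in sub_policies:
            policy_update(sub.policy, sub.buffer)
            sub.buffer.clear()
```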
