Research Topic
Using RL to solve StarCraft poses several challenges: a huge state space, a varying action space, a long horizon, etc.
The paper studies a hierarchical approach that involves two levels of abstraction:
- macro-actions: extracted from expert demonstration trajectories, they reduce the action space by an order of magnitude while remaining effective.
- two-layer hierarchical architecture: modular and easy to scale.
The main contributions of this paper are as follows:
- We investigate a hierarchical architecture which makes the large-scale SC2 problem easier to handle.
- A simple yet effective training algorithm for this architecture is also presented.
- We study in detail the impact of different training settings on our architecture.
- Experimental results on SC2LE show that our method achieves state-of-the-art results.
Method
Hierarchical Architecture
Two kinds of policies (a controller and sub-policies) run on different timescales.
The controller chooses a sub-policy based on its current observation at a long time interval, and the chosen sub-policy picks a macro-action at a short time interval.
The whole process runs as follows (a code sketch is given after this list):
- At time $t_c$, the controller gets its global observation $s_{t_c}^{c}$ and chooses a sub-policy $i$ based on this state:
$$a_{t_c}^{c} = \Pi(s_{t_c}^{c}), \quad s_{t_c}^{c} \in S_c$$
- The controller then waits for $K$ time units while the $i$-th sub-policy makes its moves. Assuming the current time is $t_i$ and the local observation is $s_{t_i}^{i}$, the sub-policy gets the macro-action:
$$a_{t_i}^{i} = \pi_i(s_{t_i}^{i})$$
- After the $i$-th sub-policy executes the macro-action $a_{t_i}^{i}$ in the game, it receives a reward and its next local observation, and the tuple $(s_{t_i}^{i}, a_{t_i}^{i}, r_{t_i}^{i}, s_{t_{i+1}}^{i})$ is stored for future training.
$$r_{t_i}^{i} = R_i(s_{t_i}^{i}, a_{t_i}^{i})$$
- After $K$ moves, control returns to the controller, which waits for the next decision. At the same time, the controller collects the return of the chosen sub-policy $\pi_i$ and computes the reward of its own action $a_{t_c}^{c}$ as follows:
$$r_{t_c}^{c} = r_{t_i}^{i} + r_{t_{i+1}}^{i} + \dots + r_{t_{i+K-1}}^{i}$$
- The controller also gets the next global state $s_{t_{c+1}}^{c}$, and the tuple $(s_{t_c}^{c}, a_{t_c}^{c}, r_{t_c}^{c}, s_{t_{c+1}}^{c})$ is stored in its local buffer.
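Below is a minimal sketch of this two-timescale controller / sub-policy loop. The `env`, `controller`, and `sub_policies` interfaces and the value of `K` are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
# Minimal sketch of the two-timescale loop described above.
# All interfaces (env, controller, sub_policies) are assumed placeholders.

def run_episode(env, controller, sub_policies, K=8):
    """Run one episode; K is the number of sub-policy steps per controller decision."""
    s_c = env.global_observation()               # s^c_{t_c}: controller's global observation
    done = False
    while not done:
        i = controller.act(s_c)                  # a^c_{t_c} = Pi(s^c_{t_c}): pick sub-policy i
        r_c = 0.0                                # accumulates the controller's reward r^c_{t_c}
        for _ in range(K):                       # sub-policy i acts for K time units
            s_i = env.local_observation(i)       # s^i_{t_i}: local observation of sub-policy i
            a_i = sub_policies[i].act(s_i)       # a^i_{t_i} = pi_i(s^i_{t_i}): macro-action
            r_i, done = env.step_macro(a_i)      # execute the macro-action, get r^i_{t_i}
            s_i_next = env.local_observation(i)
            sub_policies[i].buffer.append((s_i, a_i, r_i, s_i_next))
            r_c += r_i                           # r^c_{t_c} = sum of the K sub-policy rewards
            if done:
                break
        s_c_next = env.global_observation()      # s^c_{t_c+1}: next global state
        controller.buffer.append((s_c, i, r_c, s_c_next))
        s_c = s_c_next
```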
The advantages of this hierarchical architecture:
- The high-level controller and each sub-policy have their own, separate state spaces.
- The hierarchical structure also splits the enormous action space $A$.
- The hierarchical architecture effectively shortens the execution horizon of the strategy, i.e. the number of decisions each policy has to make (illustrated below).
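As a rough illustration of the last point (the symbols $T$ and $m$ are illustrative assumptions, not quantities from the paper): if an episode contains $T$ atomic operations, a macro-action covers on average $m$ of them, and the controller acts once every $K$ macro-actions, the decision horizons shrink as
$$\underbrace{T}_{\text{atomic operations}} \;\longrightarrow\; \underbrace{T/m}_{\text{sub-policy decisions}} \;\longrightarrow\; \underbrace{T/(mK)}_{\text{controller decisions}}.$$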
Generation of Macro-actions
The generation process of macro-actions is as follows (a code sketch is given after this list):
- We collect expert trajectories, which are sequences of operations $a \in A$, from game replays.
- We use the PrefixSpan algorithm to mine the relationships between operations and combine related operations into action sequences $a^{seq}$ whose maximum length is $C$, constructing a set $A^{seq}$ defined as
$$A^{seq} = \{\, a^{seq} = (a_0, a_1, a_2, \dots, a_i) \mid a_i \in A \ \text{and} \ i \leqslant C \,\}$$
- We sort this set by the frequency of each $a^{seq}$.
- We remove duplicated and meaningless sequences and keep the top $K$ ones. "Meaningless" refers to sequences such as repeated selections or camera movements.
- The reduced set is denoted as the newly generated macro-action space $A^{\eta}$.
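A simplified sketch of this mining step is given below. It counts frequent contiguous operation subsequences of length up to $C$ as a stand-in for PrefixSpan (which mines general, not necessarily contiguous, subsequences), and `is_meaningless` is a hypothetical placeholder for the filtering rule described above.

```python
from collections import Counter

# Simplified stand-in for the PrefixSpan mining step: count contiguous operation
# n-grams up to length C, drop "meaningless" ones, and keep the most frequent
# top-K as macro-actions. `is_meaningless` is a hypothetical placeholder filter.

def is_meaningless(seq):
    # Placeholder heuristic: e.g. a run of identical operations such as
    # repeated selections or camera movements.
    return len(set(seq)) == 1

def mine_macro_actions(trajectories, C=4, top_k=100):
    counts = Counter()                             # Counter also collapses duplicate sequences
    for traj in trajectories:                      # each traj: a list of operations a in A
        for n in range(2, C + 1):                  # candidate sequence lengths 2..C
            for start in range(len(traj) - n + 1):
                counts[tuple(traj[start:start + n])] += 1

    # Sort by frequency, filter out meaningless sequences, keep the top-K.
    candidates = [seq for seq, _ in counts.most_common() if not is_meaningless(seq)]
    return candidates[:top_k]                      # reduced macro-action space A^eta
```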