IV. METHOD
In this section, we first present our formulation of VLN
as a multi-turn process (Section IV-A). We then describe
the ActiveVLN framework (Section IV-B), which enables
learning from self-generated trajectories through active exploration (Section IV-B.2) and employs a dynamic early-stopping strategy to enhance RL training efficiency (Section IV-B.3). Finally, we provide engineering details for further acceleration of RL training (Section IV-C).
A. Multi-Turn Paradigm
Following video-based MLLMs, most prior end-to-end VLN models adopt the single-turn paradigm (Figure 2a), where each action is predicted from the instruction and past observations:

a_t \sim \pi_\theta\!\left(a_t \mid I, V_{1:t}\right).    (3)
In contrast, we adopt the multi-turn paradigm (Figure 2b), where actions are modeled autoregressively from both past observations and actions:

a_t \sim \pi_\theta\!\left(a_t \mid I, \{(V_i, a_i)\}_{i=1}^{t-1}, V_t\right).    (4)

This paradigm offers several advantages. First, it naturally enables KV-cache reuse for efficient inference. Second, while the single-turn paradigm breaks actions within the same episode into independent pieces, the multi-turn formulation allows actions to be packed together, accelerating training. Most importantly, it enables gradients associated with trajectory outcomes to be propagated to all preceding actions. We find this property to be crucial for the success of RL in VLN (see Section V-F.2).
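To make the multi-turn formulation concrete, the sketch below shows one way such a rollout loop could be organized for a chat-style MLLM. The `policy.generate`, `env.reset`, and `env.step` interfaces and the message schema are illustrative assumptions, not the actual ActiveVLN implementation.

```python
# Minimal sketch of a multi-turn rollout (Eq. 4). Interface names
# (policy.generate, env.reset, env.step) are hypothetical placeholders.
def rollout_multi_turn(policy, env, instruction, max_steps=64):
    # The instruction I opens the dialogue; each later turn appends the new
    # observation V_t and the predicted action a_t, so the whole history
    # {(V_i, a_i)} stays in context and earlier KV-cache entries can be reused.
    messages = [{"role": "user", "content": instruction}]
    obs = env.reset()
    for _ in range(max_steps):
        messages.append({"role": "user", "content": {"image": obs}})
        action = policy.generate(messages)   # a_t ~ pi_theta(. | I, {(V_i, a_i)}_{i<t}, V_t)
        messages.append({"role": "assistant", "content": action})
        obs, done = env.step(action)         # execute the predicted action in the simulator
        if done:                             # STOP issued or the episode terminated
            break
    return messages                          # one packed trajectory, reusable for IL or RL
```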
Fig. 1: Overview of ActiveVLN. In Stage 1, ActiveVLN performs imitation learning (IL) using expert trajectories. In Stage 2, it conducts
multi-turn reinforcement learning (RL), autonomously collecting trajectories in the simulator, receiving rewards that encourage progress
toward the goal, and updating the policy via GRPO. Key components, including the dynamic early-stopping strategy, scene preloading,
and scene caching, are incorporated to ensure efficient training during RL.
Fig. 2: Comparison between the single-turn and multi-turn paradigms. (a) Single-turn paradigm: each action is predicted from the instruction and past observations only. (b) Multi-turn paradigm: actions are generated autoregressively from the instruction, past observations, and past actions, which allows training signals from future steps to backpropagate and refine earlier predictions.
B. ActiveVLN Framework
1) Compact Action Prediction: The raw VLN-CE action space comprises four primitive actions: FORWARD, TURN LEFT, TURN RIGHT, and STOP. Prior work augments this space by merging consecutive actions of the same type [3]–[5]. For instance, three FORWARD steps (each 25cm) can be merged into a single FORWARD 75cm action. This augmentation diversifies action granularity and shortens episode length, improving training efficiency. Building on this, we adopt action chunking to further reduce episode length, where the agent predicts up to three future actions at once rather than a single action per step.
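As a rough illustration of this compact action space, the sketch below merges consecutive FORWARD primitives and then groups the merged actions into chunks of at most three; the helper names, action-string formats, and the uncapped merge are our assumptions for illustration, not the exact scheme used in the paper.

```python
# Illustrative sketch: merge consecutive FORWARD primitives (25cm each) and
# emit chunks of up to three actions per turn. Names/formats are assumptions.
def merge_forwards(primitives):
    merged, run = [], 0
    for a in primitives:
        if a == "FORWARD":
            run += 1                              # accumulate consecutive 25cm steps
            continue
        if run:
            merged.append(f"FORWARD {25 * run}cm")
            run = 0
        merged.append(a)                          # TURN LEFT / TURN RIGHT / STOP
    if run:
        merged.append(f"FORWARD {25 * run}cm")
    return merged

def chunk_actions(actions, chunk_size=3):
    # Action chunking: the agent predicts up to three future actions at once.
    return [actions[i:i + chunk_size] for i in range(0, len(actions), chunk_size)]

# ["FORWARD"]*3 + ["TURN LEFT", "FORWARD", "STOP"]
#   -> merged: ["FORWARD 75cm", "TURN LEFT", "FORWARD 25cm", "STOP"]
#   -> chunks: [["FORWARD 75cm", "TURN LEFT", "FORWARD 25cm"], ["STOP"]]
```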
2) Active Exploration: Since MLLMs are not pre-trained
for VLN tasks, we start with imitation learning (IL) using
a small number of expert demonstrations to initialize the
navigation policy. The IL objective is:
L_{\mathrm{IL}} = -\sum_{t} \sum_{i=1}^{n_t} \log P\!\left(a_t^{i} \mid I, H_{<t}, a_t^{<i}\right),    (5)

where I is the navigation instruction, H_{<t} is the interaction history up to step t, and a_t^{<i} are the actions already generated in the current chunk. Formally, the history is:
H_{<t} = \{V_1, A_1, V_2, A_2, \ldots, V_{t-1}, A_{t-1}, V_t\},    (6)

where V_t is the observation at step t, and A_t = \{a_t^{1}, a_t^{2}, \ldots, a_t^{n_t}\} is a sequence of low-level actions.
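As a minimal sketch of how the IL objective in Eq. (5) could be computed for a causal MLLM, the snippet below applies next-token cross-entropy only on expert-action tokens, assuming a HuggingFace-style `model(input_ids).logits` interface and a precomputed `action_mask`; both are assumptions rather than the paper's actual code.

```python
import torch.nn.functional as F

# Sketch of the IL loss (Eq. 5): next-token cross-entropy restricted to the
# expert-action tokens; instruction and observation tokens are masked out.
def il_loss(model, input_ids, action_mask):
    logits = model(input_ids).logits[:, :-1, :]   # predict token t+1 from prefix <= t
    targets = input_ids[:, 1:]
    mask = action_mask[:, 1:].float()             # 1 only where the label is an action token
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    return (nll * mask.reshape(-1)).sum() / mask.sum().clamp(min=1)
```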
IL provides a good starting point, but it has a key limitation: the agent only learns to mimic expert trajectories.
Once it encounters unfamiliar situations, it has no mechanism
to recover or adapt. To overcome this, we introduce active
exploration through reinforcement learning (RL). Here, the
agent is no longer restricted to expert data. Instead, it can
interact with the environment on its own: given an instruction, it predicts an action, executes it in the simulator, and
observes the outcome. By repeating this loop until it issues
a stop action (or an exception occurs), the agent actively
generates diverse trajectories, learns from its successes and
failures, and gradually improves its policy.
For optimization, we use Group Relative Policy Optimization (GRPO) [31]. GRPO samples G candidate trajectories
for the same instruction and compares them within the
group. Trajectories that perform better than the group average
are reinforced, while weaker ones are suppressed. The RL
objective is:
L_{\mathrm{RL}} = \mathbb{E}_{\{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( \frac{\pi_\theta}{\pi_{\theta_{\mathrm{old}}}} A_{i,t},\; \mathrm{clip}\!\left( \frac{\pi_\theta}{\pi_{\theta_{\mathrm{old}}}},\, 1-\epsilon,\, 1+\epsilon \right) A_{i,t} \right) \right],    (7)
where oi is the i-th trajectory with length |oi|, πθ and πθold
are the new and old policies, ϵ is the clipping parameter, and
Ai,t is the estimated advantage.
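The snippet below sketches how Eq. (7) could be realized in practice: trajectory rewards are standardized within their group to obtain A_{i,t} (the mean/std normalization follows the common GRPO recipe of [31]; details may differ from the paper), and a token-level clipped surrogate is applied as a loss. The per-trajectory 1/|o_i| averaging is simplified to a flat masked mean here.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: shape (G,), one scalar per trajectory sampled for the same
    # instruction. Each trajectory's advantage is its reward standardized
    # within the group, then broadcast to all of its tokens as A_{i,t}.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new, logp_old, advantages, mask, eps_clip=0.2):
    # Token-level clipped surrogate of Eq. (7), written as a loss (negated
    # objective). logp_* are per-token log-probs; mask selects action tokens.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * advantages
    per_token = -torch.min(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```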
We use a soft success reward:

R = 15 \cdot \mathbb{I}(\mathrm{success}) \cdot \frac{d_{\mathrm{goal}}}{3},    (8)
where dgoal is the geodesic distance to the goal. The indicator
I(success) = 1 if the agent issues a valid stop within 3
meters of the goal, and 0 otherwise. In this way, the agent is
no longer just imitating experts but is encouraged to explore
actively, discover different ways of reaching the goal, and
improve by trial and error. This self-driven learning process
is the key to stronger generalization in unseen environments.
3) Dynamic Early-Stopping Strategy: Trajectory rollout
time can account for over half of total RL training time,
making it the primary bottleneck. We observe that excessively long trajectories often dominate this time, and in most
cases correspond to unsuccessful attempts in which the agent
wanders aimlessly or explores irrelevant regions.
To address this issue, we introduce a dynamic early-stopping strategy that adaptively terminates unpromising
rollouts. Specifically, a trajectory is stopped and marked as
failed once its length exceeds a threshold Tmax, defined as:
T_{\max} = \alpha_{\mathrm{roll}} \cdot |\tau^{*}|,    (9)
where |τ∗| is the length of the oracle (expert) trajectory,
and αroll > 1 is a tolerance factor that specifies how much
deviation from the oracle is acceptable (we set αroll = 2 in
our experiments).
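A minimal sketch of the early-stopping check, with |τ∗| passed in as `oracle_len` and the default tolerance matching the paper's setting of α_roll = 2; the function name and rollout-loop usage are illustrative.

```python
# Dynamic early-stopping check (Eq. 9): a rollout is terminated and marked as
# failed once its length exceeds T_max = alpha_roll * |tau*|.
def exceeded_rollout_budget(num_steps, oracle_len, alpha_roll=2.0):
    t_max = alpha_roll * oracle_len
    return num_steps > t_max

# Inside the rollout loop (illustrative):
#   if exceeded_rollout_budget(step, len(expert_trajectory)):
#       done = True   # stop the episode and count it as a failed trajectory
```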
This strategy avoids wasting computation on hopeless rollouts while keeping the cutoff loose enough to tolerate reasonable exploration, ultimately leading to more efficient training.
C. Engineering Details in RL
To further accelerate RL training, we adopt several techniques: Scene caching stores frequently accessed scene data
in memory, enabling faster loading when the same scene
is revisited. Scene preloading pipelines scene loading with
policy updates, reducing idle time during training. These
techniques cut down scene-loading overhead and improve
training efficiency. In addition, similar to [35], we decouple
the simulator from the training server by deploying it as a
standalone HTTP server, which allows scalable and parallel
execution of multiple navigation environments.
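The sketch below illustrates what scene caching and scene preloading could look like; `load_scene_from_disk`, the cache size, and the threading scheme are placeholders rather than the actual Habitat/VLN-CE interfaces used in the paper.

```python
import threading
from functools import lru_cache

def load_scene_from_disk(scene_id: str):
    """Placeholder for the real (expensive) simulator scene load."""
    ...

# Scene caching: keep recently used scenes in memory so revisiting the same
# scene skips disk loading entirely.
@lru_cache(maxsize=8)
def load_scene(scene_id: str):
    return load_scene_from_disk(scene_id)

# Scene preloading: start loading the next scene in a background thread so it
# overlaps with the current policy update instead of blocking the rollout.
def preload_scene_async(scene_id: str) -> threading.Thread:
    t = threading.Thread(target=load_scene, args=(scene_id,), daemon=True)
    t.start()
    return t
```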