StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue
manner. Built on LLaVA-Video [2] as the foundational Video-LLM, we extend it for interleaved
vision, language, and action modeling. The overall framework of StreamVLN is shown in Figure 1.
We briefly introduce the autoregressive generation in continuous multi-turn dialogues for a streaming
VLN process (Section 3.1). For both effective context modeling of long sequences and efficient
computation for real-time interaction, StreamVLN has: (1) a fast-streaming dialogue context with
a sliding-window KV cache (Section 3.2); and (2) a slow-updating memory via token pruning
(Section 3.3). Finally, we describe how we curate the navigation data and incorporate diverse
multimodal data for multi-task training (Section 3.4).
Figure 1: Framework of StreamVLN. The input consists of a language instruction and a stream of RGB images. Each navigation episode is framed as a multi-turn dialogue, where the agent continually queries for the next actions. To support long-horizon reasoning while maintaining a manageable context size and low latency, we adopt a fixed-size sliding window to retain recent dialogue history. The context in inactive windows is updated by token pruning to reduce memory overhead.
3.1 Preliminary: Continuous Multi-Turn Autoregressive Generation

A multi-turn dialogue session for VLN consists of a sequence of interleaved observations and actions. In each dialogue $d_i = (o_i, a_i)$, the VLN model receives a new observation $o_i$ and produces an action response $a_i$ conditioned on both the current input and the dialogue history. The full input sequence at step $i$ is constructed as $o_1 a_1 o_2 a_2 \ldots o_{i-1} a_{i-1}$. In this streaming setting, new tokens from $o_i$ are appended to the token stream continuously, and the response $a_i$ is generated token by token via autoregressive decoding. For each dialogue turn, Transformer-based LLMs first perform a prefill phase to encode the input tokens, caching their key/value (KV) states in the attention layers. These cached KV pairs are then used in the decoding phase to generate new tokens. Without KV cache reuse across turns, the model must repeat this prefill over all previous tokens for every new dialogue turn.
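As a concrete illustration of this prefill/decode pattern with cross-turn KV-cache reuse, the sketch below runs a generic Hugging Face causal LM in a multi-turn loop. The checkpoint name and the plain-text turn format are placeholder assumptions for illustration only, not StreamVLN's actual interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; StreamVLN's language backbone (from LLaVA-Video)
# exposes the same causal-LM interface.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct").eval()

past = None  # KV cache shared across dialogue turns


@torch.no_grad()
def next_turn(new_turn_text: str, max_new_tokens: int = 16) -> str:
    """Prefill only the new turn's tokens, then decode the response greedily."""
    global past
    new_ids = tok(new_turn_text, return_tensors="pt").input_ids
    # Prefill phase: encode the new observation/instruction tokens while
    # reusing the cached KV states of all previous turns.
    out = model(input_ids=new_ids, past_key_values=past, use_cache=True)
    past = out.past_key_values

    generated = []
    next_id = out.logits[:, -1:].argmax(dim=-1)
    for _ in range(max_new_tokens):
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id)
        # Decoding phase: one token per step, each step extends the KV cache.
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
    return tok.decode(torch.cat(generated, dim=-1)[0]) if generated else ""


# Each call appends the new turn to the cached stream and decodes a response
# without re-prefilling all earlier observation/action tokens.
print(next_turn("Observation 1: a hallway with a door ahead. Next action?"))
print(next_turn("Observation 2: the door is now open. Next action?"))
```

Skipping cross-turn reuse would mean re-encoding $o_1 a_1 \ldots o_{i-1} a_{i-1}$ at every turn, which is exactly the prefill cost that the sliding-window design in Section 3.2 avoids while also bounding memory.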
3.2 Fast-Streaming Dialogue Context
While multi-turn KV cache reuse can eliminate over 99% of prefilling time, it introduces substantial memory overhead. As the number of dialogues increases, the KV cache grows linearly (e.g., 2K tokens can consume around 5 GB of memory), making long sessions impractical. In addition, existing Video-LLMs tend to exhibit degraded reasoning performance when processing overly long contexts.

To manage the dialogue context, we adopt a sliding-window KV cache over continuous dialogues, retaining a fixed number $N$ of recent dialogues in an active window: $W_j = [o_{i-N+1} a_{i-N+1} \ldots o_i a_i]$. When the window reaches capacity, the key/value states are offloaded from the LLM, and the states of non-observation dialogue tokens, such as prompts and generated actions, are immediately discarded. For the new sliding window, the token states from past windows are processed into memory token states $\{M_0, \ldots, M_j\}$ (as detailed in Section 3.3). Formally, for the latest observation $o_i$, the decoder generates $a_i$ based on the cached token states and the current window's KV cache:
$$a_i^{W_{j+1}} = \mathrm{Decoder}\big(o_i, \{M_0, \ldots, M_j\}, \{k_{i-N+1} v_{i-N+1}, \ldots, k_{i-1} v_{i-1}\}\big).$$
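The bookkeeping implied by this formulation can be sketched as follows. The class and method names are hypothetical, and the `compress` callback stands in for the memory-update step of Section 3.3; actual KV tensors are abstracted as opaque lists.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple

KVStates = list  # per-layer key/value tensors of a group of tokens (hypothetical)


@dataclass
class TurnKV:
    """Cached states of one dialogue turn d_i = (o_i, a_i)."""
    observation_kv: KVStates  # KV states of the observation tokens o_i
    other_kv: KVStates        # prompt and generated-action token states


@dataclass
class SlidingWindowKVCache:
    """Keep the last N turns active; offload full windows into memory tokens."""
    window_size: int                                # N recent dialogues
    compress: Callable[[List[KVStates]], KVStates]  # memory update (Sec. 3.3)
    active: Deque[TurnKV] = field(default_factory=deque)
    memory: List[KVStates] = field(default_factory=list)  # {M_0, ..., M_j}

    def add_turn(self, turn: TurnKV) -> None:
        self.active.append(turn)
        if len(self.active) == self.window_size:
            self._offload_window()

    def _offload_window(self) -> None:
        # Offload the whole window: non-observation token states (prompts,
        # actions) are discarded immediately; observation KV states are
        # compressed into one memory entry M_j for the next window.
        evicted = list(self.active)
        self.active.clear()
        self.memory.append(self.compress([t.observation_kv for t in evicted]))

    def decoder_context(self) -> Tuple[List[KVStates], List[KVStates]]:
        # What the decoder attends to for the latest observation o_i:
        # memory tokens from past windows plus the current window's KV cache.
        return self.memory, [t.observation_kv + t.other_kv for t in self.active]
```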
3.3 Slow-Updating Memory Context
Balancing temporal resolution and fine-grained spatial perception within a limited context length
remains a key challenge for Video-LLMs. Rather than compressing video tokens at the feature level
(e.g., through average pooling), which hinders the reuse of the KV cache from previous dialogues, we
retain high image resolution while selectively discarding spatially and temporally redundant tokens.
We find that this approach better preserves the transferability of Video-LLMs.
To reduce temporal redundancy, we adopt a simple fixed-number sampling strategy following [5], as varying lengths of memory tokens may induce a temporal duration bias and reduce the model's robustness across different planning horizons. To further eliminate spatial redundancy across frames, we design a voxel-based spatial pruning strategy. Specifically, we back-project the 2D image patches from the video stream into a shared 3D space using depth information. By discretizing this 3D space into uniform voxels, we can track the voxel indices of the patch tokens over time. If multiple tokens from different frames within a given duration are projected into the same voxel, only the token from the most recent observation is retained, as detailed in Algorithm 1. The voxel pruning mask M is then used to select the preserved token states.

Algorithm 1 Voxel-Based Spatial Pruning
1: Input: voxel map V ∈ Z^{T×H×W}, stride K, threshold θ
2: Output: pruning mask M ∈ {0,1}^{T×H×W}
3: Initialize M ← 0, map latest ← ∅
4: for each token (t, x, y) with V_{t,x,y} ≥ 0 do
5:   p ← ⌊t/K⌋, v ← V_{t,x,y}
6:   if (p, v) not in latest or t is newer then
7:     latest[(p, v)] ← (t, x, y)
8:   end if
9: end for
10: Set M_{t,x,y} ← 1 for all (t, x, y) ∈ latest
11: For each t, if Σ_{x,y} M_{t,x,y} < θ · H · W, set M_{t,:,:} ← 0
12: return M
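A direct NumPy transcription of Algorithm 1 might look as follows. This is an illustrative sketch (the voxel map V, stride K, and threshold θ follow the algorithm's notation), not the authors' released code.

```python
import numpy as np


def voxel_spatial_pruning(V: np.ndarray, K: int, theta: float) -> np.ndarray:
    """Keep, per (temporal chunk, voxel), only the most recent patch token.

    V:     (T, H, W) integer voxel index of each patch token, < 0 if invalid.
    K:     temporal stride; frames t with the same floor(t / K) share a chunk.
    theta: frames keeping fewer than theta * H * W tokens are dropped entirely.
    Returns a {0,1} pruning mask M of shape (T, H, W).
    """
    T, H, W = V.shape
    M = np.zeros((T, H, W), dtype=np.uint8)
    latest = {}  # (chunk, voxel) -> (t, x, y) of the most recent token

    for t in range(T):
        chunk = t // K
        for x in range(H):
            for y in range(W):
                v = V[t, x, y]
                if v < 0:
                    continue  # no valid depth / back-projection for this patch
                key = (chunk, int(v))
                # Iterating t in increasing order, a later frame overwrites
                # earlier tokens that fall into the same voxel.
                if key not in latest or t >= latest[key][0]:
                    latest[key] = (t, x, y)

    for t, x, y in latest.values():
        M[t, x, y] = 1

    # Drop frames whose surviving tokens fall below the threshold.
    keep_per_frame = M.reshape(T, -1).sum(axis=1)
    M[keep_per_frame < theta * H * W] = 0
    return M
```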
3.4 Co-Training with Multi-Source Data.
Vision-Language Action Data. We collect navigation-specific training data using the Habitat simulator across multiple public VLN datasets. Specifically, we collect 450K samples (video clips) from 60 Matterport3D [25] (MP3D) environments, sourced from R2R [7], R2R-EnvDrop [26] and RxR [8]. To further improve generalization through increased scene diversity, we incorporate an additional 300K samples from a subset of ScaleVLN [19], spanning 700 Habitat Matterport3D [27] (HM3D) scenes. In addition, we adopt the DAgger [28] algorithm to enhance the model's robustness and generalization ability in novel scenes and during error recovery. Using Habitat's shortest-path follower as the expert policy, we collect corrective demonstrations on model rollouts after the initial training stage. These DAgger-collected samples (240K) are then incorporated into the training set for co-training.

Figure 2: Co-Training Data Recipe of StreamVLN. Vision-language-action (VLA) data accounts for 67% (MP3D 31%, HM3D 20%, DAgger 16%) and general multi-modal data for 33% (VQA 17%, MMC4 16%).
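The DAgger collection stage can be sketched as a generic loop in which the student policy drives the rollout while the expert relabels every visited state. The interfaces below are hypothetical placeholders; in our setting, Habitat's shortest-path follower would play the role of `expert_act`.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: `reset`/`step` mimic a Habitat-style episode loop,
# `student_act` is the partially trained navigation policy, and `expert_act`
# wraps the shortest-path follower toward the episode goal.
Observation = dict
Action = int


def collect_dagger_episode(
    reset: Callable[[], Observation],
    step: Callable[[Action], Tuple[Observation, bool]],
    student_act: Callable[[Observation], Action],
    expert_act: Callable[[Observation], Action],
    max_steps: int = 500,
) -> List[Tuple[Observation, Action]]:
    """Roll out the student, but label every visited state with the expert action."""
    demos: List[Tuple[Observation, Action]] = []
    obs = reset()
    for _ in range(max_steps):
        expert_action = expert_act(obs)      # corrective label from the expert
        demos.append((obs, expert_action))
        obs, done = step(student_act(obs))   # but follow the student's action
        if done:
            break
    return demos
```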
General Vision-Language Data. To retain the general reasoning capabilities of the pretrained
Video-LLM, we incorporate a diverse set of multimodal training data that complements navigation
supervision. Specifically, we include 248K video-based visual question-answering (VQA) samples
sourced from publicly available datasets LLaVA-Video-178K [29] and ScanQA [30], which combine
general video QA with 3D scene understanding to support spatial-temporal and geometric reasoning.
To further augment the model’s capacity for multi-turn vision-language interactions, we incorporate
230K interleaved image-text samples from MMC4 [31], which strengthens its ability to parse and
generate contextually coherent responses with interleaved visual and textual reasoning.