3 METHOD

3.1 PRELIMINARY

Navigation task definition. The task of Vision-and-Language Navigation (VLN) in continuous environments is defined as follows. At timestep $t$, an embodied agent is provided with a natural language instruction $I$ of $l$ words and an ego-centric RGB video $O_t = \{x_0, \ldots, x_t\}$, where each frame $x_t \in \mathbb{R}^{3 \times H \times W}$. The agent's goal is to predict a low-level action $a_{t+1} \in \mathcal{A}$ for the subsequent step. The action space is defined as $\mathcal{A} = \{\text{Move Forward}, \text{Turn Left}, \text{Turn Right}, \text{Stop}\}$. Each low-level action corresponds to a fine-grained physical change: a small rotation (30°), a forward step (25 cm), or a stop, which allows for flexible maneuverability in continuous spaces. Upon executing the action $a_{t+1}$, the agent receives a new observation $x_{t+1}$. This process iterates until the agent executes the Stop action at the target location specified by the instruction.

Visual geometry grounded transformer (VGGT). Building upon traditional 3D reconstruction, recent learning-based end-to-end methods (Wang et al., 2025b; Ding et al., 2025a) employ neural networks to encode scene priors and directly predict 3D structure from multi-view images. VGGT (Wang et al., 2025b), built on a feed-forward transformer architecture, comprises three key components: an encoder that extracts single-image features, a fusion decoder that performs cross-frame interaction to generate geometric tokens $G_t \in \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times C}$, where $p$ is the patch size, and a task-specific prediction head for 3D attributes. The reconstruction pipeline can be formulated as:

$$\{G_t\}_{t=1}^{T} = \mathrm{Decoder}\big(\mathrm{Encoder}(\{x_t\}_{t=1}^{T})\big), \qquad (P_t, C_t) = \mathrm{Head}(G_t), \quad (1)$$

where a Multi-Layer Perceptron (MLP) head predicts a point map $P_t \in \mathbb{R}^{3 \times H \times W}$ and a per-pixel confidence map $C_t \in \mathbb{R}^{H \times W}$ from the geometric tokens. Since our focus is on extracting features that embed 3D geometric priors, rather than on directly outputting 3D attributes, we use only the encoder and the fusion decoder as our 3D visual geometry encoder.

3.2 DUAL IMPLICIT MEMORY

The limitations of traditional explicit semantic memory, including memory inflation, computational redundancy, and the loss of spatial information, coupled with the original VGGT's requirement to reprocess the entire sequence for each new frame, impede the real-time performance and effectiveness of streaming navigation. To address these challenges, we introduce VGGT as a spatial geometry encoder and propose a novel dual implicit memory paradigm for VLN, illustrated in Figure 2. This paradigm models spatial geometry and visual semantics as fixed-size, compact neural representations by caching the initial-window and sliding-window key-value (KV) pairs of the dual encoders. The spatial memory within the spatial geometry encoder is modeled as follows.

Implicit neural representation. In contrast to previous methods that store high-dimensional, unprocessed, explicit historical frames, we instead cache the historical KV pairs $M$ that have already been deeply processed by the network. These KV pairs, produced by the attention modules of the transformer, constitute high-level semantic abstractions and structured representations of the past environment. This implicit memory is not merely a compact, efficient storage entity, but a condensed knowledge representation refined by the neural network. It enables the agent to retrieve and reason over historical information at minimal computational cost.
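To make the pipeline in Eq. (1) concrete, the following is a minimal PyTorch sketch of a VGGT-style feed-forward reconstruction model: a per-frame patch encoder, a fusion stage in which tokens from all frames interact, and task heads for the point and confidence maps. All class names, layer choices, and sizes are illustrative assumptions, not the actual VGGT implementation; JanusVLN keeps only the geometric tokens $G_t$ and discards the heads.

```python
# Hedged sketch of a VGGT-style pipeline (Eq. 1); sizes and modules are placeholders.
import torch
import torch.nn as nn

class GeometryEncoderSketch(nn.Module):
    """Per-frame encoder + cross-frame fusion + optional 3D prediction heads."""
    def __init__(self, dim=256, patch=16, heads=8, layers=2):
        super().__init__()
        # Patch embedding stands in for the single-image feature encoder.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # A small transformer stands in for the fusion decoder that lets
        # tokens from different frames interact.
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(block, num_layers=layers)
        # Heads for the point map P_t and confidence map C_t (unused in JanusVLN).
        self.point_head = nn.Linear(dim, 3 * patch * patch)
        self.conf_head = nn.Linear(dim, patch * patch)

    def forward(self, frames):                       # frames: (T, 3, H, W)
        tokens = self.patch_embed(frames)            # (T, C, H/p, W/p)
        T, C, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)   # (T, h*w, C) per-frame tokens
        fused = self.fusion(tokens.reshape(1, T * h * w, C))   # cross-frame interaction
        G = fused.reshape(T, h * w, C)               # geometric tokens {G_t}
        return G, self.point_head(G), self.conf_head(G)

frames = torch.randn(4, 3, 224, 224)                 # a 4-frame clip
G, P, Cmap = GeometryEncoderSketch()(frames)
print(G.shape)                                       # torch.Size([4, 196, 256])
```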
Figure 2: The framework of JanusVLN. Given an RGB-only video stream and navigation instructions, JanusVLN utilizes a dual encoder to separately extract visual-semantic and spatial-geometric features. It concurrently caches historical key-values from the initial and recent sliding windows into a dual implicit memory to enable feature reuse and avoid redundant computation. Finally, these two complementary features are fused and fed into the LLM to predict the next action.

Hybrid incremental update. Rather than caching all historical KV pairs, the implicit neural representation uses a hybrid cache update strategy, which mitigates the significant memory overhead and performance degradation that arise from long navigation sequences. The strategy partitions the memory into two components. The first is a sliding-window queue $M_{\text{sliding}}$ with a capacity of $n$, which stores the KV caches of the most recent $n$ frames in a first-in, first-out manner. This mechanism keeps the model focused on the most immediate and relevant context, which is critical for real-time decision making. When the queue reaches capacity, the oldest frame's cache is evicted to accommodate the current frame, enabling dynamic incremental updates. The second component permanently retains the KV cache $M_{\text{initial}}$ of the initial few frames. The model exhibits sustained high attention weights toward these initial frames, which function as "attention sinks" (Xiao et al., 2024; Li et al., 2025c). These sinks provide critical global anchors for the entire navigation episode and effectively restore the performance that would otherwise degrade under pure sliding-window attention. By integrating these two mechanisms, we construct a dynamically updated, fixed-size implicit memory that preserves an acute perception of the recent environment while maintaining a long-term memory of the episode. For each incoming frame, we compute cross-attention between its image tokens and the implicit memory to directly retrieve historical information, thereby obviating redundant feature extraction from past frames:

$$G_t = \mathrm{Decoder}\big(\mathrm{CrossAttn}(\mathrm{Encoder}(x_t), \{M_{\text{initial}}, M_{\text{sliding}}\})\big). \quad (2)$$

Figure 3: Inference time comparison for the current frame at varying sequence lengths.

As shown in Figure 3, VGGT's inference time grows rapidly with each new frame because it must reprocess the entire sequence, resulting in an out-of-memory error on a 48 GB GPU with only 48 frames. In contrast, our approach avoids reprocessing historical frames, so its inference time increases only marginally, demonstrating excellent efficiency. For the semantic encoder and the LLM, we similarly retain the KV pairs of the initial and sliding windows. Moreover, these implicit memories and tokens can be visualized to inspect the spatial and semantic information they contain.
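A minimal sketch of the hybrid incremental update and the retrieval step in Eq. (2) is shown below. The tensor shapes, the per-frame (key, value) layout, and the single-head attention are assumptions for illustration only; the actual JanusVLN memory lives inside the encoders' attention layers rather than in a standalone class.

```python
# Hedged sketch: permanent initial-window KV + sliding-window FIFO KV, then
# cross-attention retrieval for the current frame. Shapes are illustrative.
from collections import deque
import torch

class DualWindowKVMemory:
    """Fixed-size implicit memory: attention-sink frames + recent-frame FIFO."""
    def __init__(self, num_initial=2, window=8):
        self.num_initial = num_initial
        self.initial = []                       # M_initial: never evicted
        self.sliding = deque(maxlen=window)     # M_sliding: FIFO of recent frames

    def update(self, kv):
        """kv: (key, value) tensors of one processed frame, e.g. (N_tokens, C)."""
        if len(self.initial) < self.num_initial:
            self.initial.append(kv)             # keep the first frames permanently
        else:
            self.sliding.append(kv)             # deque(maxlen=...) drops the oldest

    def gather(self):
        """Concatenate all cached keys/values for cross-attention retrieval."""
        frames = self.initial + list(self.sliding)
        keys = torch.cat([k for k, _ in frames], dim=0)
        values = torch.cat([v for _, v in frames], dim=0)
        return keys, values

def retrieve(query, memory):
    """Single-head cross-attention between current-frame tokens and the memory."""
    keys, values = memory.gather()
    attn = torch.softmax(query @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ values                        # (N_query, C) retrieved context

# Usage: stream frames, caching each frame's (K, V) after it is encoded once.
memory = DualWindowKVMemory(num_initial=2, window=8)
for t in range(20):
    k = v = torch.randn(196, 256)               # stand-in for one frame's KV tokens
    memory.update((k, v))
ctx = retrieve(torch.randn(196, 256), memory)    # context for the newest frame
print(ctx.shape)                                 # torch.Size([196, 256])
```

Because both windows are bounded, the memory footprint and the per-frame attention cost stay constant regardless of episode length, which is what keeps the inference time in Figure 3 nearly flat.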
3.3 JANUSVLN ARCHITECTURE

Building upon the dual implicit memory paradigm, we propose JanusVLN (Figure 2), which enhances spatial understanding without requiring costly 3D data (e.g., depth).

Decoupling visual perception: semantics and spatiality. To equip embodied agents with the dual capabilities of semantic understanding ("what it is") and spatial awareness ("where it is and how it relates"), JanusVLN adopts a dual-encoder architecture that decouples semantic and spatial information from the visual input. For the 2D semantic encoder, we adopt the original visual encoder of Qwen2.5-VL, which encodes the input frame $x_t$, interacting with the semantic memory, into semantic tokens:

$$S_t = \mathrm{Encoder}_{\text{sem}}(x_t), \quad S_t \in \mathbb{R}^{\frac{H}{p} \times \frac{W}{p} \times C}. \quad (3)$$

Additionally, Qwen2.5-VL (Bai et al., 2025) groups spatially adjacent $2 \times 2$ patches into a single image token to reduce computational cost, yielding $S'_t \in \mathbb{R}^{\frac{H}{2p} \times \frac{W}{2p} \times C}$. For the 3D spatial-geometric encoder, we employ the pre-trained encoder and fusion decoder of the VGGT (Wang et al., 2025b) model, which encodes the input frame, interacting with the spatial memory, into spatial-geometric tokens $G_t$.

Spatial-aware feature fusion. Given the semantic features $S'_t$ and the spatial-geometric features $G_t$, we first apply the spatial merging strategy of Qwen2.5-VL (Bai et al., 2025), which concatenates spatially adjacent $2 \times 2$ feature blocks within $G_t$ to form $G'_t \in \mathbb{R}^{\frac{H}{2p} \times \frac{W}{2p} \times C}$, thereby aligning its shape with that of $S'_t$. We then use a lightweight two-layer MLP projection to fuse the semantic and spatial-geometric information:

$$F_t = S'_t + \lambda \cdot \mathrm{MLP}(G'_t), \quad (4)$$

where $\lambda$ is the weight for the spatial-geometric features and $F_t$ denotes the final, spatial-geometry-enhanced visual features. Finally, the visual features, together with the text embedding of the instruction $I$, are fed into the MLLM backbone to generate the next action.
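The sketch below illustrates the spatial-aware fusion of Eq. (4): a $2 \times 2$ spatial merge of the geometric tokens followed by a two-layer MLP projection and a weighted residual add onto the semantic tokens. The channel width, the value of $\lambda$, and the assumption that the merge concatenates four $C$-dimensional tokens before projection are illustrative choices, not the exact JanusVLN configuration.

```python
# Hedged sketch of the spatial merge and fusion F_t = S'_t + lambda * MLP(G'_t).
import torch
import torch.nn as nn

def merge_2x2(tokens, h, w):
    """Concatenate each 2x2 patch-token neighbourhood: (h*w, C) -> (h/2*w/2, 4*C)."""
    C = tokens.shape[-1]
    grid = tokens.reshape(h, w, C)
    grid = grid.reshape(h // 2, 2, w // 2, 2, C).permute(0, 2, 1, 3, 4)
    return grid.reshape((h // 2) * (w // 2), 4 * C)

class SpatialAwareFusion(nn.Module):
    """Two-layer MLP projection of merged geometric tokens, added to semantic tokens."""
    def __init__(self, dim=256, lam=0.5):
        super().__init__()
        self.lam = lam
        # Project merged geometric tokens (4*dim) back to the semantic width (dim).
        self.mlp = nn.Sequential(nn.Linear(4 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, s_merged, g_tokens, h, w):
        g_merged = merge_2x2(g_tokens, h, w)         # align G_t's shape with S'_t
        return s_merged + self.lam * self.mlp(g_merged)

# Usage with a 14x14 patch grid (H/p = W/p = 14 for a 224x224 frame, p = 16).
h = w = 14
s_merged = torch.randn((h // 2) * (w // 2), 256)     # S'_t after the 2x2 merge
g_tokens = torch.randn(h * w, 256)                   # G_t from the geometry encoder
fuse = SpatialAwareFusion(dim=256, lam=0.5)
print(fuse(s_merged, g_tokens, h, w).shape)          # torch.Size([49, 256])
```

The residual formulation leaves the pre-trained semantic pathway intact when $\lambda$ is small, so the geometric branch acts as an additive spatial prior rather than replacing the MLLM's original visual features.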