The Information of Large Language Model Geometry

This post examines the information encoded in large language model (LLM) embeddings: it finds that representation entropy follows a power law with model size and proposes an entropy-based theoretical explanation. Using information theory and regression analysis, it studies how a newly generated token relates to the context tokens in the autoregressive process, points out the advantage of Lasso regression in selecting meaningful tokens, and shows that information is spread broadly across tokens rather than concentrated in a few.

This post is part of the LLM paper series and is a translation of "The Information of Large Language Model Geometry".

Abstract

This paper studies the information encoded in the embeddings of large language models (LLMs). We conduct simulations to analyze the representation entropy and discover a power law relationship with model size. Based on this observation, we propose a theory grounded in (conditional) entropy to elucidate the scaling-law phenomenon. Furthermore, we delve into the autoregressive structure of LLMs and use information theory and regression techniques to examine the relationship between the last token and the preceding context tokens. Specifically, we establish a theoretical connection between the information gain of a new token and ridge regression. In addition, we explore the effectiveness of Lasso regression in selecting meaningful tokens, which sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments and find that information is distributed across tokens rather than being concentrated only in specific "meaningful" tokens.
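The paper's entropy simulations are not reproduced in this post; as a rough illustration of the kind of analysis the abstract describes, the sketch below computes a Gaussian (log-determinant) proxy for the differential entropy of hidden states and fits a power law against model size in log-log space. The parameter counts, hidden widths, and random stand-in states are placeholders, not the paper's data.

```python
import numpy as np

def gaussian_entropy(states: np.ndarray) -> float:
    """Gaussian differential-entropy proxy for a set of hidden states:
    H = 0.5 * logdet(2*pi*e * Sigma), with a small ridge for numerical stability."""
    x = states - states.mean(axis=0, keepdims=True)
    d = x.shape[1]
    cov = x.T @ x / max(len(x) - 1, 1) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    return 0.5 * logdet

# Hypothetical (parameter count, hidden width) pairs -- placeholders only;
# a real run would feed hidden states exported from actual checkpoints.
rng = np.random.default_rng(0)
param_counts = np.array([1.3e8, 3.5e8, 1.3e9, 2.7e9, 6.7e9])
hidden_widths = [256, 384, 512, 640, 768]

entropies = np.array([
    gaussian_entropy(rng.standard_normal((4096, d))) for d in hidden_widths
])

# Power-law fit  H ~ a * N^b  <=>  log H = log a + b * log N
b, log_a = np.polyfit(np.log(param_counts), np.log(entropies), deg=1)
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3g}")
```

With real hidden states from a family of checkpoints, the same two-step recipe, estimate an entropy per model and then fit a line in log-log coordinates, gives the exponent of the reported power law.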

1 Introduction

2 Background

3 Entropy in LLMs

4 Information in the Autoregressive Process

5 Related Work

6 Conclusion

In this paper, we study the information encoded in large language model (LLM) embeddings. We first simulate the representation entropy and find that it follows a power law with respect to model size. We then provide a theory based on (conditional) entropy that explains this entropy scaling law. Since modern LLMs have an autoregressive structure, we use tools from information theory, Gaussian processes, and regression to study how the last token relates to the previously generated tokens. We find that the information gain of a new token is theoretically connected to ridge regression. Furthermore, motivated by the close relationship between Lasso regression and the attention mechanism, we find that Lasso regression can select meaningful tokens, and the tokens it selects are sometimes even more intuitive than those highlighted by the attention weights, which points to the role of the MLP layers in ...
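The abstract and conclusion connect the information gain of a new token to ridge regression and use Lasso to pick out meaningful context tokens. The sketch below is not the paper's estimator; with synthetic embeddings (the shapes, penalties, and planted sparsity pattern are all assumptions), it regresses a last-token hidden state on the preceding context-token states using scikit-learn's Ridge and Lasso, and reads the nonzero Lasso coefficients as the "selected" tokens one could compare against attention weights.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n_ctx, d = 32, 64                                  # context length and hidden size (assumed)
H_ctx = rng.standard_normal((n_ctx, d))            # hidden states of the context tokens
true_w = np.zeros(n_ctx)
true_w[[3, 17, 29]] = [1.0, -0.8, 0.6]             # only a few tokens matter in this toy setup
h_last = true_w @ H_ctx + 0.05 * rng.standard_normal(d)   # last-token hidden state

# Regress the last token on the context tokens: one coefficient per context token,
# one regression sample per hidden dimension.
X, y = H_ctx.T, h_last

ridge = Ridge(alpha=1.0).fit(X, y)                 # dense weights (information-gain view)
lasso = Lasso(alpha=0.05).fit(X, y)                # sparse weights (token selection)

selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print("Lasso-selected context tokens:", selected)
print("largest ridge weights at tokens:", np.argsort(-np.abs(ridge.coef_))[:5])
```

The L1 penalty zeroes out most coefficients, so the surviving indices give a sparse, interpretable token selection, whereas the ridge solution spreads weight over all context tokens.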

Method

3.1 Preliminary

Navigation task definition. The task of Vision-and-Language Navigation (VLN) in continuous environments is defined as follows. At timestep $t$, an embodied agent is provided with a natural language instruction $I$ of $l$ words and an ego-centric RGB video $O_t = \{x_0, \ldots, x_t\}$, where each frame $x_t \in \mathbb{R}^{3 \times H \times W}$. The agent's goal is to predict a low-level action $a_{t+1} \in \mathcal{A}$ for the subsequent step. The action space is defined as $\mathcal{A} = \{\text{Move Forward}, \text{Turn Left}, \text{Turn Right}, \text{Stop}\}$. Each low-level action corresponds to a fine-grained physical change: a small rotation (30°), a forward step (25 cm), or stop, which allows for flexible maneuverability in continuous spaces. Upon executing the action $a_{t+1}$, the agent receives a new observation $x_{t+1}$. This process iterates until the agent executes the Stop action at the target location specified by the instruction.

Visual geometry grounded transformer (VGGT). Building upon traditional 3D reconstruction, recent learning-based end-to-end methods (Wang et al., 2025b; Ding et al., 2025a) employ neural networks to encode scene priors, directly predicting 3D structures from multi-view images. VGGT (Wang et al., 2025b), which is based on a feed-forward transformer architecture, comprises three key components: an encoder for extracting single-image features, a fusion decoder for cross-frame interaction that generates geometric tokens $G_t \in \mathbb{R}^{\lfloor H/p \rfloor \times \lfloor W/p \rfloor \times C}$, where $p$ is the patch size, and a task-specific prediction head for 3D attributes. The reconstruction pipeline can be formulated as:

$\{G_t\}_{t=1}^{T} = \mathrm{Decoder}(\mathrm{Encoder}(\{x_t\}_{t=1}^{T})), \quad (P_t, C_t) = \mathrm{Head}(G_t), \quad (1)$

where a Multi-Layer Perceptron (MLP) head predicts a point map $P_t \in \mathbb{R}^{3 \times H \times W}$ and a per-pixel confidence map $C_t \in \mathbb{R}^{H \times W}$ from these geometric tokens. As our focus is on feature extraction, which embeds 3D geometry prior information, rather than on directly outputting 3D attributes, we leverage the encoder and the fusion decoder as our 3D visual geometry encoder.
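Equation (1) is essentially an interface: a sequence of frames goes in, per-frame geometric tokens come out, and a head maps tokens to a point map and a confidence map. The placeholder below only makes those shape contracts explicit; the dummy functions, image size, patch size, and channel width are assumptions for illustration, not VGGT's actual API.

```python
import numpy as np

H, W, p, C = 224, 224, 14, 128      # image size, patch size, token width -- assumed values

def encoder(frames):
    """Single-image feature extraction: {x_t} -> per-frame patch features."""
    return np.zeros((len(frames), H // p, W // p, C))

def fusion_decoder(feats):
    """Cross-frame interaction producing geometric tokens G_t (identity placeholder)."""
    return feats

def head(G_t):
    """Task-specific head: G_t -> point map P_t (3,H,W) and confidence map C_t (H,W)."""
    return np.zeros((3, H, W)), np.ones((H, W))

frames = [np.zeros((3, H, W)) for _ in range(4)]   # dummy RGB frames
G = fusion_decoder(encoder(frames))                # Eq. (1), reconstruction backbone
P_t, C_t = head(G[-1])                             # Eq. (1), prediction head
print(G.shape, P_t.shape, C_t.shape)               # (4, 16, 16, 128) (3, 224, 224) (224, 224)
```

JanusVLN keeps only the encoder and fusion decoder of this pipeline and drops the head, since it needs the geometry-aware tokens $G_t$ rather than the explicit 3D outputs.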
3.2 Dual Implicit Memory

The limitations of traditional explicit semantic memory, including memory inflation, computational redundancy, and the loss of spatial information, coupled with the original VGGT's requirement to reprocess the entire sequence for each new frame, impede the real-time performance and effectiveness of streaming navigation. To address these challenges, we introduce VGGT as a spatial geometry encoder and propose a novel dual implicit memory paradigm for VLN, shown in Figure 2. This paradigm models spatial geometry and visual semantics as fixed-size, compact neural representations by leveraging, respectively, the historical initial and sliding-window KV caches of the dual encoders. The spatial memory within the spatial geometry encoder is modeled as follows.

Implicit neural representation. In contrast to previous methods that store high-dimensional, unprocessed, explicit historical frames, we instead cache historical KV $M$ that have been deeply processed by neural networks. These KV, derived from the outputs of attention modules such as transformers, constitute high-level semantic abstractions and structured representations of the past environment. This implicit memory is not merely a compact, efficient storage entity, but a condensed knowledge representation refined by the neural networks. It enables the agent to retrieve and reason over information with minimal computational cost.

Figure 2: The framework of JanusVLN. Given an RGB-only video stream and navigation instructions, JanusVLN utilizes a dual encoder to separately extract visual-semantic and spatial-geometric features. It concurrently caches historical key-values from the initial and recent sliding windows into a dual implicit memory to facilitate feature reuse and prevent redundant computation. Finally, these two complementary features are fused and fed into the LLM to predict the next action.

Hybrid incremental update. For the implicit neural representation, we employ a hybrid cache update strategy instead of caching all historical KV. This approach mitigates the significant memory overhead and performance degradation that arise from extended navigation sequences. The strategy partitions the memory into two components. The first is a sliding-window queue $M_{\text{sliding}}$ with a capacity of $n$, which stores the KV caches of the most recent $n$ frames in a first-in, first-out manner. This mechanism ensures the model focuses on the most immediate and relevant contextual information, which is critical for real-time decision-making. When this queue reaches its capacity, the oldest frame's cache is evicted to accommodate the current frame, enabling dynamic incremental updates. The second component permanently retains the KV cache $M_{\text{initial}}$ from the initial few frames. The model exhibits sustained high attention weights towards these initial frames, which function as "attention sinks" (Xiao et al., 2024; Li et al., 2025c). These sinks provide critical global anchors for the entire navigation and effectively restore performance. By integrating these two mechanisms, we construct a dynamically updated, fixed-size implicit memory that preserves an acute perception of the recent environment while maintaining a long-term memory of earlier information. For each incoming new frame, we compute cross-attention between its image tokens and the implicit memory to directly retrieve historical information, thereby obviating the need for redundant feature extraction from past frames:

$G_t = \mathrm{Decoder}(\mathrm{CrossAttn}(\mathrm{Encoder}(x_t), \{M_{\text{initial}}, M_{\text{sliding}}\})). \quad (2)$
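The cache bookkeeping in the hybrid incremental update is easy to lose track of in prose, so here is a minimal sketch of the policy under simplifying assumptions (plain Python, each frame's KV treated as one opaque object, hypothetical sizes): the first few frames are kept for the whole episode as attention-sink anchors, and the most recent $n$ frames live in a FIFO window.

```python
from collections import deque

class DualImplicitMemory:
    """Fixed-size KV memory: permanent initial-frame cache + FIFO sliding window."""

    def __init__(self, num_initial: int = 4, window: int = 8):
        self.num_initial = num_initial
        self.initial = []                       # M_initial: kept for the whole episode
        self.sliding = deque(maxlen=window)     # M_sliding: oldest frame evicted first

    def update(self, frame_kv) -> None:
        """Insert the KV cache of the newest processed frame."""
        if len(self.initial) < self.num_initial:
            self.initial.append(frame_kv)       # attention-sink anchors
        else:
            self.sliding.append(frame_kv)       # deque(maxlen=...) handles eviction

    def retrieve(self):
        """Memory handed to the cross-attention over the incoming frame (Eq. 2)."""
        return list(self.initial) + list(self.sliding)

memory = DualImplicitMemory(num_initial=2, window=3)
for t in range(8):
    memory.update(f"kv_frame_{t}")              # stand-in for a real key/value tensor pair
print(memory.retrieve())
# ['kv_frame_0', 'kv_frame_1', 'kv_frame_5', 'kv_frame_6', 'kv_frame_7']
```

In the real system the retrieved entries are key/value tensors consumed by the cross-attention in Eq. (2); the point of the sketch is only the eviction policy that keeps the memory fixed-size.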
Figure 3: Inference time comparison for the current frame at varying sequence lengths.

As shown in Figure 3, VGGT's inference time grows exponentially with each new frame due to its need to reprocess the entire sequence, resulting in an out-of-memory error on a 48 GB GPU with only 48 frames. In contrast, our approach avoids reprocessing historical frames, so its inference time increases only marginally, demonstrating excellent efficiency. For the semantic encoder and the LLM, we similarly retain the KV from the initial and sliding windows. Moreover, these implicit memories and tokens can be visualized to inspect the spatial and semantic information they contain.

3.3 JanusVLN Architecture

Building upon the dual implicit memory paradigm, we propose JanusVLN (Figure 2), which enhances spatial understanding capabilities without requiring costly 3D data (e.g., depth).

Decoupling visual perception: semantics and spatiality. To equip embodied agents with the dual capabilities of semantic understanding ("what it is") and spatial awareness ("where it is and how it is related"), JanusVLN is proposed as a dual-encoder architecture that decouples semantic and spatial information from visual inputs. For the 2D semantic encoder, we adopt the original visual encoder from Qwen2.5-VL to interactively encode the input frame $x_t$ with the semantic memory into semantic tokens:

$S_t = \mathrm{Encoder}_{\text{sem}}(x_t), \quad S_t \in \mathbb{R}^{\lfloor H/p \rfloor \times \lfloor W/p \rfloor \times C}. \quad (3)$

Additionally, Qwen2.5-VL (Bai et al., 2025) groups spatially adjacent 2×2 patches into a single image token to reduce computational cost, yielding $S'_t \in \mathbb{R}^{\lfloor H/2p \rfloor \times \lfloor W/2p \rfloor \times C}$. For the 3D spatial-geometric encoder, we employ the pre-trained encoder and fusion decoder from the VGGT (Wang et al., 2025b) model to interactively encode the input frame with the spatial memory into spatial-geometric tokens $G_t$.

Spatial-aware feature fusion. Upon acquiring the semantic features $S'_t$ and spatial-geometric features $G_t$, we first employ the spatial merging strategy from Qwen2.5-VL (Bai et al., 2025). This strategy concatenates spatially adjacent 2×2 feature blocks within $G_t$ to form $G'_t \in \mathbb{R}^{\lfloor H/2p \rfloor \times \lfloor W/2p \rfloor \times C}$, thereby aligning its shape with that of $S'_t$. Subsequently, we utilize a lightweight two-layer MLP projection layer to fuse the semantic and spatial-geometric information:

$F_t = S'_t + \lambda \cdot \mathrm{MLP}(G'_t), \quad (4)$

where $\lambda$ is the weight for the spatial-geometric features, and $F_t$ denotes the final, spatially-geometrically enhanced visual features. Finally, these visual features, along with the text embedding of the instruction $I$, are fed into the backbone of the MLLM to generate the next action.
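To make Eq. (4) concrete, the sketch below applies a 2×2 spatial merge to the geometric tokens and fuses them with the (already merged) semantic tokens through a small two-layer MLP. It is a NumPy illustration under assumptions: random weights stand in for the trained projection, the grid size and channel width are invented, and the merged channel width is kept at 4C since the excerpt's notation leaves the post-merge width ambiguous.

```python
import numpy as np

rng = np.random.default_rng(0)
Hp, Wp, C, lam = 16, 16, 128, 0.5     # token grid and channel width after patchify (assumed)

def merge_2x2(tokens):
    """Concatenate each spatially adjacent 2x2 block of tokens along channels:
    (Hp, Wp, C) -> (Hp//2, Wp//2, 4*C), a Qwen2.5-VL-style spatial merge."""
    h, w, c = tokens.shape
    t = tokens.reshape(h // 2, 2, w // 2, 2, c)
    return t.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * c)

def two_layer_mlp(x, w1, w2):
    """Lightweight projection MLP for the geometric branch (ReLU in between)."""
    return np.maximum(x @ w1, 0.0) @ w2

S_merged = rng.standard_normal((Hp // 2, Wp // 2, 4 * C))    # S'_t (already merged)
G_t = rng.standard_normal((Hp, Wp, C))                        # spatial-geometric tokens
G_merged = merge_2x2(G_t)                                     # G'_t, same shape as S'_t

w1 = rng.standard_normal((4 * C, 4 * C)) * 0.02               # random stand-in weights
w2 = rng.standard_normal((4 * C, 4 * C)) * 0.02
F_t = S_merged + lam * two_layer_mlp(G_merged, w1, w2)        # Eq. (4)
print(F_t.shape)                                              # (8, 8, 512)
```

The fused features $F_t$ then go to the MLLM backbone together with the instruction embedding, as described above.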