Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning



Summary of the Paper

This paper examines whether the mathematical reasoning ability of large language models (LLMs) transfers to other domains, in order to determine whether gains on math tasks reflect general problem-solving ability rather than task-specific overfitting.

  1. Background: LLM performance on math reasoning benchmarks (e.g., MATH, AIME) has improved rapidly in recent years, in some cases surpassing human-level performance, but whether these gains transfer to other domains remains unclear.
  2. Experimental design:
    • Evaluated more than 20 open-source reasoning-tuned models across math reasoning, scientific QA, agent planning, coding, and instruction-following tasks.
    • Proposed a "Transferability Index" to quantify how well a model's gains in the math domain transfer to other reasoning tasks and to non-reasoning tasks (a minimal sketch of such a metric is given after this list).
    • Ran controlled experiments with Qwen3-14B as the base model, comparing reinforcement learning (RL) and supervised fine-tuning (SFT) on math-only data.
  3. Key findings:
    • Most models that excel at math fail to transfer those gains to other domains.
    • RL-tuned models generalize better across domains (both reasoning and non-reasoning tasks), whereas SFT-tuned models often exhibit catastrophic forgetting and lose general capabilities.
    • Mechanistic analysis shows that SFT causes significant drift in the model's latent-space representations and output distributions, whereas RL better preserves the structural stability of general-domain representations.
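To make the Transferability Index referenced above concrete, here is a minimal Python sketch. It assumes a simple definition — the average relative gain on a group of target tasks divided by the average relative gain on math — which may differ from the paper's exact formula; the benchmark names and scores are placeholders, not reported results.

```python
# Hypothetical sketch of a transferability index: average relative gain on a
# target task group normalized by the average relative gain on math tasks.
# The definition and all numbers are illustrative, not taken from the paper.

def relative_gain(tuned: float, base: float) -> float:
    """Relative improvement of the tuned model over the base model."""
    return (tuned - base) / base

def transferability_index(base: dict, tuned: dict,
                          math_tasks: list, target_tasks: list) -> float:
    """Ratio of mean relative gain on target tasks to mean relative gain on math."""
    math_gain = sum(relative_gain(tuned[t], base[t]) for t in math_tasks) / len(math_tasks)
    target_gain = sum(relative_gain(tuned[t], base[t]) for t in target_tasks) / len(target_tasks)
    return target_gain / math_gain if math_gain != 0 else float("nan")

# Placeholder accuracies for a base model and a math-tuned checkpoint.
base_scores  = {"MATH": 0.55, "AIME": 0.10, "coding": 0.48, "instruction_following": 0.72}
tuned_scores = {"MATH": 0.78, "AIME": 0.30, "coding": 0.50, "instruction_following": 0.60}

ti = transferability_index(
    base_scores, tuned_scores,
    math_tasks=["MATH", "AIME"],
    target_tasks=["coding", "instruction_following"],
)
print(f"Transferability index to non-math tasks: {ti:.2f}")  # negative => transfer failed
```

A negative or near-zero index corresponds to the observation that many math-tuned models improve on MATH/AIME while regressing on other tasks.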
StreamVLN generates action outputs from continuous video input in an online, multi-turn dialogue manner. Built on LLaVA-Video [2] as the foundational Video-LLM, we extend it for interleaved vision, language, and action modeling. The overall framework of StreamVLN is shown in Figure 1. We briefly introduce the autoregressive generation in continuous multi-turn dialogues for a streaming VLN process (Section 3.1). For both effective context modeling of long sequences and efficient computation for real-time interaction, StreamVLN has: (1) a fast-streaming dialogue context with a sliding-window KV cache (Section 3.2); and (2) a slow-updating memory via token pruning (Section 3.3). Finally, we describe how we curate the navigation data and incorporate diverse multimodal data for multi-task training (Section 3.4).

Figure 1: Framework of StreamVLN. The input consists of a language instruction and a stream of RGB images. Each navigation episode is framed as a multi-turn dialogue, where the agent continually queries for the next actions. To support long-horizon reasoning while maintaining a manageable context size and low latency, we adopt a fixed-size sliding window to retain recent dialogue history. The context in inactive windows is updated by token pruning to reduce memory overhead.

3.1 Preliminary: Continuous Multi-Turn Autoregressive Generation

A multi-turn dialogue session for VLN consists of a sequence of interleaved observations and actions. In each dialogue d_i = (o_i, a_i), the VLN model receives a new observation o_i and produces an action response a_i conditioned on both the current input and the dialogue history. The full input sequence at step i is constructed as o_1 a_1 o_2 a_2 ... o_{i-1} a_{i-1}. In this streaming setting, new tokens from o_i are appended to the token stream continuously, and the response a_i is generated token by token via autoregressive decoding. For each dialogue turn, Transformer-based LLMs first perform a prefill phase to encode the input tokens, caching their key/value (KV) states in the attention layers. These cached KV pairs are then used in the decoding phase to generate new tokens. Without a KV cache shared across turns, the model would have to repeat this prefill over all previous tokens for every new dialogue turn.

3.2 Fast-Streaming Dialogue Context

While multi-turn KV-cache reuse can eliminate over 99% of prefill time, it introduces substantial memory overhead. As the number of dialogues increases, the KV cache grows linearly (e.g., 2K tokens can consume around 5 GB of memory), making long sessions impractical. In addition, existing Video-LLMs tend to exhibit degraded reasoning performance when processing overly long contexts.

To manage dialogue context, we adopt a sliding-window KV cache over continuous dialogues, retaining a fixed number N of recent dialogues in an active window: W_j = [o_{i-N+1} a_{i-N+1} ... o_i a_i]. When the window reaches capacity, the key/value states are offloaded from the LLM, and the states of non-observation dialogue tokens, such as prompts and generated actions, are immediately discarded. For the new sliding window, the token states from past windows are processed into memory token states {M_0, ..., M_j} (as detailed in Section 3.3). Formally, for the latest observation o_i, the decoder generates a_i based on the cached memory token states and the current window's KV cache:

a_i^{W_{j+1}} = Decoder(o_i, {M_0, ..., M_j}, {k_{i-N+1} v_{i-N+1}, ..., k_{i-1} v_{i-1}})
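The sliding-window bookkeeping described in Section 3.2 can be sketched as follows. This is a schematic illustration only, not StreamVLN's actual implementation: per-turn KV states are stood in by plain Python lists, and the prune_to_memory helper is a placeholder for the token pruning of Section 3.3.

```python
# Schematic sketch of the sliding-window dialogue KV cache of Section 3.2.
# Data structures are simplified stand-ins, not the real StreamVLN code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TurnKV:
    observation_kv: List[float]      # cached K/V states of observation tokens
    non_observation_kv: List[float]  # prompt tokens + generated action tokens

@dataclass
class SlidingWindowCache:
    window_size: int                                   # N dialogues per window
    active: List[TurnKV] = field(default_factory=list)
    memory_tokens: List[List[float]] = field(default_factory=list)  # {M_0, ..., M_j}

    def add_turn(self, turn: TurnKV) -> None:
        self.active.append(turn)
        if len(self.active) == self.window_size:
            self._offload_window()

    def _offload_window(self) -> None:
        # When the active window W_j is full, its KV states are offloaded:
        # non-observation states (prompts, actions) are discarded and the
        # observation states are pruned into a memory block M_j.
        observation_states = [s for t in self.active for s in t.observation_kv]
        self.memory_tokens.append(self.prune_to_memory(observation_states))
        self.active.clear()

    def prune_to_memory(self, observation_kv: List[float]) -> List[float]:
        # Placeholder for the temporal sampling + voxel-based spatial pruning
        # of Section 3.3; keeping every 4th state here is only a toy stand-in.
        return observation_kv[::4]

    def decoder_context(self, current_obs_kv: List[float]) -> dict:
        # Everything the decoder conditions on when generating a_i:
        # memory tokens from past windows plus the active window's KV cache.
        return {
            "memory": self.memory_tokens,
            "window_kv": [t.observation_kv + t.non_observation_kv for t in self.active],
            "current_observation": current_obs_kv,
        }
```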
3.3 Slow-Updating Memory Context

Balancing temporal resolution and fine-grained spatial perception within a limited context length remains a key challenge for Video-LLMs. Rather than compressing video tokens at the feature level (e.g., through average pooling), which hinders the reuse of the KV cache from previous dialogues, we retain high image resolution while selectively discarding spatially and temporally redundant tokens. We find that this approach better preserves the transferability of Video-LLMs.

To reduce temporal redundancy, we adopt a simple fixed-number sampling strategy following [5], as varying lengths of memory tokens may induce a temporal duration bias, reducing the model's robustness across different planning horizons. To further eliminate spatial redundancy across frames, we design a voxel-based spatial pruning strategy. Specifically, we back-project the 2D image patches from the video stream into a shared 3D space using depth information. By discretizing this 3D space into uniform voxels, we can track the voxel indices of the patch tokens over time. If multiple tokens from different frames within a given duration are projected into the same voxel, only the token from the most recent observation is retained, as detailed in Algorithm 1. The voxel pruning mask M is then used to select the preserved token states.

Algorithm 1: Voxel-Based Spatial Pruning
Input: voxel map V ∈ Z^(T×H×W), stride K, threshold θ
Output: pruning mask M ∈ {0,1}^(T×H×W)
1. Initialize M ← 0 and an empty map latest ← ∅
2. For each token (t, x, y) with V[t, x, y] ≥ 0:
   a. p ← ⌊t / K⌋, v ← V[t, x, y]
   b. If (p, v) is not in latest, or t is newer than the stored entry, set latest[(p, v)] ← (t, x, y)
3. Set M[t, x, y] ← 1 for all (t, x, y) in latest
4. For each t, if Σ_{x,y} M[t, x, y] < θ · H · W, set M[t, :, :] ← 0
5. Return M

3.4 Co-Training with Multi-Source Data

Vision-Language Action Data. We collect navigation-specific training data using the Habitat simulator across multiple public VLN datasets. Specifically, we collect 450K samples (video clips) from 60 Matterport3D [25] (MP3D) environments, sourced from R2R [7], R2R-EnvDrop [26], and RxR [8]. To further improve generalization through increased scene diversity, we incorporate an additional 300K samples from a subset of ScaleVLN [19], spanning 700 Habitat-Matterport3D [27] (HM3D) scenes. In addition, we adopt the DAgger [28] algorithm to enhance the model's robustness and generalization ability in novel scenes and during error recovery. Using Habitat's shortest-path follower as the expert policy, we collect corrective demonstrations on model rollouts after the initial training stage. These DAgger-collected samples (240K) are then incorporated into the training set for co-training.

Figure 2: Co-Training Data Recipe of StreamVLN — VLA data 67% (MP3D 31%, HM3D 20%, DAgger 16%); general multimodal data 33% (VQA 17%, MMC4 16%).

General Vision-Language Data. To retain the general reasoning capabilities of the pretrained Video-LLM, we incorporate a diverse set of multimodal training data that complements navigation supervision. Specifically, we include 248K video-based visual question-answering (VQA) samples sourced from the publicly available datasets LLaVA-Video-178K [29] and ScanQA [30], which combine general video QA with 3D scene understanding to support spatial-temporal and geometric reasoning. To further augment the model's capacity for multi-turn vision-language interactions, we incorporate 230K interleaved image-text samples from MMC4 [31], which strengthens its ability to parse and generate contextually coherent responses with interleaved visual and textual reasoning.
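As a concrete illustration of the voxel-based spatial pruning in Algorithm 1 above, here is a minimal NumPy transcription. It follows the reconstructed pseudocode directly; the function name, the assumption that invalid projections are marked with negative voxel indices, and the toy example at the end are mine.

```python
import numpy as np

def voxel_spatial_pruning(voxel_map: np.ndarray, stride: int, threshold: float) -> np.ndarray:
    """Minimal transcription of Algorithm 1 (voxel-based spatial pruning).

    voxel_map: int array of shape (T, H, W); entry (t, x, y) is the voxel index
               of that patch token, or a negative value if the token has no
               valid 3D projection (assumption).
    stride:    temporal stride K grouping frames into durations.
    threshold: fraction theta; frames keeping fewer than theta*H*W tokens are dropped.
    Returns a {0, 1} mask of shape (T, H, W) over retained tokens.
    """
    T, H, W = voxel_map.shape
    mask = np.zeros((T, H, W), dtype=np.uint8)
    latest = {}  # (duration index, voxel id) -> (t, x, y) of the newest token

    for t in range(T):
        p = t // stride
        for x in range(H):
            for y in range(W):
                v = int(voxel_map[t, x, y])
                if v < 0:
                    continue
                key = (p, v)
                # Within one duration, keep only the most recent token per voxel.
                if key not in latest or t > latest[key][0]:
                    latest[key] = (t, x, y)

    for (t, x, y) in latest.values():
        mask[t, x, y] = 1

    # Drop frames that retain too few tokens to be informative.
    for t in range(T):
        if mask[t].sum() < threshold * H * W:
            mask[t] = 0
    return mask

# Toy example: 4 frames of 2x3 patch tokens, stride K=2, threshold 0.2.
rng = np.random.default_rng(0)
toy_voxels = rng.integers(0, 4, size=(4, 2, 3))
print(voxel_spatial_pruning(toy_voxels, stride=2, threshold=0.2))
```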