The NeurIPS 2025 acceptance results are out! 自动驾驶之心 (Autonomous Driving Heart) has begun compiling the accepted papers, currently covering autonomous driving, visual perception and reasoning, large model training, embodied intelligence, reinforcement learning, video understanding, code generation, and more. Subsequent paper updates will be posted first to the 『自动驾驶之心知识星球』 (Knowledge Planet) community.
Autonomous Driving
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
paper: https://arxiv.org/abs/2505.17685
code: https://miv-xjtu.github.io/FSDrive.github.io/
Affiliation: Alibaba, Xi'an Jiaotong University

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
paper: https://arxiv.org/abs/2506.13757
code: https://github.com/ucla-mobility/AutoVLA
Affiliation: UCLA

Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting
paper: https://arxiv.org/abs/2506.05280
code: https://github.com/BigCiLeng/bilateral-driving
Affiliation: Tsinghua AIR, Beihang University, et al.

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
paper: https://arxiv.org/abs/2411.13112
code: https://github.com/XiandaGuo/Drive-MLLM
Affiliation: Chinese Academy of Sciences, et al.

YOLOv12: Attention-Centric Real-Time Object Detectors
paper: https://arxiv.org/abs/2502.12524
code: https://github.com/sunsmarterjie/yolov12
Affiliation: University at Buffalo, Chinese Academy of Sciences, et al.

Visual Perception & Reasoning
OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
paper: https://arxiv.org/pdf/2509.15096
code: https://github.com/VCIP-RGBD/DFormer
Affiliation: Nankai University (Ming-Ming Cheng's group)

PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
paper: https://arxiv.org/pdf/2509.02807
code: https://github.com/MSiam/PixFoundation-2.0.git

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
paper: https://arxiv.org/pdf/2507.16815
code: https://jasper0314-huang.github.io/thinkact-vla/
Affiliation: NVIDIA, National Taiwan University

DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding
paper: https://arxiv.org/pdf/2506.10084

Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

Video Understanding
PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
paper: https://arxiv.org/pdf/2509.02807
code: https://github.com/MSiam/PixFoundation-2.0.git

Image/Video Generation & Editing
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188
code: https://github.com/ybseo-ac/Conv

AutoEdit: Automatic Hyperparameter Tuning for Image Editing
paper: https://arxiv.org/pdf/2509.15031

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
paper: https://arxiv.org/pdf/2505.21448
code: https://ziqiaopeng.github.io/OmniSync/

Datasets & Evaluation
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

3D Vision
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

Large Model Training
Scaling Offline RL via Efficient and Expressive Shortcut Models
paper: https://arxiv.org/pdf/2505.22866
Affiliation: Cornell University

Reinforcement Learning / Preference Optimization
LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning
paper: https://arxiv.org/pdf/2507.15521

Scaling Offline RL via Efficient and Expressive Shortcut Models
paper: https://arxiv.org/pdf/2505.22866

Large Model Fine-Tuning
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188

Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
paper: https://arxiv.org/pdf/2509.15087

Differentially Private Federated Low Rank Adaptation Beyond Fixed-Matrix
paper: https://arxiv.org/pdf/2507.09990

Embodied Intelligence
Self-Improving Embodied Foundation Models
paper: https://arxiv.org/pdf/2509.15155
code: https://self-improving-efms.github.io
Affiliation: DeepMind

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
paper: https://arxiv.org/pdf/2505.22159
code: https://sites.google.com/view/forcevla2025
Affiliation: Fudan University, Shanghai Jiao Tong University, et al.

Continual Learning / Model Hallucination
Improving Multimodal Large Language Models Using Continual Learning
paper: https://arxiv.org/pdf/2410.19925
code: https://shikhar-srivastava.github.io/cl-for-improving-mllms

Human-Centric
Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration
paper: https://arxiv.org/pdf/2508.19254

Large Model Safety
Safely Learning Controlled Stochastic Dynamics
paper: https://arxiv.org/pdf/2506.02754

Interpretability
Concept-Level Explainability for Auditing & Steering LLM Responses
paper: https://arxiv.org/pdf/2505.07610

Document Understanding
STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing
paper: https://arxiv.org/pdf/2411.00387
code: https://github.com/jiaruzouu/STEM-PoM

Medical
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

Agent
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
paper: https://arxiv.org/pdf/2506.04018

Mixture-of-Experts Models
Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
paper: https://arxiv.org/pdf/2509.15087

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
paper: https://arxiv.org/pdf/2505.22159
code: https://sites.google.com/view/forcevla2025

Code Generation
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188

SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance
paper: https://arxiv.org/pdf/2502.16666

Large Model Reasoning
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
paper: https://arxiv.org/pdf/2507.16815
code: https://jasper0314-huang.github.io/thinkact-vla/

LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning
paper: https://arxiv.org/pdf/2507.15521

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing
paper: https://arxiv.org/pdf/2411.00387
code: https://github.com/jiaruzouu/STEM-PoM

Diffusion Models
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
paper: https://arxiv.org/pdf/2505.21448

Other Areas
Out-of-distribution generalisation is hard: evidence from ARC-like tasks
paper: https://arxiv.org/pdf/2505.09716

Fair Summarization: Bridging Quality and Diversity in Extractive Summaries
paper: https://arxiv.org/pdf/2411.07521
