The NeurIPS 2025 acceptance results are out! 自动驾驶之心 (Autonomous Driving Heart) has begun compiling the accepted papers, currently covering autonomous driving, visual perception and reasoning, large model training, embodied intelligence, reinforcement learning, video understanding, code generation, and more. Subsequent paper updates will be posted first to the 『自动驾驶之心知识星球』 (Knowledge Planet) community.
Autonomous Driving
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
paper: https://arxiv.org/abs/2505.17685
code: https://miv-xjtu.github.io/FSDrive.github.io/
Affiliation: Alibaba, Xi'an Jiaotong University

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
paper: https://arxiv.org/abs/2506.13757
code: https://github.com/ucla-mobility/AutoVLA
Affiliation: UCLA

Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting
paper: https://arxiv.org/abs/2506.05280
code: https://github.com/BigCiLeng/bilateral-driving
Affiliation: Tsinghua AIR, Beihang University, et al.

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
paper: https://arxiv.org/abs/2411.13112
code: https://github.com/XiandaGuo/Drive-MLLM
Affiliation: Chinese Academy of Sciences, et al.

YOLOv12: Attention-Centric Real-Time Object Detectors
paper: https://arxiv.org/abs/2502.12524
code: https://github.com/sunsmarterjie/yolov12
Affiliation: University at Buffalo, Chinese Academy of Sciences, et al.

Visual Perception & Reasoning
OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
paper: https://arxiv.org/pdf/2509.15096
code: https://github.com/VCIP-RGBD/DFormer
Affiliation: Nankai University (Ming-Ming Cheng's group)

PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
paper: https://arxiv.org/pdf/2509.02807
code: https://github.com/MSiam/PixFoundation-2.0.git

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
paper: https://arxiv.org/pdf/2507.16815
code: https://jasper0314-huang.github.io/thinkact-vla/
Affiliation: NVIDIA, National Taiwan University

DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding
paper: https://arxiv.org/pdf/2506.10084

Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

Video Understanding
PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
paper: https://arxiv.org/pdf/2509.02807
code: https://github.com/MSiam/PixFoundation-2.0.git

Image/Video Generation & Editing
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188
code: https://github.com/ybseo-ac/Conv

AutoEdit: Automatic Hyperparameter Tuning for Image Editing
paper: https://arxiv.org/pdf/2509.15031

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
paper: https://arxiv.org/pdf/2505.21448
code: https://ziqiaopeng.github.io/OmniSync/

Datasets & Evaluation
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

3D Vision
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

Large Model Training
Scaling Offline RL via Efficient and Expressive Shortcut Models
paper: https://arxiv.org/pdf/2505.22866
Affiliation: Cornell University

Reinforcement Learning / Preference Optimization
LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning
paper: https://arxiv.org/pdf/2507.15521

Scaling Offline RL via Efficient and Expressive Shortcut Models
paper: https://arxiv.org/pdf/2505.22866

Large Model Fine-Tuning
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188

Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
paper: https://arxiv.org/pdf/2509.15087

Differentially Private Federated Low Rank Adaptation Beyond Fixed-Matrix
paper: https://arxiv.org/pdf/2507.09990

Embodied Intelligence
Self-Improving Embodied Foundation Models
paper: https://arxiv.org/pdf/2509.15155
code: https://self-improving-efms.github.io
Affiliation: DeepMind

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
paper: https://arxiv.org/pdf/2505.22159
code: https://sites.google.com/view/forcevla2025
Affiliation: Fudan University, Shanghai Jiao Tong University, et al.

Continual Learning / Model Hallucination
Improving Multimodal Large Language Models Using Continual Learning
paper: https://arxiv.org/pdf/2410.19925
code: https://shikhar-srivastava.github.io/cl-for-improving-mllms

Human-Centric
Real-Time Intuitive AI Drawing System for Collaboration: Enhancing Human Creativity through Formal and Contextual Intent Integration
paper: https://arxiv.org/pdf/2508.19254

Large Model Safety
Safely Learning Controlled Stochastic Dynamics
paper: https://arxiv.org/pdf/2506.02754

Interpretability
Concept-Level Explainability for Auditing & Steering LLM Responses
paper: https://arxiv.org/pdf/2505.07610

Document Understanding
STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing
paper: https://arxiv.org/pdf/2411.00387
code: https://github.com/jiaruzouu/STEM-PoM

Medical
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
paper: https://arxiv.org/pdf/2505.18915
code: https://github.com/Schuture/DeepTumorVQA

Agent
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
paper: https://arxiv.org/pdf/2506.04018

Mixture-of-Experts Models
Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
paper: https://arxiv.org/pdf/2509.15087

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
paper: https://arxiv.org/pdf/2505.22159
code: https://sites.google.com/view/forcevla2025

Code Generation
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188

SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance
paper: https://arxiv.org/pdf/2502.16666

Large Model Reasoning
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
paper: https://arxiv.org/pdf/2507.16815
code: https://jasper0314-huang.github.io/thinkact-vla/

LLM world models are mental: Output layer evidence of brittle world model use in LLM mechanical reasoning
paper: https://arxiv.org/pdf/2507.15521

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing
paper: https://arxiv.org/pdf/2411.00387
code: https://github.com/jiaruzouu/STEM-PoM

Diffusion Models
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
paper: https://arxiv.org/pdf/2509.15188

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
paper: https://arxiv.org/pdf/2505.21448

Other Areas
Out-of-distribution generalisation is hard: evidence from ARC-like tasks
paper: https://arxiv.org/pdf/2505.09716

Fair Summarization: Bridging Quality and Diversity in Extractive Summaries
paper: https://arxiv.org/pdf/2411.07521
