Resource Roundup | VLMs, World Models, and End-to-End Driving

Latest recommended article published on 2025-07-07 13:05:52

自动驾驶之心

License: CC 4.0 BY-SA

Original link: https://mp.weixin.qq.com/s?__biz=Mzg2NzUxNTU1OA==&mid=2247670291&idx=3&sn=987f32e3ba0d66b51a110bec786ebe39&chksm=cf45fb9659aa6b901f8af4458b62aae78a4fbbb9d55f792a4004c293876fc9baff2a802e94a3&scene=126&sessionid=0

Author | qian  Editor | 自动驾驶之心

Original link: https://zhuanlan.zhihu.com/p/1922228114404143784

This article is shared for academic purposes only; if it infringes any rights, contact us and it will be removed.

Vision-Language Models

Survey Collections

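LLMs in intelligent transportation and autonomous driving: https://github.com/ge25nab/Awesome-VLM-AD-ITS
AIGC and LLMs: https://github.com/coderonion/awesome-llm-and-aigc
Survey of vision-language models: https://github.com/jingyi0000/VLM_survey
Prompt and adapter learning methods for vision-language models such as CLIP: https://github.com/zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs
List of LLM/VLM inference papers, with code: https://github.com/DefTruth/Awesome-LLM-Inference
Reading list on the safety, security, and privacy of large models (including Awesome LLM security, safety, etc.): https://github.com/ThuCCSLab/Awesome-LM-SSP
Knowledge base on single-/multi-agent systems, robotics, llm/vlm/mla, scientific discovery, and more: https://github.com/weleen/awesome-agent
Curated list of papers on Embodied AI and related research/industry-driven resources: https://github.com/haoranD/Awesome-Embodied-AI
Curated list of inference strategies and algorithms that improve the performance of vision-language models (VLMs): https://github.com/Patchwork53/awesome-vlm-inference-strategies
Well-known vision-language models and their architectures: https://github.com/gokayfem/awesome-vlm-architectures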

Foundations

Pre-training

[arxiv 2024] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness
[CVPR 2024] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
[CVPR 2024] Do Vision and Language Encoders Represent the World Similarly?
[CVPR 2024] Efficient Vision-Language Pre-training by Cluster Masking
[CVPR 2024] Non-autoregressive Sequence-to-Sequence Vision-Language Models
[CVPR 2024] ViTamin: Designing Scalable Vision Models in the Vision-Language Era
[CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models
[CVPR 2024] FairCLIP: Harnessing Fairness in Vision-Language Learning
[CVPR 2024] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
[CVPR 2024] VILA: On Pre-training for Visual Language Models
[CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection
[CVPR 2024] Enhancing Vision-Language Pre-training with Rich Supervisions
[ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
[ICLR 2024] MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
[ICLR 2024] Retrieval-Enhanced Contrastive Vision-Text Models

Transfer Learning Methods

[NeurIPS 2024] Historical Test-time Prompt Tuning for Vision Foundation Models
[NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
[IJCV 2024] Progressive Visual Prompt Learning with Contrastive Feature Re-formation
[ECCV 2024] CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts
[ECCV 2024] FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
[ECCV 2024] GalLoP: Learning Global and Local Prompts for Vision-Language Models
[ECCV 2024] Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models
[CVPR 2024] Towards Better Vision-Inspired Vision-Language Models
[CVPR 2024] One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
[CVPR 2024] Any-Shift Prompting for Generalization over Distributions
[CVPR 2024] A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
[CVPR 2024] Anchor-based Robust Finetuning of Vision-Language Models
[CVPR 2024] Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
[CVPR 2024] Visual In-Context Prompting
[CVPR 2024] TCP: Textual-based Class-aware Prompt Tuning for Visual-Language Model
[CVPR 2024] Efficient Test-Time Adaptation of Vision-Language Models
[CVPR 2024] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

Knowledge Distillation (Detection, Segmentation, and Multi-task)

[NeurIPS 2024] Open-Vocabulary Object Detection via Language Hierarchy
[CVPR 2024] RegionGPT: Towards Region Understanding Vision Language Model
[ICLR 2024] LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors
[ICLR 2024] Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction
[ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
[ICLR 2024] FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
[ICLR 2024] AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
[CVPR 2023] EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata

World Models

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
HERMES: a unified driving world model that seamlessly integrates 3D scene understanding and future scene evolution (generation)
A Survey of World Models for Autonomous Driving
New in 2025: a comprehensive survey of world models in autonomous driving
DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT
Diffusion World Model
Princeton University proposes the Diffusion World Model
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
DrivingGPT: unifying driving world modeling and planning
Physical Informed Driving World Model
Latest SOTA in driving video generation quality: DrivePhysica, a novel driving world model that conforms to physical principles
Understanding World or Predicting Future? A Comprehensive Survey of World Models
Understanding the world or predicting the future? A comprehensive survey of world models
Navigation World Models
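New research from Meta: the Navigation World Model (NWM), a controllable video generation model that predicts future visual observations from past observations and navigation actions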
InfinityDrive: Breaking Time Limits in Driving World Models
InfinityDrive: presented as the first driving world model with exceptional generalization ability
Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey
Summary: a survey exploring the interplay between video generation and world models in autonomous driving
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
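The first method to use video generation models to improve 4D reconstruction of driving scenes. DriveDreamer4D uses world-model priors to enhance 4D driving scene representation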
Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving
Driving in the occupancy world: vision-centric 4D occupancy forecasting and planning via world models for autonomous driving
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
Vista: a generalizable driving world model with high fidelity and versatile controllability
Probing Multimodal LLMs as World Models for Driving
Probing multimodal LLMs as world models for driving
DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving
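Across-the-board performance gains on autonomous driving tasks. DriveWorld: 4D pre-trained scene understanding via world models for autonomous driving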
Prospective Role of Foundation Models in Advancing Autonomous Vehicles
Applications and trends of large-scale foundation models in autonomous driving
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
DriveDreamer-2: the first world model able to generate customized driving videos
World Models for Autonomous Driving: An Initial Survey
World models in autonomous driving

Diffusion Models

Survey Collections

A collection of resources and papers on diffusion models
https://github.com/diff-usion/Awesome-Diffusion-Models
A list of recent diffusion models for video generation, editing, restoration, understanding, and more
https://github.com/showlab/Awesome-Video-Diffusion
A survey of diffusion-based image processing, including restoration, enhancement, coding, and quality assessment
https://github.com/lixinustc/Awesome-diffusion-model-for-image-processing
A collection of graph diffusion generation work, including papers, code, and datasets
https://github.com/yuntaoshou/Graph-Diffusion-Models-A-Comprehensive-Survey-of-Methods-and-Applications
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices [Paper]
Diffusion Models in 3D Vision: A Survey [Paper]
Conditional Image Synthesis with Diffusion Models: A Survey [Paper]
Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey [Paper]
A Survey on Diffusion Models for Recommender Systems [Paper]
Diffusion-Based Visual Art Creation: A Survey and New Perspectives [Paper]
Replication in Visual Diffusion Models: A Survey and Outlook [Paper]
Diffusion Model-Based Video Editing: A Survey [Paper]
Diffusion Models and Representation Learning: A Survey [Paper]
A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [Paper]
Diffusion Models in Low-Level Vision: A Survey [Paper]
Video Diffusion Models: A Survey [Paper]
A Survey on Diffusion Models for Time Series and Spatio-Temporal Data [Paper]
Controllable Generation with Text-to-Image Diffusion Models: A Survey [Paper]
Diffusion Model-Based Image Editing: A Survey [Paper]
Diffusion Models, Image Super-Resolution And Everything: A Survey [Paper]
A Survey on Video Diffusion Models [Paper]
A Survey of Diffusion Models in Natural Language Processing [Paper]

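End-to-End Autonomous Driving

This section mainly covers collections of end-to-end autonomous driving research papers that continuously track the latest E2E driving updates.

Link 1: https://github.com/opendilab/awesome-end-to-end-autonomous-driving#Overview-of-End-to-End-Driving-Method
Link 2: https://github.com/Pranav-chib/Recent-Advancements-in-End-to-End-Autonomous-Driving-using-Deep-Learning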
[CVPR 2024] Foundation Models for Autonomous Systems
[CVPR 2023] Workshop on End-to-end Autonomous Driving
[CVPR 2023] End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation
[ICRA 2023] Scalable Autonomous Driving
[NeurIPS 2022] Machine Learning for Autonomous Driving
[IROS 2022] Behavior-driven Autonomous Driving in Unstructured Environments
[ICRA 2022] Fresh Perspectives on the Future of Autonomous Driving Workshop
[NeurIPS 2021] Machine Learning for Autonomous Driving
[NeurIPS 2020] Machine Learning for Autonomous Driving
[CVPR 2020] Workshop on Scalability in Autonomous Driving
