Resource Roundup | VLMs, World Models, End-to-End Driving

Author | qian    Editor | 自动驾驶之心

Original post: https://zhuanlan.zhihu.com/p/1922228114404143784


Vision-Language Models

Survey Roundup

  • LLMs for intelligent transportation and autonomous driving: https://github.com/ge25nab/Awesome-VLM-AD-ITS

  • AIGC and LLMs: https://github.com/coderonion/awesome-llm-and-aigc

  • Vision-language model survey: https://github.com/jingyi0000/VLM_survey

  • Prompt / adapter learning methods for CLIP-style vision-language models: https://github.com/zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs

  • LLM/VLM inference papers, with code: https://github.com/DefTruth/Awesome-LLM-Inference

  • Reading list on the safety, security, and privacy of large models (covering LLM security, safety, and more): https://github.com/ThuCCSLab/Awesome-LM-SSP

  • Knowledge base on single/multi-agent systems, robotics, LLM/VLM/MLA, scientific discovery, and more: https://github.com/weleen/awesome-agent

  • Curated paper list on embodied AI and related research / industry-driven resources: https://github.com/haoranD/Awesome-Embodied-AI

  • Curated list of inference strategies and algorithms that improve vision-language model (VLM) performance: https://github.com/Patchwork53/awesome-vlm-inference-strategies

  • Notable vision-language models and their architectures: https://github.com/gokayfem/awesome-vlm-architectures

Foundations

Pre-training
  • [arXiv 2024] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

  • [CVPR 2024] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

  • [CVPR 2024] Do Vision and Language Encoders Represent the World Similarly?

  • [CVPR 2024] Efficient Vision-Language Pre-training by Cluster Masking

  • [CVPR 2024] Non-autoregressive Sequence-to-Sequence Vision-Language Models

  • [CVPR 2024] ViTamin: Designing Scalable Vision Models in the Vision-Language Era

  • [CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models

  • [CVPR 2024] FairCLIP: Harnessing Fairness in Vision-Language Learning

  • [CVPR 2024] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

  • [CVPR 2024] VILA: On Pre-training for Visual Language Models

  • [CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection

  • [CVPR 2024] Enhancing Vision-Language Pre-training with Rich Supervisions

  • [ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

  • [ICLR 2024] MMICL: Empowering Vision-language Model with Multi-Modal in-Context Learning

  • [ICLR 2024] Retrieval-Enhanced Contrastive Vision-Text Models

Transfer Learning Methods
  • [NeurIPS 2024] Historical Test-time Prompt Tuning for Vision Foundation Models

  • [NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

  • [IJCV 2024] Progressive Visual Prompt Learning with Contrastive Feature Re-formation

  • [ECCV 2024] CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

  • [ECCV 2024] FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance

  • [ECCV 2024] GalLoP: Learning Global and Local Prompts for Vision-Language Models

  • [ECCV 2024] Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

  • [CVPR 2024] Towards Better Vision-Inspired Vision-Language Models

  • [CVPR 2024] One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models

  • [CVPR 2024] Any-Shift Prompting for Generalization over Distributions

  • [CVPR 2024] A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

  • [CVPR 2024] Anchor-based Robust Finetuning of Vision-Language Models

  • [CVPR 2024] Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

  • [CVPR 2024] Visual In-Context Prompting

  • [CVPR 2024] TCP: Textual-based Class-aware Prompt Tuning for Visual-Language Model

  • [CVPR 2024] Efficient Test-Time Adaptation of Vision-Language Models

  • [CVPR 2024] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
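
Most of the prompt- and adapter-tuning papers above share one recipe: keep the pre-trained VLM frozen, build a zero-shot classifier from class-name prompts, and train only a small number of extra parameters on few-shot data. Below is a minimal, CLIP-Adapter-style sketch of that recipe, assuming OpenAI's `clip` package; the class names, prompt template, and hyperparameters are illustrative placeholders, not any specific paper's setup.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git) -- assumed dependency

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # the VLM stays frozen; only the adapter below is trained

# Zero-shot classifier built from class-name prompts (class names are placeholders).
class_names = ["car", "pedestrian", "cyclist"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(prompts).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

class Adapter(nn.Module):
    """Small residual MLP on top of frozen image features (CLIP-Adapter style)."""
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )

    def forward(self, x):
        # Blend adapted and original features so zero-shot knowledge is retained.
        return self.alpha * self.net(x) + (1 - self.alpha) * x

adapter = Adapter().to(device)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

def train_step(images, labels):
    """One few-shot step; `images` is a batch already run through `preprocess`."""
    with torch.no_grad():
        img_feats = model.encode_image(images.to(device)).float()
    img_feats = adapter(img_feats)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feats @ text_feats.t()  # scaled cosine similarity
    loss = nn.functional.cross_entropy(logits, labels.to(device))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```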

Knowledge Distillation (Detection, Segmentation & Multi-Task)
  • [NeurIPS 2024] Open-Vocabulary Object Detection via Language Hierarchy

  • [CVPR 2024] RegionGPT: Towards Region Understanding Vision Language Model

  • [ICLR 2024] LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors

  • [ICLR 2024] Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction

  • [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

  • [ICLR 2024] FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

  • [ICLR 2024] AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

  • [CVPR 2023] EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata
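
Several of the open-vocabulary detection papers above rest on the same distillation idea: align a detector's region embeddings with frozen CLIP embeddings so regions can then be classified against arbitrary text labels. The sketch below is a minimal, ViLD-style illustration of that loss under assumed dimensions; the projection head and random tensors are placeholders, not any listed paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 512    # CLIP ViT-B/32 embedding size (assumed teacher)
region_dim = 256   # hypothetical size of the detector's RoI features

# Projection head mapping detector region features into the CLIP embedding space.
proj = nn.Linear(region_dim, embed_dim)

def distillation_loss(region_feats, clip_crop_feats):
    """region_feats: [N, region_dim] from the detector's RoI head.
    clip_crop_feats: [N, embed_dim] frozen CLIP image features of the same crops."""
    student = F.normalize(proj(region_feats), dim=-1)
    teacher = F.normalize(clip_crop_feats, dim=-1)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()  # mean cosine distance

def open_vocab_logits(region_feats, text_feats, temperature=0.01):
    """Classify regions against CLIP text embeddings of arbitrary class names."""
    student = F.normalize(proj(region_feats), dim=-1)
    text = F.normalize(text_feats, dim=-1)
    return student @ text.t() / temperature

# Toy call with random tensors, just to show the expected shapes.
loss = distillation_loss(torch.randn(8, region_dim), torch.randn(8, embed_dim))
```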

World Models

  • HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
    HERMES: a unified driving world model that seamlessly integrates 3D scene understanding with future scene evolution (generation)

  • A Survey of World Models for Autonomous Driving
    The latest (2025) comprehensive survey of world models for autonomous driving

  • DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT

  • Diffusion World Model
    A diffusion world model proposed by Princeton

  • DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
    DrivingGPT: unifies driving world modeling and planning

  • Physical Informed Driving World Model
    DrivePhysica: a driving world model built to respect physical principles; new state of the art in driving-video generation quality

  • Understanding World or Predicting Future? A Comprehensive Survey of World Models
    Understanding the world or predicting the future? A comprehensive survey of world models

  • Navigation World Models
    Recent work from Meta: the Navigation World Model (NWM), a controllable video-generation model that predicts future visual observations from past observations and navigation actions (see the sketch after this list)

  • InfinityDrive: Breaking Time Limits in Driving World Models
    InfinityDrive: the first driving world model with outstanding generalization ability

  • Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey
    A survey exploring the interplay between video generation and world models in autonomous driving

  • DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
    The first method to improve 4D driving-scene reconstruction with a video-generation model; DriveDreamer4D uses world-model priors to enhance 4D driving-scene representation

  • Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving
    Vision-centric 4D occupancy forecasting and planning via world models for autonomous driving

  • Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
    Vista: a generalizable driving world model with high fidelity and versatile controllability

  • Probing Multimodal LLMs as World Models for Driving
    Probing multimodal LLMs as world models for driving

  • DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving
    DriveWorld: 4D pre-trained scene understanding via world models, with across-the-board gains on autonomous-driving tasks

  • Prospective Role of Foundation Models in Advancing Autonomous Vehicles
    Applications and trends of large-scale foundation models in autonomous driving

  • DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
    DriveDreamer-2: the first world model able to generate customized driving videos

  • World Models for Autonomous Driving: An Initial Survey
    An initial survey of world models for autonomous driving
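
As a reference point for the list above (e.g., the Navigation World Model entry), here is a minimal sketch of the interface most of these driving and navigation world models share: encode past observations into a latent state, roll the state forward with action-conditioned dynamics, and decode predicted future observations. The toy network below only illustrates that loop; the layer sizes, shapes, and rollout scheme are assumptions, not any paper's implementation.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_channels=3, latent_dim=128, action_dim=2):
        super().__init__()
        # Observation encoder: image -> latent state.
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Action-conditioned latent dynamics: (state, action) -> next state.
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        # Decoder: latent state -> predicted next observation (64x64 here).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, obs_channels, 4, stride=2, padding=1),
        )

    def rollout(self, obs, actions):
        """obs: [B, C, H, W] current frame; actions: [B, T, action_dim].
        Returns T predicted future frames."""
        state = self.encoder(obs)
        frames = []
        for t in range(actions.shape[1]):
            state = self.dynamics(torch.cat([state, actions[:, t]], dim=-1), state)
            frames.append(self.decoder(state))
        return torch.stack(frames, dim=1)  # [B, T, C, 64, 64]

wm = TinyWorldModel()
future = wm.rollout(torch.randn(2, 3, 64, 64), torch.randn(2, 5, 2))
print(future.shape)  # torch.Size([2, 5, 3, 64, 64])
```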

Diffusion Models

Survey Roundup

  • A collection of resources and papers on diffusion models
    https://github.com/diff-usion/Awesome-Diffusion-Models

  • A list of the latest diffusion models for video generation, editing, restoration, understanding, and more
    https://github.com/showlab/Awesome-Video-Diffusion

  • A survey of diffusion-based image processing, covering restoration, enhancement, coding, and quality assessment
    https://github.com/lixinustc/Awesome-diffusion-model-for-image-processing

  • A collection of graph diffusion generation work, including papers, code, and datasets
    https://github.com/yuntaoshou/Graph-Diffusion-Models-A-Comprehensive-Survey-of-Methods-and-Applications

  • Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices [Paper]

  • Diffusion Models in 3D Vision: A Survey [Paper]

  • Conditional Image Synthesis with Diffusion Models: A Survey [Paper]

  • Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey [Paper]

  • A Survey on Diffusion Models for Recommender Systems [Paper]

  • Diffusion-Based Visual Art Creation: A Survey and New Perspectives [Paper]

  • Replication in Visual Diffusion Models: A Survey and Outlook [Paper]

  • Diffusion Model-Based Video Editing: A Survey [Paper]

  • Diffusion Models and Representation Learning: A Survey [Paper]

  • A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [Paper]

  • Diffusion Models in Low-Level Vision: A Survey [Paper]

  • Video Diffusion Models: A Survey [Paper]

  • A Survey on Diffusion Models for Time Series and Spatio-Temporal Data [Paper]

  • Controllable Generation with Text-to-Image Diffusion Models: A Survey [Paper]

  • Diffusion Model-Based Image Editing: A Survey [Paper]

  • Diffusion Models, Image Super-Resolution And Everything: A Survey [Paper]

  • A Survey on Video Diffusion Models [Paper]

  • A Survey of Diffusion Models in Natural Language Processing [Paper]
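
For readers new to the area covered by these surveys, the following self-contained toy sketch shows the two operations every diffusion model is built around: forward noising under a variance schedule and learned reverse denoising. The 1-D data, tiny MLP, and schedule are purely illustrative assumptions, not a recipe from any listed work.

```python
import torch
import torch.nn as nn

T = 200
betas = torch.linspace(1e-4, 0.02, T)          # toy variance schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# Tiny noise-prediction network: input is (noisy sample, normalized timestep).
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sample_data(n):
    # Toy target distribution: two clusters near -2 and +2.
    return torch.randint(0, 2, (n, 1)).float() * 4.0 - 2.0 + 0.1 * torch.randn(n, 1)

# Training: add noise at a random timestep and learn to predict that noise.
for step in range(2000):
    x0 = sample_data(256)
    t = torch.randint(0, T, (256, 1))
    eps = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
    loss = ((net(torch.cat([xt, t / T], dim=1)) - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: start from pure noise and reverse the diffusion step by step.
x = torch.randn(512, 1)
for t in reversed(range(T)):
    tt = torch.full((512, 1), t)
    eps_hat = net(torch.cat([x, tt / T], dim=1))
    mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    x = mean + (betas[t].sqrt() * torch.randn_like(x) if t > 0 else 0)
print(x.mean().item(), x.std().item())  # samples should roughly cluster near ±2
```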

End-to-End Autonomous Driving

Curated collections of end-to-end autonomous driving research papers, continuously tracking the latest E2E driving work:

  • Link 1: https://github.com/opendilab/awesome-end-to-end-autonomous-driving#Overview-of-End-to-End-Driving-Method

  • Link 2: https://github.com/Pranav-chib/Recent-Advancements-in-End-to-End-Autonomous-Driving-using-Deep-Learning

  • [CVPR 2024] Foundation Models for Autonomous Systems

  • [CVPR 2023] Workshop on End-to-end Autonomous Driving

  • [CVPR 2023] End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation

  • [ICRA 2023] Scalable Autonomous Driving

  • [NeurIPS 2022] Machine Learning for Autonomous Driving

  • [IROS 2022] Behavior-driven Autonomous Driving in Unstructured Environments

  • [ICRA 2022] Fresh Perspectives on the Future of Autonomous Driving Workshop

  • [NeurIPS 2021] Machine Learning for Autonomous Driving

  • [NeurIPS 2020] Machine Learning for Autonomous Driving

  • [CVPR 2020] Workshop on Scalability in Autonomous Driving


MinerU is a high-quality data-extraction tool for converting PDFs to Markdown and JSON. It supports several backend models for document parsing, of which VLM-transformers and VLM-sglang are the two main architectures. They differ significantly in technical implementation, performance, and suitable use cases.

### Architecture

VLM-transformers is a vision-language model built on the HuggingFace Transformers framework; its core idea is to jointly model document images and text. Such a model typically pairs a pre-trained vision encoder (e.g., DETR or Swin Transformer) with a text decoder (e.g., BART or T5), trained end to end to understand document structure and extract content [^2].

VLM-sglang is built on the SGLang framework, an efficient inference engine designed for large language models that supports optimizations such as dynamic batching and parallel inference. Architecturally, VLM-sglang may combine a lightweight vision encoder with an efficient text-generation module to achieve fast, low-latency inference, making it suitable for resource-constrained deployments [^1].

### Performance

In inference speed, VLM-sglang is usually faster than VLM-transformers thanks to SGLang's optimizations: asynchronous inference and batching significantly reduce request latency and raise throughput. By contrast, VLM-transformers may need longer response times, especially on documents with complex structure [^1].

In resource consumption, VLM-sglang generally needs less GPU memory and suits edge devices or low-spec servers, whereas VLM-transformers, being a larger model, usually requires higher-end compute such as A100 or H100 GPUs [^2].

### Use cases

VLM-transformers is better suited to research and engineering documents that demand high-precision parsing, such as paper parsing and technical-manual extraction. Its stronger modeling capacity handles complex tables, formulas, and multilingual content, fitting tasks with strict quality requirements.

VLM-sglang is better suited to large-scale deployment and real-time services, such as online document-conversion platforms and in-house knowledge-management systems. Its efficient inference keeps performance acceptable in resource-constrained environments and fits applications that need fast responses and high concurrency.

### Deployment and extensibility

Deploying VLM-transformers usually relies on the HuggingFace Transformers and PyTorch ecosystem; it supports many model formats and inference frameworks, but the deployment workflow is relatively complex. VLM-sglang, built on SGLang, integrates with serving frameworks such as FastAPI and Ray, with a simpler deployment flow suited to microservice architectures.

For model extensibility, VLM-transformers supports a wider range of pre-trained models and fine-tuning strategies, suiting researchers doing customized development, whereas VLM-sglang focuses on inference efficiency, with extensibility mainly in horizontal scaling on the serving side.

```bash
# Example command for parsing a PDF with the VLM-sglang backend
mineru -p paper.pdf -o output \
  -b vlm-sglang-client \
  -u http://10.0.0.1:30000 \
  --formula on --table on
```