UniAD: Planning-oriented Autonomous Driving

License: CC 4.0 BY-SA

Planning-oriented Autonomous Driving

https://github.com/OpenDriveLab/UniAD

Abstract

Modern autonomous driving systems are characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning. We introduce Unified Autonomous Driving (UniAD), a comprehensive, up-to-date framework that incorporates full-stack driving tasks in one network. It is carefully devised to leverage the advantages of each module, and to provide complementary feature abstractions for agent interaction from a global perspective. Tasks communicate through unified query interfaces to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of this philosophy is demonstrated by substantially outperforming previous state-of-the-art methods in all aspects. Code and models are public.

1. Introduction

With the successful development of deep learning, autonomous driving algorithms are assembled from a series of tasks, including detection, tracking and mapping in perception, and motion and occupancy forecasting in prediction. As depicted in Fig. 1(a), most industry solutions deploy standalone models for each task independently [68, 71], as long as the resource bandwidth of the onboard chip allows. Although such a design simplifies R&D across teams, it bears the risk of information loss across modules, error accumulation and feature misalignment due to the isolation of optimization targets [57, 66, 82].
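Figure 1. Comparison of design choices for autonomous driving frameworks.
(a) Most industry solutions deploy separate models for different tasks.
(b) The multi-task learning scheme shares a backbone with divided task heads.
(c) The end-to-end paradigm unites the perception and prediction modules. Previous attempts either optimize planning directly, as in (c.1), or build a system with only part of the components, as in (c.2). Instead, we argue in (c.3) that a desirable system should be planning-oriented and should properly organize the preceding tasks to facilitate planning.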
A more elegant design is to incorporate a wide span of tasks into a multi-task learning (MTL) paradigm, by plugging several task-specific heads into a shared feature extractor as shown in Fig. 1(b). This is a popular practice in many domains, including general vision [79, 92, 108], autonomous driving [15, 60, 101, 105], such as Transfuser [20] and BEVerse [105], and industrialized products, e.g., Mobileye [68], Tesla [87], Nvidia [71], etc. In MTL, the co-training strategy across tasks could leverage feature abstraction; it could effortlessly extend to additional tasks and save computation cost for onboard chips. However, such a scheme may cause undesirable “negative transfer” [23, 64].
By contrast, the emergence of end-to-end autonomous driving [11, 15, 19, 38, 97] unites all nodes from perception, prediction and planning as a whole. The choice and priority of preceding tasks should be determined in favor of planning. The system should be planning-oriented and carefully designed with the right components involved, so that it incurs little of the accumulated error seen in the standalone option or the negative transfer seen in the MTL scheme. Table 1 describes the task taxonomy of different framework designs.
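Table 1. Task comparison and taxonomy. The “Design” column is categorized as in Fig. 1. “Det.” denotes 3D object detection, “Map” stands for online mapping, and “Occ.” is occupancy map prediction. †: these works are not proposed directly for planning, yet they share the spirit of joint perception and prediction. UniAD conducts five essential driving tasks to facilitate planning.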
Following the end-to-end paradigm, one “tabula-rasa” practice is to directly predict the planned trajectory, without any explicit supervision of perception and prediction, as shown in Fig. 1(c.1). Pioneering works [14, 16, 21, 22, 78, 95, 97, 106] verified this vanilla design in closed-loop simulation [26]. While such a direction deserves further exploration, it is inadequate in safety guarantees and interpretability, especially for highly dynamic urban scenarios. In this paper, we lean toward another perspective and ask the following question: Toward a reliable and planning-oriented autonomous driving system, how should the pipeline be designed in favor of planning, and which preceding tasks are requisite?
An intuitive resolution would be to perceive surrounding objects, predict future behaviors and plan a safe maneuver explicitly, as illustrated in Fig. 1(c.2). Contemporary approaches [11, 30, 38, 57, 82] provide good insights and achieve impressive performance. However, we argue that the devil lies in the details; previous works more or less fail to consider certain components (see block (c.2) in Table 1), and so fall short of the planning-oriented spirit. We elaborate on the detailed definition and terminology, and on the necessity of these modules, in the Supplementary.
To this end, we introduce UniAD, a Unified Autonomous Driving algorithm framework that leverages five essential tasks toward a safe and robust system, as depicted in Fig. 1(c.3) and Table 1(c.3). UniAD is designed in a planning-oriented spirit. We argue that this is not a simple stack of tasks achieved with mere engineering effort. A key component is the query-based design that connects all nodes. Compared to the classic bounding box representation, queries benefit from a larger receptive field, which softens the compounding error from upstream predictions. Moreover, queries are flexible enough to model and encode a variety of interactions, e.g., relations among multiple agents. To the best of our knowledge, UniAD is the first work to comprehensively investigate the joint cooperation of such a variety of tasks, including perception, prediction and planning, in the field of autonomous driving.
The contributions are summarized as follows.
(a) We embrace a new outlook on autonomous driving frameworks following a planning-oriented philosophy, and demonstrate the necessity of effective task coordination, rather than standalone design or simple multi-task learning.
(b) We present UniAD, a comprehensive end-to-end system that leverages a wide span of tasks. The key component to hit the ground running is the query design as interfaces connecting all nodes. As such, UniAD enjoys flexible intermediate representations and exchanges multi-task knowledge toward planning.
(c) We instantiate UniAD on a challenging benchmark for realistic scenarios. Through extensive ablations, we verify the superiority of our method over previous state-of-the-art methods in all aspects.
We hope this work could shed some light on target-driven design for autonomous driving systems, providing a starting point for coordinating various driving tasks.

2. Methodology

Overview. As illustrated in Fig. 2, UniAD comprises four transformer decoder-based perception and prediction modules and one planner at the end. Queries $Q$ connect the pipeline and model the different interactions of entities in the driving scenario. Specifically, a sequence of multi-camera images is fed into the feature extractor, and the resulting perspective-view features are transformed into a unified bird's-eye-view (BEV) feature $B$ by the off-the-shelf BEV encoder in BEVFormer [55]. Note that UniAD is not confined to a specific BEV encoder, and one can utilize other alternatives to extract richer BEV representations with long-term temporal fusion [31, 74] or multi-modality fusion [58, 64]. In TrackFormer, the learnable embeddings that we refer to as track queries inquire about the agents' information from $B$ to detect and track agents. MapFormer takes map queries as semantic abstractions of road elements (e.g., lanes and dividers) and performs panoptic segmentation of the map. With the above queries representing agents and maps, MotionFormer captures interactions among agents and maps and forecasts per-agent future trajectories. Since the action of each agent can significantly impact others in the scene, this module makes joint predictions for all agents considered. Meanwhile, we devise an ego-vehicle query to explicitly model the ego-vehicle and enable it to interact with other agents in such a scene-centric paradigm. OccFormer employs the BEV feature $B$ as queries, equipped with agent-wise knowledge as keys and values, and predicts multi-step future occupancy with agent identity preserved. Finally, Planner utilizes the expressive ego-vehicle query from MotionFormer to predict the planning result, and keeps itself away from occupied regions predicted by OccFormer to avoid collisions.
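Figure 2. Pipeline of Unified Autonomous Driving (UniAD). It is carefully devised following the planning-oriented philosophy. Rather than a simple stack of tasks, UniAD investigates the effect of each module in perception and prediction, leveraging the benefits of joint optimization from the preceding nodes to final planning in the driving scene. All perception and prediction modules are designed in a transformer decoder structure, with task queries as interfaces connecting each node. A simple attention-based planner at the end predicts future waypoints of the ego-vehicle, taking into account the knowledge extracted from the preceding nodes. The occupancy map shown in the figure is for visual aid only.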

2.1. Perception: Tracking and Mapping

TrackFormer. It jointly performs detection and multi-object tracking (MOT) without non-differentiable post-processing. Inspired by [100, 104], we take a similar query design. Besides the conventional detection queries utilized in object detection [8, 109], additional track queries are introduced to track agents across frames. Specifically, at each time step, initialized detection queries are responsible for detecting newborn agents that are perceived for the first time, while track queries keep modeling those agents detected in previous frames. Both detection queries and track queries capture the agent abstractions by attending to the BEV feature $B$. As the scene continuously evolves, track queries at the current frame interact with previously recorded ones in a self-attention module to aggregate temporal information, until the corresponding agents disappear completely (untracked for a certain time period).
Similar to [8], TrackFormer contains $N$ layers, and the final output state $Q_A$ provides knowledge of $N_a$ valid agents for downstream prediction tasks. Besides queries encoding other agents surrounding the ego-vehicle, we introduce one particular ego-vehicle query in the query set to explicitly model the self-driving vehicle itself, which is further used in planning.
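The lifecycle of detection and track queries described above can be sketched in plain Python. This is only an illustration of the bookkeeping, not UniAD's implementation: the function name `update_tracks`, the `max_missed` patience value, and the use of integer IDs in place of learned query embeddings are all assumptions made for the example.

```python
def update_tracks(tracks, matched_ids, new_ids, max_missed=2):
    """Advance the track-query set by one frame.

    tracks      -- dict: agent id -> consecutive frames the agent went unmatched
    matched_ids -- ids of existing track queries that matched an agent this frame
    new_ids     -- ids of newborn agents found by the detection queries
    max_missed  -- hypothetical patience before a track query is dropped
    """
    updated = {}
    for agent_id, missed in tracks.items():
        if agent_id in matched_ids:
            updated[agent_id] = 0            # still visible: reset the counter
        elif missed + 1 <= max_missed:
            updated[agent_id] = missed + 1   # temporarily lost: keep the query alive
        # otherwise the agent has been untracked for too long and is dropped
    for agent_id in new_ids:
        updated.setdefault(agent_id, 0)      # a detection query becomes a track query
    return updated


tracks = {}
tracks = update_tracks(tracks, matched_ids=set(), new_ids={1, 2})   # frame 0
tracks = update_tracks(tracks, matched_ids={1}, new_ids={3})        # frame 1
tracks = update_tracks(tracks, matched_ids={1, 3}, new_ids=set())   # frame 2
tracks = update_tracks(tracks, matched_ids={1, 3}, new_ids=set())   # frame 3
print(sorted(tracks))
```

In this toy run, agent 2 is seen once and then stays unmatched, so its track query is kept for `max_missed` frames and then removed, while agents 1 and 3 persist.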
MapFormer. We design it based on a 2D panoptic segmentation method, Panoptic SegFormer [56]. We sparsely represent road elements as map queries to help downstream motion forecasting, with location and structure knowledge encoded. For driving scenarios, we set lanes, dividers and crossings as things, and the drivable area as stuff [50].
MapFormer also has $N$ stacked layers, and the output of every layer is supervised, while only the updated queries $Q_M$ in the last layer are forwarded to MotionFormer for agent-map interaction.

2.2. Prediction: Motion Forecasting

Recent studies have proven the effectiveness of the transformer structure on the motion task [43, 44, 63, 69, 70, 84, 99], inspired by which we propose MotionFormer in the end-to-end setting. With highly abstract queries for dynamic agents $Q_A$ and static map $Q_M$ from TrackFormer and MapFormer respectively, MotionFormer predicts all agents' multimodal future movements, i.e., top-k possible trajectories, in a scene-centric manner. This paradigm produces multi-agent trajectories in the frame with a single forward pass, which greatly saves the computational cost of aligning the whole scene to each agent's coordinate [49]. Meanwhile, we pass the ego-vehicle query from TrackFormer through MotionFormer to engage the ego-vehicle in interaction with other agents, considering the future dynamics. Formally, the output motion is formulated as $\{\hat{x}_{i,k} \in \mathbb{R}^{T \times 2} \mid i = 1, \dots, N_a;\ k = 1, \dots, K\}$, where $i$ indexes the agent, $k$ indexes the modality of trajectories and $T$ is the length of the prediction horizon.
MotionFormer. It is composed of $N$ layers, and each layer captures three types of interactions: agent-agent, agent-map and agent-goal point. Agent-agent interaction helps the model capture how traffic participants influence each other, agent-map interaction brings in road structure and environment features, and agent-goal point interaction lets the model predict motion toward the intended target. For each motion query $Q_{i,k}$ (defined later; we omit the subscripts $i$, $k$ in the following for simplicity), its interactions with other agents $Q_A$ or map elements $Q_M$ could be formulated as:
$$Q_{a/m} = \text{MHCA}(\text{MHSA}(Q),\ Q_A / Q_M), \tag{1}$$
where MHCA and MHSA denote multi-head cross-attention and multi-head self-attention [91] respectively. It is also important to focus on the intended position, i.e., the goal point, to refine the predicted trajectory: attending to the goal point lets the model account for where the agent intends to arrive, not only for its current interactions. We therefore devise an agent-goal point attention via deformable attention [109] as follows:
$$Q_g = \text{DeformAttn}(Q,\ \hat{x}_T^{l-1},\ B), \tag{2}$$
where $\hat{x}_T^{l-1}$ is the endpoint of the predicted trajectory of the previous layer. $\text{DeformAttn}(q, r, x)$, a deformable attention module, takes in the query $q$, reference point $r$ and spatial feature $x$. It performs sparse attention on the spatial feature around the reference point. Through this, the predicted trajectory is further refined with awareness of the endpoint surroundings. All three interactions are modeled in parallel, where the generated $Q_a$, $Q_m$ and $Q_g$ are concatenated and passed to a multi-layer perceptron (MLP), resulting in the query context $Q_{ctx}$. Then, $Q_{ctx}$ is sent to the successive layer for refinement or decoded as prediction results at the last layer.
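To make the agent-agent and agent-map branches more concrete, the following is a minimal sketch of cross-attention applied on top of self-attention, using a single head and no learned projections. The helper names `softmax`, `attention` and `interact`, the toy vectors, and the omission of the deformable goal-point branch and the final MLP are all simplifications assumed for illustration; the real modules are multi-head and learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs


def interact(motion_queries, context):
    """Cross-attend to `context` after self-attention among the motion queries."""
    self_attended = attention(motion_queries, motion_queries, motion_queries)  # MHSA(Q)
    return attention(self_attended, context, context)                          # MHCA(., context)


Q = [[1.0, 0.0], [0.0, 1.0]]      # two toy motion queries
Q_A = [[1.0, 1.0], [0.5, -0.5]]   # toy agent (track) queries
Q_M = [[0.0, 2.0]]                # a single toy map query

Q_a = interact(Q, Q_A)            # agent-agent branch
Q_m = interact(Q, Q_M)            # agent-map branch
# The branches run in parallel and are concatenated before the MLP;
# the goal-point branch Q_g (deformable attention) is left out of this sketch.
Q_cat = [a + m for a, m in zip(Q_a, Q_m)]
print(Q_cat)
```

Because `Q_M` holds a single map query here, every motion query receives exactly that vector from the agent-map branch, which makes the role of the context set easy to see.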
Motion queries. The input queries for each layer of MotionFormer, termed motion queries, comprise two components: the query context $Q_{ctx}$ produced by the preceding layer as described before, and the query position $Q_{pos}$. Specifically, $Q_{pos}$ integrates positional knowledge from four sources, as in Eq. (3): (1) the position of scene-level anchors $I^s$; (2) the position of agent-level anchors $I^a$; (3) the current location of agent $i$; and (4) the predicted goal point. This design lets every layer take into account both the agent's current state and its goal point, and combining scene-level with agent-level anchor positions helps the model relate each agent to the traffic environment and to the other agents.
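$$Q_{pos} = \text{MLP}(\text{PE}(I^s)) + \text{MLP}(\text{PE}(I^a)) + \text{MLP}(\text{PE}(\hat{x}_0)) + \text{MLP}(\text{PE}(\hat{x}_T^{l-1})). \tag{3}$$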
Here the sinusoidal position encoding $\text{PE}(\cdot)$ followed by an MLP is utilized to encode the positional points, and $\hat{x}_T^0$ is set as $I^s$ at the first layer (subscripts $i$, $k$ are also omitted). The scene-level anchor represents prior movement statistics in a global view, while the agent-level anchor captures the possible intention in the local coordinate. Both are clustered by the k-means algorithm on the endpoints of ground-truth trajectories, to narrow down the uncertainty of prediction. In contrast to this prior knowledge, the start point provides a customized positional embedding for each agent, and the predicted endpoint serves as a dynamic anchor optimized layer by layer in a coarse-to-fine fashion.
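As a rough illustration of how anchors could be obtained by clustering trajectory endpoints, here is a small k-means sketch in plain Python. The function name `kmeans`, the deterministic seeding with the first `k` points, the iteration count and the toy endpoints are assumptions for the example; the source does not specify these details.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2D points.

    The first k points seed the centers, an assumption made here so that
    the example is deterministic.
    """
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x, y in points:
            # assign each endpoint to its nearest center
            nearest = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ]
    return centers


# Toy ground-truth trajectory endpoints: one group going straight, one turning left.
endpoints = [(0.0, 10.0), (-8.0, 6.0), (0.5, 11.0), (-7.0, 7.0), (-0.5, 9.0), (-9.0, 5.0)]
anchors = kmeans(endpoints, k=2)
print(anchors)
```

Each resulting center summarizes one typical destination ("go straight" or "turn left" in this toy data), which is the sense in which clustered anchors narrow down the prediction uncertainty.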
Non-linear Optimization. Different from conventional motion forecasting works, which have direct access to ground-truth perceptual results, i.e., agents' locations and corresponding tracks, we consider the prediction uncertainty from the prior module in our end-to-end paradigm. Brutally regressing the ground-truth waypoints from an imperfect detection position or heading angle may lead to unrealistic trajectory predictions with large curvature and acceleration. To tackle this, we adopt a non-linear smoother [7] to adjust the target trajectories and make them physically feasible given an imprecise starting point predicted by the upstream module. In outline, the process includes:

Initialization: the trajectory is initialized from the starting point and target point predicted by the upstream module.
Iterative adjustment: the non-linear smoother adjusts the trajectory iteratively to reduce the gap between the trajectory and physical feasibility.
Smoothness constraints: smoothness constraints are applied to avoid unrealistic changes in curvature.
Termination: the optimization stops once the trajectory meets a physical-plausibility criterion or a preset number of iterations is reached.

Formally, the process is:
$$\tilde{x}^* = \arg\min_{x} c(x, \tilde{x}), \tag{4}$$
where $\tilde{x}$ and $\tilde{x}^*$ denote the ground-truth and smoothed trajectory, $x$ is generated by multiple-shooting [3], and the cost function is as follows:
$$c(x, \tilde{x}) = \lambda_{xy}\|x, \tilde{x}\|_2 + \lambda_{goal}\|x_T, \tilde{x}_T\|_2 + \sum_{\phi \in \Phi} \phi(x), \tag{5}$$
where $\lambda_{xy}$ and $\lambda_{goal}$ are hyperparameters, and the kinematic function set $\Phi$ has five terms: jerk, curvature, curvature rate, acceleration and lateral acceleration. The cost function regularizes the target trajectory to obey kinematic constraints. This target trajectory optimization is only conducted in training and does not affect inference.

Multiple shooting is a numerical optimization technique for boundary value problems: it splits the problem into several sub-problems, each with its own initial and target conditions, and solves them iteratively to approximate the solution of the whole problem.
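A cost of this kind can be sketched as a tracking term weighted by $\lambda_{xy}$, an endpoint term weighted by $\lambda_{goal}$, and a sum of kinematic penalties. The version below is a deliberately reduced illustration: it includes only two of the five kinematic terms (acceleration and jerk, via finite differences), and the function names, unit weights and time step `dt` are placeholder assumptions rather than the paper's settings. It evaluates the cost only; the minimization over $x$ is not shown.

```python
import math


def diff(points, dt):
    """Finite-difference derivative of a list of (x, y) samples."""
    return [((b[0] - a[0]) / dt, (b[1] - a[1]) / dt) for a, b in zip(points, points[1:])]


def norm_sum(vectors):
    """Sum of Euclidean norms of a list of 2D vectors."""
    return sum(math.hypot(vx, vy) for vx, vy in vectors)


def smoothing_cost(x, x_gt, lam_xy=1.0, lam_goal=1.0, dt=0.5):
    """Stay close to the ground truth, reach its endpoint, and stay smooth.

    Only acceleration and jerk penalties are included; weights and time step
    are placeholder values chosen for the example.
    """
    track = math.sqrt(sum((p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2 for p, g in zip(x, x_gt)))
    goal = math.hypot(x[-1][0] - x_gt[-1][0], x[-1][1] - x_gt[-1][1])
    velocity = diff(x, dt)
    acceleration = diff(velocity, dt)
    jerk = diff(acceleration, dt)
    return lam_xy * track + lam_goal * goal + norm_sum(acceleration) + norm_sum(jerk)


ground_truth = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0)]
wobbly = [(0.0, 0.0), (0.5, 1.0), (-0.5, 2.0), (0.5, 3.0), (0.0, 4.0)]

print(smoothing_cost(ground_truth, ground_truth))  # constant velocity: zero cost
print(smoothing_cost(wobbly, ground_truth))        # penalized for deviation and jerkiness
```

A smoother would search for the trajectory that minimizes this value; the wobbly candidate scores worse than the straight one both because it deviates from the ground truth and because its lateral oscillation produces large acceleration and jerk.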

2.3. Prediction: Occupancy Prediction

An occupancy grid map is a discretized BEV representation where each cell holds a belief indicating whether it is occupied, and the occupancy prediction task is to discover how the grid map changes in the future. Previous approaches utilize an RNN structure for temporally expanding future predictions from observed BEV features [35, 38, 105]. However, they rely on highly hand-crafted clustering post-processing to generate per-agent occupancy maps, as they are mostly agent-agnostic, compressing BEV features as a whole into RNN hidden states. Due to the deficient usage of agent-wise knowledge, it is challenging for them to predict the behaviors of all agents globally, which is essential to understand how the scene evolves. To address this, we present OccFormer to incorporate both scene-level and agent-level semantics in two aspects: (1) a dense scene feature acquires agent-level features via a carefully designed attention module when unrolling to future horizons; (2) we produce instance-wise occupancy easily by a matrix multiplication between agent-level features and dense scene features, without heavy post-processing.
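The second point, instance-wise occupancy from a matrix multiplication, can be illustrated with a toy example. In this sketch each agent-level feature vector is dotted with the feature of every BEV cell, and a sigmoid plus threshold turns the result into one binary mask per agent. The function name `instance_occupancy`, the 0.5 threshold and the hand-made features are assumptions for illustration only.

```python
import math


def instance_occupancy(agent_features, scene_features, threshold=0.5):
    """Per-agent occupancy masks from a matrix product.

    agent_features -- Na vectors of length D (one per agent)
    scene_features -- H x W grid of length-D vectors (dense BEV feature)
    Returns Na binary masks of size H x W.
    """
    masks = []
    for agent in agent_features:
        mask = []
        for row in scene_features:
            logits = [sum(a * f for a, f in zip(agent, cell)) for cell in row]
            probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]   # sigmoid
            mask.append([1 if p > threshold else 0 for p in probs])
        masks.append(mask)
    return masks


# Toy 2 x 3 BEV grid with D = 2; each cell feature "points toward" at most one agent.
scene = [
    [[4.0, -4.0], [4.0, -4.0], [-4.0, -4.0]],
    [[-4.0, -4.0], [-4.0, 4.0], [-4.0, 4.0]],
]
agents = [[1.0, 0.0], [0.0, 1.0]]   # two agent-level feature vectors

for mask in instance_occupancy(agents, scene):
    print(mask)
```

Each agent gets its own mask directly from the product, so agent identity is preserved without any clustering step afterwards.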
OccFormer is composed of $T_o$ sequential blocks, where $T_o$ indicates the prediction horizon. Note that $T_o$ is typically smaller than $T$ in the motion task, due to the high computation cost of densely represented occupancy. Each block takes as input the rich agent features $G_t$ and the state (dense feature) $F_{t-1}$ from the previous layer, and generates $F_t$ for timestep $t$ considering both instance- and scene-level information. To get the agent feature $G_t$ with dynamics and spatial priors, we max-pool motion queries from MotionFormer in the modality dimension, denoted as $Q_X \in \mathbb{R}^{N_a \times D}$, with $D$ as the feature dimension. Then we fuse it with the upstream track query $Q_A$ and current position embedding $P_A$ via a temporal-specific MLP: