
PlanT: Explainable Planning Transformers via Object-Level Representations


paper
code


Abstract

Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3× faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant.
Keywords: Autonomous Driving, Transformers, Explainability

1 Introduction

The ability to plan is an important aspect of human intelligence, allowing us to solve complex navigation tasks. For example, to change lanes on a busy highway, a driver must wait for sufficient space in the new lane and adjust the speed based on the expected behavior of the other vehicles. Humans quickly learn this and can generalize to new scenarios, a trait we would also like autonomous agents to have. Due to the difficulty of the planning task, the field of autonomous driving is shifting away from traditional rule-based algorithms [1, 2, 3, 4, 5, 6, 7, 8] towards learning-based solutions [9, 10, 11, 12, 13, 14]. Learning-based planners directly map the environmental state representation (e.g., HD maps and object bounding boxes) to waypoints or vehicle controls. They emerged as a scalable alternative to rule-based planners which require significant manual effort to design.
Learning-based planners typically rely on deep learning or reinforcement learning techniques that learn decision-making directly from data. By analyzing large numbers of driving scenarios together with the corresponding successful or failed decisions, such systems learn to predict the best course of action in a given situation.
Interestingly, while humans reason about the world in terms of objects [15, 16, 17], most existing learned planners [9, 12, 18] choose a high-dimensional pixel-level input representation by rendering bird’s eye view (BEV) images of detailed HD maps (Fig. 1 left). It is widely believed that this kind of accurate scene understanding is key for robust self-driving vehicles, leading to significant interest in recovering pixel-level BEV information from sensor inputs [19, 20, 21, 22, 23, 24]. In this paper, we investigate whether such detailed representations are actually necessary to achieve convincing planning performance. We propose PlanT, a learning-based planner that leverages an object-level representation (Fig. 1 right) as an input to a transformer encoder [25]. We represent a scene as a set of features corresponding to (1) nearby vehicles and (2) the route the planner must follow. We show that despite the low feature dimensionality, our model achieves state-of-the-art results. We then propose a novel evaluation scheme and metric to analyze explainability, which is generally applicable to any learning-based planner. Specifically, we test the ability of a planner to identify the objects that are the most relevant to take into account when planning a collision-free route.
Figure 1: Scene representations for planning. As an alternative to the prevalent paradigm of pixel-level planners (left), we demonstrate the effectiveness of a compact object-level representation (right).
We perform a detailed empirical analysis of learning-based planning on the Longest6 benchmark [26] of the CARLA simulator [27]. We first identify the key missing elements in the design of existing learned planners such as their incomplete field of view and sub-optimal dataset and model sizes. We then show the advantages of our proposed transformer architecture, including improvements in performance and significantly faster inference times. Finally, we show that the attention weights of the transformer, which are readily accessible, can be used to represent object relevance. Our qualitative and quantitative results on explainability confirm that PlanT attends to the objects that match our intuition for the relevance of objects for safe driving.
Contributions. (1) Using a simple object-level representation, we significantly improve upon the previous state of the art for planning on CARLA via PlanT, our novel transformer-based approach. (2) Through a comprehensive experimental study, we identify that the ego vehicle’s route, a full 360° field of view, and information about vehicle speeds are critical elements of a planner’s input representation. (3) We propose a protocol and metric for evaluating a planner’s prioritization of obstacles in a scene and show that PlanT is more explainable than CNN-based methods, i.e., the attention weights of the transformer identify the most relevant objects more reliably.

2 Related Work

Intermediate Representations for Driving. Early work on decoupling end-to-end driving into two stages predicts a set of low-dimensional affordances from sensor inputs with CNNs, which are then input to a rule-based planner [28]. These affordances are scene-descriptive attributes (e.g. emergency brake, red light, center-line distance, angle) that are compact, yet comprehensive enough to enable simple driving tasks, such as urban driving on the initial version of CARLA [27]. Unfortunately, methods based on affordances perform poorly on subsequent benchmarks in CARLA which involve higher task complexity [29]. Most state-of-the-art driving models instead rely heavily on annotated 2D data either as intermediate representations or auxiliary training objectives [26, 30]. Several subsequent studies show that using semantic segmentation as an intermediate representation helps for navigational tasks [31, 32, 33, 34]. More recently, there has been rapidly growing interest in using BEV semantic segmentation maps as the input representation to planners [9, 12, 30, 18]. To reduce the immense labeling cost of such segmentation methods, Behl et al. [35] propose visual abstractions, which are label-efficient alternatives to dense 2D semantic segmentation maps. They show that reduced class counts and the use of bounding boxes instead of pixel-accurate masks for certain classes are sufficient. Wang et al. [36] explore the use of object-centric representations for planning by explicitly extracting objects and rendering them into a BEV input for a planner. However, so far, the literature lacks a systematic analysis of whether object-centric representations are better or worse than BEV context techniques for planning in dense traffic, which we address in this work. We keep our representation simple and compact by directly considering the set of objects as inputs to our models. In addition to baselines using CNNs to process the object-centric representation, we show that using a transformer leads to improved performance, efficiency, and explainability.
Transformers for Forecasting. Transformers obtain impressive results in several research areas [25, 37, 38, 39], including simple interactive environments such as Atari games [40, 41, 42, 43, 44]. While the end objective differs, one application domain that involves similar challenges to planning is motion forecasting. Most existing motion forecasting methods use a rasterized input in combination with a CNN-based network architecture [45, 46, 47, 48, 49, 50]. Gao et al. [51] show the advantages of object-level representations for motion forecasting via Graph Neural Networks (GNN). Several follow-ups to this work use object-level representations in combination with Transformer-based architectures [52, 53, 54]. Our key distinctions when compared to these methods are the architectural simplicity of PlanT (our use of simple self-attention transformer blocks and the proposed route representation) as well as our closed-loop evaluation protocol (we evaluate the driving performance in simulation and report online driving metrics).
Explainability. Explaining the decisions of neural networks is a rapidly evolving research field [55, 56, 57, 58, 59, 60, 61]. In the context of self-driving cars, existing work uses text [62] or heatmaps [63] to explain decisions. In our work, we can directly obtain post hoc explanations for decisions of our learning-based PlanT architecture by considering its learned attention. While the concurrent work CAPO [64] uses a similar strategy, it only considers pedestrian-ego interactions on an empty route, while we consider the full planning task in an urban environment with dense traffic. Furthermore, we introduce a simple metric to measure the quality of explanations for a planner.
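As a rough sketch of how attention can serve as a post hoc explanation (not the paper's exact protocol or metric), one can average a transformer's self-attention over heads and layers and read off how strongly the token used for planning attends to each object token. The function and tensor shapes below are illustrative assumptions:

```python
import torch

# Hedged sketch: scoring object relevance from self-attention. Each layer
# yields attention of shape (num_heads, num_tokens, num_tokens); we average
# over heads and layers and read off how much the planning/ego token
# attends to every other token.

def object_relevance(attn_per_layer, ego_index=0):
    """One relevance score per input token, averaged over layers and heads."""
    rows = [a.mean(dim=0)[ego_index] for a in attn_per_layer]  # (N,) per layer
    return torch.stack(rows).mean(dim=0)                       # (N,)

# Dummy usage: 3 layers, 8 heads, 13 tokens (ego + 12 objects).
attn = [torch.softmax(torch.randn(8, 13, 13), dim=-1) for _ in range(3)]
ranking = object_relevance(attn).argsort(descending=True)
print(ranking)  # token indices ordered from most to least attended
```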

3 Planning Transformers

In this section, we provide details about our task setup, novel scene representation, simple but effective architecture, and training strategy resulting in state-of-the-art performance. A PyTorch-style pseudo-code snippet outlining PlanT and its training is provided in the supplementary material.
Task. We consider the task of point-to-point navigation in an urban setting where the goal is to drive from a start to a goal location while reacting to other dynamic agents and following traffic rules. We use Imitation Learning (IL) to train the driving agent. The goal of IL is to learn a policy π that imitates the behavior of an expert π* (the expert implementation is described in Section 4). In our setup, the policy is a mapping π : X → W from our novel object-level input representation X to the future trajectory W of an expert driver. For following traffic rules, we assume access to the state of the next traffic light relevant to the ego vehicle, l ∈ {green, red}.
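To make the IL setup concrete, the following is a minimal sketch in the spirit of the PyTorch-style pseudo-code the authors mention for the supplementary material. The Policy class, its mean-pooling readout, the layer sizes, and the L1 imitation loss are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Minimal, illustrative sketch of the IL setup described above. A policy pi
# maps the object-level scene representation X (one 6-dim token per vehicle /
# route segment) to the future trajectory W of expert waypoints.

class Policy(nn.Module):
    def __init__(self, attr_dim=6, hidden=256, num_waypoints=4):
        super().__init__()
        self.embed = nn.Linear(attr_dim, hidden)  # per-object token embedding
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(hidden, num_waypoints * 2)  # (x, y) per waypoint

    def forward(self, objects):                     # objects: (B, N, 6)
        tokens = self.encoder(self.embed(objects))  # self-attention over objects
        pooled = tokens.mean(dim=1)   # simple readout; PlanT's actual decoder differs
        return self.head(pooled).view(objects.shape[0], -1, 2)

# One behavior-cloning step against expert waypoints (L1 imitation loss).
policy = Policy()
X = torch.randn(8, 12, 6)             # batch of 8 scenes, 12 objects each
W_expert = torch.randn(8, 4, 2)       # expert future trajectory
loss = nn.functional.l1_loss(policy(X), W_expert)
loss.backward()
```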
Tokenization. To encode the task-specific information required from the scene, we represent it using a set of objects, with vehicles and segments of the route each being assigned an oriented bounding box in BEV space (Fig. 1 right). Let X_t = V_t ∪ S_t, where V_t ∈ ℝ^{V_t×A} and S_t ∈ ℝ^{S_t×A} represent the set of vehicles and the set of route segments at time-step t, with A = 6 attributes each. Specifically, if o_{i,t} ∈ X_t represents a particular object, the attributes of o_{i,t} include an object type-specific attribute z_{i,t} (described below), the position of the bounding box (x_{i,t}, y_{i,t}) relative to the ego vehicle, the orientation φ_{i,t} ∈ [0, 2π], and the extent (w_{i,t}, h_{i,t}). Thus, each object o_{i,t} can be described as a vector o_{i,t} = {z_{i,t}, x_{i,t}, y_{i,t}, φ_{i,t}, w_{i,t}, h_{i,t}}, or concisely as {o_{i,t,a}} for a = 1, …, 6.
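A minimal sketch of this tokenization, assuming vehicles and route segments have already been extracted. ObjectToken and scene_tokens are hypothetical names, not the authors' code; treating z as the vehicle's speed follows the contributions above, while the paper's full definition of z lies beyond this excerpt:

```python
import math
from dataclasses import dataclass

# Each vehicle or route segment becomes one A = 6 attribute vector
# o = (z, x, y, phi, w, h) in the ego frame, as described above.

@dataclass
class ObjectToken:
    z: float    # type-specific attribute (e.g. speed for vehicles; assumption)
    x: float    # longitudinal position relative to the ego vehicle [m]
    y: float    # lateral position relative to the ego vehicle [m]
    phi: float  # orientation, normalized to [0, 2*pi)
    w: float    # bounding-box width [m]
    h: float    # bounding-box length [m]

    def as_vector(self) -> list:
        return [self.z, self.x, self.y, self.phi % (2 * math.pi), self.w, self.h]

def scene_tokens(vehicles: list, route_segments: list) -> list:
    """X_t = V_t ∪ S_t: the union of vehicle and route-segment tokens."""
    return [o.as_vector() for o in vehicles + route_segments]

# Example: one nearby vehicle and one route segment.
tokens = scene_tokens(
    vehicles=[ObjectToken(z=5.2, x=8.0, y=-0.3, phi=0.1, w=2.0, h=4.5)],
    route_segments=[ObjectToken(z=0.0, x=10.0, y=0.0, phi=0.0, w=3.5, h=20.0)],
)
```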
