3 Problem Formulation
Given a language instruction $I$ and the agent's egocentric observation $O$, the aerial VLN agent has to determine a sequence of actions to reach the target location $p_d$ in a continuous 3D space. At each action step $t$, the agent follows a policy $\pi$ that takes the current observation $o_t$ and the instruction $I$ as input to predict the next action $a_t$, and moves to location $p_t$ according to its kinematic model $F$:
$$p_t = F(p_{t-1}, \pi(o_t, I)). \qquad (1)$$
Given a sequence of actions, the agent reaches a final position, and the success probability $P_s$ of reaching the target $p_d$ is
$$P_s = P\left(\|F(\pi(p_0, O, I)) - p_d\| < \epsilon\right), \qquad (2)$$
where $\|\cdot\|$ is the Euclidean distance and $\epsilon$ is the threshold indicating whether the target is reached. Thus, the goal of VLN is to find a policy $\pi^*$ that maximizes the success rate:
$$\pi^* = \arg\max_{\pi} P_s. \qquad (3)$$
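For intuition, the formulation above can be read procedurally: the agent repeatedly applies Eq. (1) and is judged by the criterion inside Eq. (2). The following minimal sketch illustrates this loop; the `policy`, `kinematics`, and `observe` callables and the threshold value are placeholders for illustration, not the components used in this work.

```python
import numpy as np

def rollout_success(policy, kinematics, observe, p0, instruction, p_target,
                    eps=20.0, max_steps=100):
    """Roll out a policy via Eq. (1) and test the success criterion of Eq. (2).

    policy(o_t, I)          -> action a_t          (placeholder interface)
    kinematics(p_prev, a_t) -> next position p_t   (the kinematic model F)
    observe(p_t)            -> egocentric observation o_t
    eps is the success threshold epsilon; the value here is purely illustrative.
    """
    p = np.asarray(p0, dtype=float)
    for _ in range(max_steps):
        o_t = observe(p)                          # current observation o_t
        a_t = policy(o_t, instruction)            # a_t = pi(o_t, I)
        p = kinematics(p, a_t)                    # p_t = F(p_{t-1}, pi(o_t, I))
        if np.linalg.norm(p - p_target) < eps:    # ||p_t - p_d|| < epsilon
            return True
    return False
```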
4 CityNavAgent
In this section, we present the workflow of the pro-
posed CityNavAgent for zero-shot aerial VLN in
urban environments. As shown in Figure 1, the CityNavAgent framework comprises three key modules. 1) The open-vocabulary perception module extracts structured semantic and spatial information from the agent's panoramic observations via a foundation model. 2) The hierarchical semantic planning module (HSPM) leverages an LLM to decompose the navigation task into hierarchical planning sub-tasks, reducing planning complexity, and predicts intermediate waypoints for the execution module.
3) The global memory module, represented as a
topological graph, stores valid waypoints extracted
from historical trajectories to further reduce the
action space of motion planning and enhance long-
range navigation stability.
4.1 Open-vocabulary Perception Module
To accurately understand complex semantics in the urban environment, we leverage powerful open-vocabulary captioning and grounding models to extract scene semantic features. By integrating the scene semantics with depth information, we further construct a 3D semantic map for downstream planning.
Scene Semantic Perception Extracting urban
scene semantics requires robust open-vocabulary
object recognition.

Figure 2: CityNavAgent consists of three key modules. The open-vocabulary perception module extracts open-vocabulary objects in the scenes and builds a semantic point cloud of the surroundings. The hierarchical semantic planning module decomposes the original instruction into sub-goals with different semantic levels and predicts the agent's next action to achieve the high-level sub-goals. The global memory module stores historical trajectories to assist motion planning toward visited targets.

As shown in Figure 2, given a set of panoramic images $\{I^t_{R_i}\}$ at step $t$, we first
use an open-vocabulary image captioner $\mathrm{CAP}(\cdot)$ empowered by GPT-4V (Achiam et al., 2023) to generate object captions $c^t_i$ for the image $I^t_{R_i}$. Then, we leverage a visual detector $\mathrm{VD}(\cdot)$ named GroundingDINO (Liu et al., 2023a) to generate the bounding boxes $\mathrm{obb}^t_i$ for the captioned objects by
$$\mathrm{obb}^t_i = \mathrm{VD}(c^t_i, I^t_{R_i}), \quad c^t_i = \mathrm{CAP}(I^t_{R_i}). \qquad (4)$$
Finally, the bounding boxes are tokenized by a visual tokenizer $\mathrm{VT}(\cdot)$, which is then fed into a segmentation model $\mathrm{SEG}(\cdot)$ (Kirillov et al., 2023) to generate fine-grained semantic masks $I^t_{S_i}$ for objects as follows:
$$I^t_{S_i} = \mathrm{SEG}(\mathrm{VT}(\mathrm{obb}^t_i), I^t_{R_i}). \qquad (5)$$
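To make the data flow of Eqs. (4)-(5) explicit, the sketch below chains the caption, grounding, tokenization, and segmentation steps over one panorama. The `cap`, `vd`, `vt`, and `seg` callables stand in for GPT-4V, GroundingDINO, the visual tokenizer, and the segmentation model, respectively; their exact interfaces are assumptions made for illustration.

```python
from typing import Callable, List
import numpy as np

def perceive_semantics(panorama: List[np.ndarray], cap: Callable, vd: Callable,
                       vt: Callable, seg: Callable):
    """Data flow of Eqs. (4)-(5): caption -> ground -> tokenize -> segment.

    cap(img)         -> list of open-vocabulary object phrases
    vd(phrases, img) -> bounding boxes for the captioned objects
    vt(boxes)        -> prompts/tokens for the segmenter
    seg(tokens, img) -> fine-grained semantic masks, one per object
    """
    results = []
    for img in panorama:                 # one pass per panoramic view I^t_{R_i}
        captions = cap(img)              # c^t_i     = CAP(I^t_{R_i})
        boxes = vd(captions, img)        # obb^t_i   = VD(c^t_i, I^t_{R_i})
        masks = seg(vt(boxes), img)      # I^t_{S_i} = SEG(VT(obb^t_i), I^t_{R_i})
        results.append(list(zip(captions, masks)))
    return results
```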
Scene Spatial Perception Considering that egocentric views suffer from perspective overlap (An et al., 2023) and fail to capture 3D spatial relationships (Gao et al., 2024b), we construct a 3D point map by projecting segmentation masks of observations into metric 3D space. Leveraging the RGB-D sensor's depth map $I^t_{D_i}$ and the agent's pose $(R, T)$, a geometric projector (GP) transforms each segmented pixel $p_{ik} = (u, v) \in I^t_{S_i}$ labeled with caption $c^t_{ik}$ into a 3D point $P_{ik}$ via:
$$P_{ik} = R \cdot I^t_{D_i}(u, v) \cdot K^{-1} \cdot p_{ik} + T, \qquad (6)$$
where $K$ is the intrinsic matrix of the camera, while $R \in SO(3)$ and $T \in \mathbb{R}^3$ represent the agent's instantaneous orientation and position in world coordinates. By mapping object captions from the 2D masks to the 3D point cloud, a local semantic point cloud $\{(P_{ik}, c^t_{ik}) \mid i = 1, \ldots, n,\ k = 1, \ldots, m\}$ is constructed, where $n$ is the number of panoramic images and $m$ is the number of pixels.
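The back-projection of Eq. (6) can be implemented with a few lines of linear algebra. The sketch below assumes a standard pinhole camera, a metric depth image aligned with the mask, and the pose convention used above; it illustrates the geometric projector rather than reproducing its exact implementation.

```python
import numpy as np

def lift_mask_to_points(mask, depth, caption, K, R, T):
    """Back-project the labeled pixels of one semantic mask into world space (Eq. 6).

    mask  : HxW boolean mask of a segmented object
    depth : HxW metric depth image aligned with the mask (I_D)
    K     : 3x3 camera intrinsics; R, T: agent orientation and position
    Returns (point, caption) pairs that go into the local semantic point cloud.
    """
    vs, us = np.nonzero(mask)                       # pixel rows (v) and columns (u)
    pix = np.stack([us, vs, np.ones_like(us)])      # homogeneous pixels p, shape (3, N)
    rays = np.linalg.inv(K) @ pix                   # K^{-1} p
    cam_pts = rays * depth[vs, us]                  # scale each ray by I_D(u, v)
    world_pts = (R @ cam_pts).T + np.asarray(T)     # rotate and translate to world frame
    return [(p, caption) for p in world_pts]
```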
4.2 Hierarchical Semantic Planning Module
4.2.1 Landmark-level Planning
Since aerial VLN tasks typically involve long-
range decision-making (Liu et al., 2023b; Chen
et al., 2019), directly assigning the entire naviga-
tion task to the agent can hinder accurate alignment
with the instructions and task progress tracking. A
more effective approach is to decompose the task
into a sequence of manageable sub-goals. By ad-
dressing these sub-goals step by step, the agent
can progressively reach the final destination. To achieve this, we propose a landmark-level planner driven by an LLM (Achiam et al., 2023) that parses the free-form instruction $T$ and extracts a sequence of landmark phrases $L$ along the path through prompt engineering. These landmarks act as sub-goals for the agent. We present a simple prompt as follows (more details in Appendix A):
You need to extract a landmark sequence
from the given instruction. The sequence
order should be consistent with their appear-
ance order on the path.
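A minimal sketch of this planner is given below. The `call_llm` callable and the JSON output convention are assumptions made for illustration; the full prompt is given in Appendix A.

```python
import json
from typing import Callable, List

LANDMARK_PROMPT = (
    "You need to extract a landmark sequence from the given instruction. "
    "The sequence order should be consistent with their appearance order on the path. "
    "Answer with a JSON list of landmark phrases.\n"
    "Instruction: {instruction}"
)

def extract_landmarks(instruction: str, call_llm: Callable[[str], str]) -> List[str]:
    """Landmark-level planning: parse the instruction into ordered sub-goal phrases."""
    reply = call_llm(LANDMARK_PROMPT.format(instruction=instruction))
    return json.loads(reply)   # e.g. ["telephone booth", "white statue"]
```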
4.2.2 Object-level Planning
After landmark-level planning has produced a sequence of sub-goals, the object-level planner $\mathrm{OP}(\cdot)$ employs the LLM to further decompose these sub-goals into more achievable steps for the agent. The key idea is to leverage the commonsense knowledge of the LLM to reason about the visible object region most pertinent to the currently invisible sub-goal in the panorama. This region is referred to as the object region of interest (OROI) in this paper. For example, if the agent only sees buildings and a road in the current view while its sub-goal is a traffic light, commonsense reasoning suggests the road as the next OROI to move toward. We design a prompt that comprises the original navigation instruction $T$, the scene object captions $c^t$, and the current sub-goal $L_i$ for $\mathrm{OP}(\cdot)$ to reason about the OROI $c^t_{OROI}$, which is given by:
$$c^t_{OROI} = \mathrm{OP}(T, L_i, c^t). \qquad (7)$$
Its template is (more details in Appendix A.1):
The navigation instruction is: .... Your next
navigation subgoal is: ... Objects or areas
you observed: ...
Based on the instruction, next sub-
goal, and observation, list 3 objects most
pertinent to the subgoal or you will probably
go next from your [Observed Objects].
Output should be in descending order of
probability.
We select the OROI to which the LLM assigns the highest probability of leading to the next landmark as the agent's next waypoint target.
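A hedged sketch of the object-level planner is shown below; the prompt mirrors the template above, while the line-based parsing of the LLM reply and the `call_llm` interface are illustrative assumptions.

```python
from typing import Callable, List

OROI_PROMPT = (
    "The navigation instruction is: {instruction}\n"
    "Your next navigation subgoal is: {subgoal}\n"
    "Objects or areas you observed: {objects}\n"
    "Based on the instruction, next sub-goal, and observation, list 3 objects most "
    "pertinent to the subgoal or you will probably go next from your [Observed Objects]. "
    "Output should be in descending order of probability, one object per line."
)

def plan_oroi(instruction: str, subgoal: str, observed: List[str],
              call_llm: Callable[[str], str]) -> str:
    """Object-level planning (Eq. 7): choose the observed region most relevant to the sub-goal."""
    reply = call_llm(OROI_PROMPT.format(instruction=instruction, subgoal=subgoal,
                                        objects=", ".join(observed)))
    ranked = [line.strip() for line in reply.splitlines() if line.strip()]
    return ranked[0]   # the top-ranked OROI becomes the target for motion-level planning
```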
4.2.3 Motion-level Planning
Motion-level planning is responsible for translat-
ing the output of high-level planning modules into
reachable waypoints and executable actions for
the agent. Given a reasoned $c^t_{OROI}$, the motion-level planner first retrieves the corresponding points $\{(P_k, c^t_k) \mid c^t_k = c^t_{OROI}\}$ from the semantic point cloud in §4.1 and computes the next waypoint by averaging the coordinates of the selected points. Then,
the planner decomposes the path to the waypoint
into a sequence of executable actions for the agent.
If the agent has reached a location close to the
memory graph, the motion planner will directly
use the memory graph to predict the agent’s future
actions, which is introduced in the next section.
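The two steps of the motion-level planner, waypoint estimation from the semantic point cloud and path decomposition, can be sketched as follows. The fixed step length and straight-line decomposition are simplifying assumptions; the actual action sequence depends on the agent's kinematic model.

```python
from typing import List, Tuple
import numpy as np

def next_waypoint(point_cloud: List[Tuple[np.ndarray, str]], oroi: str) -> np.ndarray:
    """Average the 3D points whose caption matches the reasoned OROI (Sec. 4.1)."""
    # assumes at least one point in the cloud carries the OROI caption
    pts = np.array([p for p, caption in point_cloud if caption == oroi])
    return pts.mean(axis=0)                       # centroid of the OROI region

def waypoint_to_actions(p_now: np.ndarray, p_goal: np.ndarray,
                        step: float = 5.0) -> List[np.ndarray]:
    """Decompose the straight-line path to the waypoint into fixed-length sub-moves."""
    delta = p_goal - p_now
    n = max(1, int(np.ceil(np.linalg.norm(delta) / step)))   # number of sub-moves
    return [p_now + delta * (k + 1) / n for k in range(n)]   # intermediate positions
```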
4.3 Global Memory Module
Since the agent sometimes revisits the target or landmarks, we design a global memory module with a memory graph that stores historical trajectories, which helps reduce the action space in motion planning and improve navigation robustness.
Different from prior works that rely on predefined
memory graphs (Chen et al., 2021a, 2022) or 2D
grid maps (Wang et al., 2023b) lacking height infor-
mation, our approach constructs a 3D topological
memory graph from the agent’s navigation history.
Memory Graph Construction Each historical trajectory $H_i$ can be represented as a topological graph $G_i(N_i, E_i)$ whose nodes $N_i$ encapsulate both the coordinates of the traversed waypoints and their panoramic observations, and whose edges $E_i$ are weighted by the distance between adjacent waypoints. The memory graph $M$ is constructed by merging all the historical trajectory graphs, given by:
$$M = G(N, E), \quad N = N_1 \cup \cdots \cup N_d, \quad E = E_1 \cup \cdots \cup E_d, \qquad (8)$$
where $d$ is the number of historical trajectories.
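As a concrete (but assumed) representation, the memory graph of Eq. (8) can be held in a `networkx` graph whose nodes store waypoint coordinates and panoramic observations:

```python
import numpy as np
import networkx as nx

def build_memory_graph(trajectories):
    """Merge per-trajectory topological graphs into the global memory graph M (Eq. 8).

    `trajectories` is a list of successful trajectories, each a list of
    (position, panoramic_observation) waypoints. Consecutive waypoints are linked
    by edges weighted with their Euclidean distance.
    """
    memory = nx.Graph()
    for t_id, traj in enumerate(trajectories):
        for k, (pos, obs) in enumerate(traj):
            memory.add_node((t_id, k), pos=np.asarray(pos, dtype=float), obs=obs)
            if k > 0:
                prev = np.asarray(traj[k - 1][0], dtype=float)
                dist = float(np.linalg.norm(np.asarray(pos, dtype=float) - prev))
                memory.add_edge((t_id, k - 1), (t_id, k), weight=dist)
    return memory
```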
Memory Graph Update The memory graph is updated progressively by merging each newly generated trajectory graph $G_{hist}$. The merging process follows Equation 8, taking the union of the nodes and edges of the two graphs. In addition, new edges are created where $M$ and $G_{hist}$ can be connected: we compute the distance between every pair of nodes across the two graphs, and if the distance is less than a threshold $H = 15$ m, the nodes are inferred to be adjacent and a new edge is added to the merged graph. Note that the memory graph only merges trajectories that successfully navigate to the target, ensuring the validity of the waypoints in the memory graph.
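The update rule can be sketched on top of the same representation: take the union of the two graphs and then bridge cross-graph node pairs that lie closer than the 15 m threshold. The sketch assumes node identifiers are unique across trajectories and that each node carries a `pos` attribute as above.

```python
import numpy as np
import networkx as nx

def update_memory_graph(memory: nx.Graph, new_traj: nx.Graph, h: float = 15.0) -> nx.Graph:
    """Merge a newly completed successful-trajectory graph into the memory graph.

    After the union of nodes and edges, any pair of nodes from the two graphs whose
    distance is below the threshold h (15 m) is treated as adjacent and connected.
    """
    merged = nx.compose(memory, new_traj)                    # union of nodes and edges
    for u in memory.nodes:
        for v in new_traj.nodes:
            if u == v:
                continue
            d = float(np.linalg.norm(memory.nodes[u]["pos"] - new_traj.nodes[v]["pos"]))
            if d < h:
                merged.add_edge(u, v, weight=d)              # bridge the two trajectories
    return merged
```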
Memory Graph Search for Motion-level Planning When the agent reaches a waypoint in the memory graph, it directly leverages the graph to determine a path and action sequence that fulfill the remaining sub-goals from the landmark-level planner. Given a sequence of remaining sub-goals $L^{(r)} = \{l_1, \ldots, l_r\}$ represented by landmark phrases, our target is to find a path $V^* = \{v_1, \ldots, v_d\} \subseteq N$ with the highest probability of traversing $L^{(r)}$ in order. Note that each node $v_i$ in the graph contains a visual observation $o_{v_i}$ of its surroundings. Therefore, the probability of traversing landmark $l_j$ at node $v_i$ can be formulated as $P(l_j \mid o_{v_i})$, and the objective function can be formulated as:
$$V^* = \arg\max_{V} \prod_{k=1}^{r} P(l_k \mid o_{v_{m_k}}), \quad 1 \le m_1 < \cdots < m_r \le d. \qquad (9)$$
We leverage the graph search algorithm in LM-Nav (Shah et al., 2023) to solve this problem, which is detailed in Alg. 1 in Appendix A.2.1.
Once $V^*$ is obtained, the agent decomposes it into an executable action sequence to reach the target. To reduce the complexity and inefficiency of searching over the full global memory graph, the agent extracts a subgraph for navigation. The agent first selects nodes and edges within a radius $R$ to form a spherical subgraph. It then computes the semantic relevance between subgraph nodes and the landmarks, and applies non-maximum suppression to prune redundant nodes. This results in a sparse memory subgraph for efficient and accurate navigation. More details are provided in Appendix A.2.2.
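For illustration, the objective of Eq. (9) can be maximized with a simple dynamic program over states (node, number of landmarks matched so far), relaxed until a fixpoint; the actual method follows LM-Nav's search (Alg. 1). The `score` callable, which turns a node's stored observation into $P(l \mid o_v)$ (e.g., via an image-text similarity), is an assumed interface.

```python
import math
from itertools import product
import networkx as nx

def search_memory_graph(memory: nx.Graph, start, landmarks, score, eps=1e-9):
    """Sketch of Eq. (9): maximize the log-probability of matching landmarks in order.

    State (v, k) means "the walk currently ends at node v and has matched the first
    k landmarks". Moving to a neighbour keeps k; matching landmark k at v adds
    log P(l_k | o_v). Returns the nodes at which the landmarks are matched.
    """
    r = len(landmarks)
    best = {s: -math.inf for s in product(memory.nodes, range(r + 1))}
    matched = {s: [] for s in best}          # nodes where landmarks were matched
    best[(start, 0)] = 0.0
    changed = True
    while changed:                           # Bellman-Ford style relaxation to a fixpoint
        changed = False
        for (v, k), lp in list(best.items()):
            if lp == -math.inf:
                continue
            options = [((u, k), lp, matched[(v, k)]) for u in memory.neighbors(v)]
            if k < r:                        # match landmark k at the current node
                p = max(score(landmarks[k], memory.nodes[v]["obs"]), eps)
                options.append(((v, k + 1), lp + math.log(p), matched[(v, k)] + [v]))
            for state, cand, nodes in options:
                if cand > best[state]:
                    best[state], matched[state] = cand, nodes
                    changed = True
    ends = [(v, r) for v in memory.nodes if best[(v, r)] > -math.inf]
    if not ends:
        return None                          # no walk matches all landmarks
    return matched[max(ends, key=lambda s: best[s])]
```

Connecting consecutive matched nodes with shortest paths in the pruned memory subgraph then yields the waypoint sequence that the agent decomposes into executable actions.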
To conclude, with the two specially designed perception and planning modules, along with the memory module, the aforementioned key challenges of aerial VLN are addressed one by one.