Set goals, then achieve them

The blogger set a plan running from May 6, 2019 to January 1, 2020, including running home after work every day (unless it rains), spending one hour a day learning English, reading every morning, and writing one technical blog post per week. The goals: lose 10 kg, become a technical sharing expert, use English fluently, and obtain PMP certification.

My Plan

  1. Run home every day unless it is raining. (After work)
  2. Spend at least 1 hour per day learning English. (19:00-20:00)
  3. Read every morning. (7:00-8:30)
  4. Write at least one technical blog post every week.

My 2019 Goals

  1. Lose 10 kg.
  2. Become a technical sharing specialist.
  3. Speak and write English fluently.
  4. Obtain the PMP certification.

Keep on going.

This plan runs from May 6, 2019 to January 1, 2020.

Believe in yourself and you can achieve these goals.

 

Signature: Mascot

Date: 5/6/2019

3 Problem Formulation

Given a language instruction I and the agent's egocentric observation O, the aerial VLN agent has to determine a sequence of actions to reach the target location p_d in a continuous 3D space. At each action step t, the agent follows a policy \pi that takes the current observation o_t and the instruction I as input to predict the next action a_t and move to location p_t via its kinematic model F, which is given by:

    p_t = F(p_{t-1}, \pi(o_t, I)),    (1)

Given a sequence of actions, the agent reaches a final position, and the success probability P_s of reaching the target p_d is

    P_s = P(\|F(\pi(p_0, O, I)) - p_d\| < \epsilon),    (2)

where \|\cdot\| is the Euclidean distance and \epsilon is the threshold that indicates whether the target is reached. Thus, the goal of VLN is to find a policy \pi^* that maximizes the success rate, given by:

    \pi^* = \arg\max_{\pi} P_s.    (3)

4 CityNavAgent

In this section, we present the workflow of the proposed CityNavAgent for zero-shot aerial VLN in urban environments. As shown in Figure 1, the CityNavAgent framework comprises three key modules. 1) The open-vocabulary perception module extracts structured semantic and spatial information from the agent's panoramic observation via foundation models. 2) The hierarchical semantic planning module (HSPM) leverages an LLM to decompose the navigation task into hierarchical planning sub-tasks, which reduces planning complexity, and predicts intermediate waypoints for the execution module. 3) The global memory module, represented as a topological graph, stores valid waypoints extracted from historical trajectories to further reduce the action space of motion planning and enhance long-range navigation stability.

4.1 Open-vocabulary Perception Module

To accurately understand the complex semantics of the urban environment, we leverage powerful open-vocabulary captioning and grounding models to extract scene semantic features. In addition, by integrating the scene semantics with depth information, we construct a 3D semantic map for further planning.

[Figure 2: CityNavAgent consists of three key modules. The open-vocabulary perception module extracts open-vocabulary objects in the scene and builds a semantic point cloud of the surroundings. The hierarchical semantic planning module decomposes the original instruction into sub-goals at different semantic levels and predicts the agent's next action to achieve the high-level sub-goals. The global memory module stores historical trajectories to assist motion planning toward visited targets.]

Scene Semantic Perception. Extracting urban scene semantics requires robust open-vocabulary object recognition. As shown in Figure 2, given a set of panoramic images \{I^t_{R_i}\} at step t, we first use an open-vocabulary image captioner CAP(·), empowered by GPT-4V (Achiam et al., 2023), to generate object captions c^t_i for each image I^t_{R_i}. Then, we leverage a visual detector VD(·), GroundingDINO (Liu et al., 2023a), to generate bounding boxes obb^t_i for the captioned objects:

    obb^t_i = VD(c^t_i, I^t_{R_i}),  c^t_i = CAP(I^t_{R_i}).    (4)

Finally, the bounding boxes are tokenized by a visual tokenizer VT(·) and fed into a segmentation model SEG(·) (Kirillov et al., 2023) to generate fine-grained semantic masks I^t_{S_i} for the objects:

    I^t_{S_i} = SEG(VT(obb^t_i), I^t_{R_i}).    (5)
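To make the data flow of Eqs. (4)-(5) concrete, here is a minimal sketch of the perception pipeline. The wrappers caption_objects, detect_boxes, and segment_box are hypothetical placeholders standing in for the GPT-4V, GroundingDINO, and SAM calls; they are not the paper's actual interfaces.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical wrappers around GPT-4V (CAP), GroundingDINO (VD), and
# SAM (SEG); each would invoke the corresponding foundation model.
def caption_objects(image: np.ndarray) -> list[str]: ...
def detect_boxes(caption: str, image: np.ndarray) -> list[np.ndarray]: ...
def segment_box(box: np.ndarray, image: np.ndarray) -> np.ndarray: ...

@dataclass
class SceneObject:
    caption: str        # c_i^t, open-vocabulary object phrase
    box: np.ndarray     # obb_i^t, (x1, y1, x2, y2)
    mask: np.ndarray    # I_Si^t, boolean segmentation mask

def perceive(panorama: list[np.ndarray]) -> list[SceneObject]:
    """Apply Eqs. (4)-(5) to every panoramic view at the current step."""
    objects = []
    for image in panorama:
        for caption in caption_objects(image):          # c_i^t = CAP(I_Ri^t)
            for box in detect_boxes(caption, image):    # obb_i^t = VD(c_i^t, I_Ri^t)
                mask = segment_box(box, image)          # I_Si^t = SEG(VT(obb_i^t), I_Ri^t)
                objects.append(SceneObject(caption, box, mask))
    return objects
```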
Scene Spatial Perception. Considering that egocentric views suffer from perspective overlap (An et al., 2023) and fail to capture 3D spatial relationships (Gao et al., 2024b), we construct a 3D point map by projecting the segmentation masks of the observations into metric 3D space. Leveraging the RGB-D sensor's depth map I^t_{D_i} and the agent's pose (R, T), a geometric projector (GP) transforms each segmented pixel p_{ik} = (u, v) \in I^t_{S_i}, labeled with caption c^t_{ik}, into a 3D point P_{ik} via:

    P_{ik} = R \cdot I^t_{D_i}(u, v) \cdot K^{-1} \cdot \tilde{p}_{ik} + T,    (6)

where \tilde{p}_{ik} = (u, v, 1)^T is the pixel in homogeneous coordinates, K is the intrinsic matrix of the camera, and R \in SO(3) and T \in \mathbb{R}^3 represent the agent's instantaneous orientation and position in world coordinates. By mapping the object captions from the 2D masks onto the 3D points, a local semantic point cloud \{(P_{ik}, c^t_{ik}) | i = 1, ..., n, k = 1, ..., m\} is constructed, where n is the number of panoramic images and m is the number of segmented pixels.
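A minimal numpy sketch of the back-projection in Eq. (6), assuming a pinhole camera model and a depth map aligned with the RGB frame (variable names are illustrative, not from the paper's code):

```python
import numpy as np

def backproject(depth: np.ndarray, mask: np.ndarray, K: np.ndarray,
                R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Lift every masked pixel into world coordinates via Eq. (6):
    P = R * depth(u, v) * K^{-1} * (u, v, 1)^T + T."""
    v, u = np.nonzero(mask)                     # pixel coordinates inside the mask
    pix = np.stack([u, v, np.ones_like(u)])     # homogeneous pixels, shape (3, N)
    rays = np.linalg.inv(K) @ pix               # K^{-1} p, camera-frame rays
    cam_pts = rays * depth[v, u]                # scale each ray by its depth
    return (R @ cam_pts).T + T                  # rotate to world frame, translate
```

Each returned row is a world-frame point P_ik that inherits the caption of its mask, which is how the local semantic point cloud above is assembled.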
4.2 Hierarchical Semantic Planning Module

4.2.1 Landmark-level Planning

Since aerial VLN tasks typically involve long-range decision-making (Liu et al., 2023b; Chen et al., 2019), directly assigning the entire navigation task to the agent can hinder accurate alignment with the instructions and task progress tracking. A more effective approach is to decompose the task into a sequence of manageable sub-goals. By addressing these sub-goals step by step, the agent can progressively reach the final destination. To achieve this, we propose a landmark-level planner driven by an LLM (Achiam et al., 2023) that parses the free-form instruction T and extracts a sequence of landmark phrases L along the path through prompt engineering. These landmarks act as sub-goals for the agent. We present a simple prompt as follows (more details in Appendix A):

    You need to extract a landmark sequence from the given instruction. The sequence order should be consistent with their appearance order on the path.

4.2.2 Object-level Planning

After landmark-level planning has produced a sequence of sub-goals, the object-level planner OP(·) employs the LLM to further decompose these sub-goals into more achievable steps for the agent. The key idea is to leverage the commonsense knowledge of the LLM to reason about the visible object region most pertinent to the currently invisible sub-goal in the current panorama. This region is referred to as the object region of interest (OROI) in this paper. For example, if the agent only sees buildings and a road in the current view while its sub-goal is a traffic light, then by commonsense reasoning the next OROI it should go to is the road. We design a prompt that comprises the original navigation instruction T, the scene object captions c^t, and the current sub-goal L_i for OP(·) to reason about the OROI c^t_{OROI}, which is given by:

    c^t_{OROI} = OP(T, L_i, c^t).    (7)

Its template is (more details in Appendix A.1):

    The navigation instruction is: ....
    Your next navigation subgoal is: ...
    Objects or areas you observed: ...
    Based on the instruction, next subgoal, and observation, list 3 objects most pertinent to the subgoal or that you will probably go to next from your [Observed Objects]. Output should be in descending order of probability.

We select the OROI to which the LLM assigns the highest probability of leading to the next landmark as the agent's next waypoint.

4.2.3 Motion-level Planning

Motion-level planning is responsible for translating the output of the high-level planning modules into reachable waypoints and executable actions for the agent. Given a reasoned c^t_{OROI}, the motion-level planner first selects the corresponding points \{(P_k, c^t_k) | c^t_k = c^t_{OROI}\} from the semantic point cloud in §4.1 and computes the next waypoint by averaging the coordinates of the selected points. Then, the planner decomposes the path to the waypoint into a sequence of executable actions for the agent. If the agent has reached a location close to the memory graph, the motion planner directly uses the memory graph to predict the agent's future actions, as introduced in the next section.
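A small sketch of the motion-level waypoint computation, assuming the semantic point cloud from §4.1 is held as a list of (point, caption) pairs; the names are illustrative, not the paper's implementation:

```python
import numpy as np

def next_waypoint(point_cloud: list[tuple[np.ndarray, str]], oroi: str) -> np.ndarray:
    """Average the 3D coordinates of all points whose caption matches the OROI."""
    pts = np.array([p for p, caption in point_cloud if caption == oroi])
    if pts.size == 0:
        raise ValueError(f"OROI '{oroi}' not found in the semantic point cloud")
    return pts.mean(axis=0)   # centroid of the matching object region
```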
4.3 Global Memory Module

Since the agent sometimes revisits targets or landmarks, we design a global memory module with a memory graph that stores historical trajectories, which helps to reduce the action space in motion planning and improves navigation robustness. Different from prior works that rely on predefined memory graphs (Chen et al., 2021a, 2022) or 2D grid maps (Wang et al., 2023b) lacking height information, our approach constructs a 3D topological memory graph from the agent's navigation history.

Memory Graph Construction. Each historical trajectory H_i can be represented as a topological graph G_i(N_i, E_i) whose nodes N_i encapsulate both the coordinates of the traversed waypoints and their panoramic observations, and whose edges E_i are weighted by the distance between adjacent waypoints. The memory graph M is constructed by merging all the historical trajectory graphs, given by:

    M = G(N, E),  N = N_1 \cup \cdots \cup N_d,  E = E_1 \cup \cdots \cup E_d,    (8)

where d is the number of historical trajectories.

Memory Graph Update. The memory graph is updated progressively by merging each newly generated historical trajectory graph G_hist. The merging process follows Equation 8, merging the nodes and edges of the two graphs. In addition, new edges are generated if M and G_hist are connectable: we calculate the distance between every pair of nodes in the two graphs, and if the distance between a pair of nodes is less than a threshold H = 15 m, these nodes are inferred to be adjacent and a new edge is added to the merged graph. Note that the memory graph only merges trajectories that successfully navigate to the target, ensuring the validity of the waypoints in the memory graph.

Memory Graph Search for Motion-level Planning. When the agent reaches a waypoint in the memory graph, it directly leverages the graph to determine a path and action sequence that fulfills the remaining sub-goals from the landmark-level planner. Given a sequence of remaining sub-goals L^{(r)} = \{l_1, ..., l_r\} represented by landmark phrases, our target is to find a path V^* = \{v_1, ..., v_d\} \subseteq N with the highest probability of traversing L^{(r)} in order. Note that each node N_i in the graph contains a visual observation o_{v_i} of its surroundings; therefore, the probability that node N_i matches landmark l_j can be formulated as P(l_j | o_{v_i}). The objective function can be formulated as:

    V^* = \arg\max_V \prod_{k=1}^{r} P(l_k | o_{v_{m_k}}),  1 \le m_1 < \cdots < m_r \le d.    (9)

We leverage the graph search algorithm in LM-Nav (Shah et al., 2023) to solve this problem, detailed in Alg. 1 in Appendix A.2.1. Once V^* is obtained, the agent decomposes it into an executable action sequence to reach the target.

To reduce the high complexity and inefficiency of searching over the full global memory graph, the agent extracts a subgraph for navigation. The agent first selects nodes and edges within a radius R to form a spherical subgraph. It then computes the semantic relevance between the subgraph nodes and the landmarks and applies non-maximum suppression to prune redundant nodes. This results in a sparse memory subgraph for efficient and accurate navigation. More details are in Appendix A.2.2.

To conclude, with the two specially designed perception and planning modules, along with the memory module, the aforementioned key challenges of aerial VLN are addressed one by one.
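As a closing illustration of the update rule, here is a minimal networkx sketch of merging a new trajectory graph into the memory graph with the H = 15 m adjacency threshold. The "pos" node attribute and function names are assumptions for the sketch, not the paper's code:

```python
import itertools

import networkx as nx
import numpy as np

H = 15.0  # adjacency threshold in meters (assumed to match the paper's H)

def update_memory(memory: nx.Graph, traj: nx.Graph) -> nx.Graph:
    """Merge a successful trajectory graph into the memory graph (Eq. 8),
    then connect cross-graph node pairs that lie within H meters."""
    merged = nx.compose(memory, traj)           # union of nodes and edges
    for u, v in itertools.product(memory.nodes, traj.nodes):
        # assumes each node stores its 3D waypoint under the "pos" attribute
        d = np.linalg.norm(np.asarray(memory.nodes[u]["pos"])
                           - np.asarray(traj.nodes[v]["pos"]))
        if u != v and d < H:
            merged.add_edge(u, v, weight=d)     # new edge between adjacent waypoints
    return merged
```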