Abstract: LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, existing LLM-based methods often focus only on solving high-level task planning by selecting nodes in predefined navigation graphs for movements, overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for continuous VLN tasks. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach, where the visible ground is segmented by SAM to provide navigational affordances, based on which the LLM selects potential candidate waypoints and plans low-level paths towards selected waypoints. We further propose a high-level PathAgent which marks planned paths into the image input and reasons about the most probable path by comprehending all environmental information. Finally, we convert the selected path into 3D coordinates using camera intrinsic parameters and depth information, avoiding challenging 3D predictions for LLMs. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (8.8% improvement on SPL). Our method can also serve as a data annotator to obtain pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor does not require any waypoint data from the simulator and achieves 47% SR, competitive with supervised methods. We establish an effective connection between LLMs and the 3D world, presenting novel prospects for employing foundation models in low-level motion control.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses how to use large language models (LLMs) for low-level motion planning in continuous vision-language navigation (VLN-CE). Existing LLM-based methods typically focus only on high-level task planning, moving by selecting nodes in a predefined navigation graph, and neglect low-level control in the navigation scene. To solve this problem, the authors propose AO-Planner, an affordances-oriented planner designed to bridge the gap between high-level decision-making and low-level motion planning.

### Main Contributions

1. **AO-Planner**: A new zero-shot affordances-oriented planning framework. It uses foundation models for affordances-oriented planning and converts predictions in RGB space into 3D coordinates, bridging the gap between LLM high-level decision-making and navigation in the 3D world.
2. **Visual Affordances Prompting (VAP)**: A new visual prompting method that draws on the spatial understanding and reasoning capabilities of LLMs, revealing new potential for LLMs in low-level motion planning.
3. **Zero-shot Performance**: AO-Planner achieves state-of-the-art zero-shot performance without requiring any simulator data, and it can provide reliable pseudo-labels for training supervised models that reach competitive performance.

### Method Overview

1. **Task Definition**:
   - In each episode, the agent must move from the starting point to the target location by following a given instruction.
   - At each step $t$, the agent receives observations $O_t$ from different directions, consisting of four non-overlapping views (front, back, left, right).
   - The action space consists of four parameterized low-level actions: FORWARD (0.25 meters), ROTATE LEFT/RIGHT (15°), and STOP.
2. **Framework Overview**:
   - Use the Grounded SAM model to obtain navigational affordances and sample points from them for the LLM agent to choose among.
   - Design prompts for the low-level agent to search for potential waypoints and plan the corresponding paths.
   - Visualize the candidate results and feed them to the second-stage high-level agent, which combines the instruction and historical information to make the final movement decision.
   - Using depth information and camera intrinsics, convert the points predicted in RGB space in the first stage into 3D world coordinates, then convert those into low-level actions.
3. **Visual Affordances Prompting (VAP)**:
   - Distribute a set of points evenly in each view and label them sequentially to produce augmented views.
   - Query the LLM to select appropriate points and return their IDs, rather than directly predicting coordinates in RGB space or the 3D world.
   - Combine the current task instruction with the low-level task description, asking the LLM to select waypoints and plan paths.
4. **High-level PathAgent**:
   - After obtaining the potential low-level waypoints and path predictions, introduce a second agent, PathAgent, for high-level decision-making.
   - Visualize the candidate waypoints and paths predicted in the first stage and label them with IDs to produce augmented observations.
   - Provide the high-level task description, the instruction, and historical information, asking the LLM to output an interpretable thought process and select a path as the action.
5. **3D Mapping and Motion Control**:
   - Once the agent selects a waypoint as a sub-goal, the planned path must be converted into a series of actions that guide the agent along it.
   - Use camera intrinsics and depth information to map pixel coordinates to 3D world coordinates.
   - Compute the relative direction and distance between consecutive points in world coordinates and convert them into a series of low-level ROTATE and FORWARD actions.
6. **Waypoint Distillation**:
   - Use the LLM with VAP as a zero-shot waypoint predictor and explore transferring this capability to a learning-based waypoint predictor.
   - Use the waypoints predicted by the LLM as pseudo-labels to train the waypoint predictor in ETPNav, then fine-tune it together with a high-level VLN agent.
   - In this way, competitive performance can be achieved without relying on a large amount of simulator data.
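
The 3D Mapping and Motion Control step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a pinhole camera whose optical axis is level with the ground, a depth value measured along that axis in meters, and an agent-centric frame (x to the right, z forward) in place of full world coordinates. The function names `pixel_to_camera` and `path_to_actions` and the action strings are hypothetical; only the 0.25 m step and 15° turn come from the task definition.

```python
import math
from typing import List, Tuple

FORWARD_STEP_M = 0.25  # FORWARD step size from the task definition
TURN_STEP_DEG = 15.0   # ROTATE LEFT/RIGHT angle from the task definition


def pixel_to_camera(u: float, v: float, depth_m: float,
                    fx: float, fy: float, cx: float, cy: float
                    ) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with depth into the camera frame.

    Assumed pinhole model: x points right, y points down, z points forward.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


def path_to_actions(points_xz: List[Tuple[float, float]]) -> List[str]:
    """Turn ground-plane waypoints into discrete ROTATE/FORWARD actions.

    The agent is assumed to start at the origin facing +z. Heading is kept
    in degrees, with positive values meaning a turn to the right.
    """
    actions: List[str] = []
    pos_x, pos_z, heading = 0.0, 0.0, 0.0
    for tx, tz in points_xz:
        dx, dz = tx - pos_x, tz - pos_z
        # Bearing of the next waypoint, then the signed turn in [-180, 180).
        target = math.degrees(math.atan2(dx, dz))
        turn = (target - heading + 180.0) % 360.0 - 180.0
        n_turn = round(abs(turn) / TURN_STEP_DEG)
        actions += ["ROTATE_RIGHT" if turn > 0 else "ROTATE_LEFT"] * n_turn
        heading += math.copysign(n_turn * TURN_STEP_DEG, turn)
        # Quantize the distance into fixed-size forward steps.
        n_fwd = round(math.hypot(dx, dz) / FORWARD_STEP_M)
        actions += ["FORWARD"] * n_fwd
        # Dead-reckon with the quantized motion so rounding error is
        # taken into account when planning the next segment.
        pos_x += n_fwd * FORWARD_STEP_M * math.sin(math.radians(heading))
        pos_z += n_fwd * FORWARD_STEP_M * math.cos(math.radians(heading))
    return actions
```

For example, a selected pixel can be back-projected with `x, _, z = pixel_to_camera(u, v, depth, fx, fy, cx, cy)` and the resulting `(x, z)` pairs passed to `path_to_actions`. Because both turns and steps are quantized, the agent only approximately reaches each waypoint; a real system would likely re-observe and re-plan after each sub-goal.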

Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation

ChatNav: Leveraging LLM to Zero-shot Semantic Reasoning in Object Navigation

Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation

Object-and-Action Aware Model for Visual Language Navigation

Vision and Language Navigation in the Real World via Online Visual Language Mapping

LLM As Copilot for Coarse-grained Vision-and-Language Navigation

VLAI: Exploration and exploitation based on visual-language aligned information for robotic object goal navigation

Vision-and-Language Navigation via Latent Semantic Alignment Learning

RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation

NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation

Learning to Act with Affordance-Aware Multimodal Neural SLAM

DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation

Vision-Language Navigation Policy Learning and Adaptation

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation

Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs