Abstract: Vision-and-Language Navigation (VLN), a widely discussed research direction in embodied intelligence, aims to enable embodied agents to navigate complicated visual environments by following natural language commands. Most existing VLN methods focus on indoor ground-robot scenarios. When applied to UAV VLN in outdoor urban scenes, however, these methods face two significant challenges. First, urban scenes contain numerous objects, which makes it difficult to match fine-grained landmarks in images with the complex textual descriptions of those landmarks. Second, the overall environmental information spans multiple modalities, and the diversity of representations significantly increases the complexity of the encoding process. To address these challenges, we propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model. NavAgent performs navigation tasks by synthesizing multi-scale environmental information: topological maps (global), panoramas (medium), and fine-grained landmarks (local). Specifically, we use GLIP to build a landmark visual recognizer capable of identifying fine-grained landmarks and expressing them in language. We then develop a dynamically growing scene topology map that integrates environmental information, and we employ Graph Convolutional Networks to encode the global environmental data. In addition, to train the landmark visual recognizer, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes. In experiments conducted on the Touchdown and Map2seq datasets, NavAgent outperforms strong baseline models. The code and dataset will be released to the community to facilitate the exploration and development of outdoor VLN.
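
To make the idea of a dynamically growing topology map encoded by a graph convolution more concrete, here is a minimal sketch. It is not NavAgent's implementation. The `TopoMap` class, the `gcn_layer` function, the two-dimensional node features, and the identity weight matrix are all hypothetical, and the propagation rule is the standard symmetric-normalized graph convolution, which may differ from what the paper uses.

```python
import math


class TopoMap:
    """Hypothetical growing topological map: one node per visited viewpoint."""

    def __init__(self):
        self.features = []   # one feature vector per node
        self.edges = set()   # undirected edges stored as (i, j) with i < j

    def add_node(self, feature, neighbors=()):
        """Append a node and connect it to already existing nodes."""
        idx = len(self.features)
        self.features.append(list(feature))
        for n in neighbors:
            self.edges.add((min(n, idx), max(n, idx)))
        return idx

    def normalized_adjacency(self):
        """Return D^-1/2 (A + I) D^-1/2 as a nested list."""
        n = len(self.features)
        a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        for i, j in self.edges:
            a[i][j] = a[j][i] = 1.0
        deg = [sum(row) for row in a]
        return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
                for i in range(n)]


def gcn_layer(a_hat, x, w):
    """One graph-convolution step: ReLU(A_hat @ X @ W)."""
    n, d_in, d_out = len(x), len(w), len(w[0])
    xw = [[sum(x[i][k] * w[k][j] for k in range(d_in)) for j in range(d_out)]
          for i in range(n)]
    return [[max(0.0, sum(a_hat[i][k] * xw[k][j] for k in range(n)))
             for j in range(d_out)]
            for i in range(n)]


# The map grows as the agent moves: each step adds a node linked to the last one.
topo = TopoMap()
n0 = topo.add_node([1.0, 0.0])
n1 = topo.add_node([0.0, 1.0], neighbors=[n0])
n2 = topo.add_node([1.0, 1.0], neighbors=[n1])

w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, chosen only for illustration
h = gcn_layer(topo.normalized_adjacency(), topo.features, w)
```

After one layer, each row of `h` mixes a node's own feature with those of its direct neighbors, so stacking layers would spread information across a wider portion of the map.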

Visual-and-Language Multimodal Fusion for Sweeping Robot Navigation Based on CNN and GRU

Visual Language Maps for Robot Navigation

Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models

Multimodal audio-visual robot fusing 3D CNN and CRNN for player behavior recognition and prediction in basketball matches

Audio Visual Language Maps for Robot Navigation

Multimodal integration learning of robot behavior using deep neural networks

L3MVN: Leveraging Large Language Models for Visual Target Navigation

NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation

Multimodal sensory fusion for soccer robot self-localization based on long short-term memory recurrent neural network

Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation

IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation

Decision Making of Mobile Robot based on Multimodal Fusion

Vision and Language Navigation in the Real World via Online Visual Language Mapping

Bilateral Cross-Modal Fusion Network for Robot Grasp Detection

A Hybrid Approach to Real-Time Robotic Visual Navigation: Integrating Detection and Scene Segmentation

A perceptual manipulation system for audio-visual fusion of robots

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Multi-model fusion for Aerial Vision and Dialog Navigation based on human attention aids

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation