Abstract: In autonomous driving, Vision-and-Language Navigation (VLN) is a typical multimodal task in which an intelligent vehicle must find a target location by following navigation instructions provided by the user. Conventional VLN models, however, generalize poorly when faced with the large variety of objects and language instructions found in real-world environments. This paper proposes a novel VLN system based on large-scale pre-trained models and applies it to intelligent vehicles. The method consists of an Instruction Extraction System, a Vision-Language Association System, and a Navigational Decisions System. Specifically, a pre-trained Large Language Model (LLM) first extracts a sequence of landmark names from the user's natural language instructions. The landmark list is then fed to a pre-trained Vision-Language Model (VLM), which infers the joint probability between each landmark and the objects in the environment, and the image nodes that match the landmarks are selected. The selected image nodes are passed to a second VLM to obtain textual descriptions of them. Finally, an LLM reasons about the navigation actions for the intelligent vehicle: it takes navigation knowledge, visual environment descriptions, and navigation history as inputs and outputs a navigation action. Simulation results show that, compared with fully supervised learning methods, this approach generalizes better to unknown environments. On Google Maps Street View data, it achieves a 14.6% higher task success rate than the baseline model VLN Transformer.
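The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the similarity matrix, and the prompt layout are all hypothetical, and the scores that a pre-trained VLM would produce are replaced here by plain input numbers. The sketch only shows how landmark-to-node scores could be turned into probabilities, how the best-matching image node could be selected per landmark, and how the three inputs named in the abstract could be assembled into a prompt for an LLM.

```python
import math
from typing import Dict, List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def match_landmarks(
    landmarks: List[str],
    node_ids: List[str],
    scores: List[List[float]],
) -> Dict[str, Tuple[str, float]]:
    """For each landmark, pick the image node with the highest probability.

    scores[i][j] is a hypothetical similarity between landmark i and
    image node j. In the described system a pre-trained VLM would supply
    these values; here they are plain input data (an assumption).
    """
    matches = {}
    for landmark, row in zip(landmarks, scores):
        probs = softmax(row)
        best = max(range(len(node_ids)), key=lambda j: probs[j])
        matches[landmark] = (node_ids[best], probs[best])
    return matches


def build_prompt(knowledge: str, descriptions: List[str], history: List[str]) -> str:
    """Assemble the three inputs named in the abstract into one LLM prompt.

    The prompt layout is illustrative only.
    """
    return "\n".join([
        "Navigation knowledge: " + knowledge,
        "Visual descriptions: " + "; ".join(descriptions),
        "Navigation history: " + (", ".join(history) or "none"),
        "Next action:",
    ])
```

For example, `match_landmarks(["bank", "cafe"], ["n0", "n1", "n2"], [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]])` pairs "bank" with node `n1` and "cafe" with node `n0`, since those nodes carry the highest scores in their rows.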

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models