Abstract: In autonomous driving, Vision-and-Language Navigation (VLN) is a typical multimodal task in which an intelligent vehicle must find a target location based on user-provided navigation instructions. However, conventional VLN models generalize poorly when faced with the large number of objects and language instructions found in real-world environments. This paper proposes a novel VLN system based on large-scale pre-trained models and applies it to intelligent vehicles. The method consists of an Instruction Extraction System, a Vision-Language Association System, and a Navigational Decisions System. Specifically, a pre-trained Large Language Model (LLM) is first used to extract a series of landmark names from the user’s natural language instructions. The list of landmark names is then input into a pre-trained Vision-Language Model (VLM) to infer the joint probability between landmarks and environmental objects, and the image nodes that match the landmarks are selected. The selected image nodes are then input into another VLM to obtain descriptions of the image nodes. Finally, an LLM is used to reason about navigation actions for the intelligent vehicle: drawing on the reasoning ability of the LLM, the vehicle takes navigation knowledge, visual environment descriptions, and navigation history as inputs and outputs navigation actions. Simulation results show that, compared with fully supervised learning methods, this approach generalizes better in unknown environments. On Google Maps Street View data, it achieves a task success rate 14.6% higher than the baseline model, VLN Transformer.
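The landmark-to-image matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that some VLM has already produced a similarity score for every (landmark, image node) pair, turns each landmark's scores into a probability distribution with a softmax, and picks the most probable node. The function names `softmax` and `match_landmarks`, the score values, and the node identifiers are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_landmarks(landmarks, node_ids, similarity):
    """Pick the most probable image node for each landmark.

    similarity[i][j] is an assumed VLM score between landmark i and
    image node j (hypothetical input, not produced by this sketch).
    Returns {landmark: (node_id, probability)}.
    """
    matches = {}
    for name, row in zip(landmarks, similarity):
        probs = softmax(row)
        best = max(range(len(node_ids)), key=lambda j: probs[j])
        matches[name] = (node_ids[best], probs[best])
    return matches


# Illustrative values only: two landmarks, three candidate image nodes.
landmarks = ["gas station", "bridge"]
node_ids = ["n0", "n1", "n2"]
similarity = [
    [0.1, 2.0, 0.3],  # "gas station" scores against n0, n1, n2
    [1.5, 0.2, 0.1],  # "bridge" scores against n0, n1, n2
]
matches = match_landmarks(landmarks, node_ids, similarity)
```

In this sketch each landmark is matched independently; a system that enforces the order in which landmarks appear in the instruction would need an additional search over node sequences.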

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Vision and Language Navigation in the Real World via Online Visual Language Mapping

Real-time Vision-Language-Navigation based on a Lite Pre-training Model

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

Vision-Language Navigation Policy Learning and Adaptation

Multimodal Large Language Model for Visual Navigation

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning

Visual Perception Generalization for Vision-and-Language Navigation via Meta-Learning

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

BEVBert: Multimodal Map Pre-training for Language-guided Navigation

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

VLP: Vision Language Planning for Autonomous Driving

Depth-Aware Vision-and-Language Navigation Using Scene Query Attention Network