Abstract: Vision-language navigation (VLN) is a critical domain within embodied intelligence, requiring agents to navigate 3D environments based on natural language instructions. Traditional VLN research has focused on improving environmental understanding and decision accuracy. However, these approaches often exhibit a significant performance gap when agents are deployed in novel environments, mainly due to the limited diversity of training data. Expanding datasets to cover a broader range of environments is impractical and costly. To address this challenge, we propose the Vision-Language Navigation with Continual Learning (VLNCL) paradigm, in which agents incrementally learn new environments while retaining previously acquired knowledge. VLNCL enables agents to maintain an environmental memory and extract relevant knowledge from it, allowing rapid adaptation to new environments while preserving existing information. We introduce a novel dual-loop scenario replay method (Dual-SR), inspired by memory replay mechanisms in the brain and integrated with VLN agents. This method facilitates the consolidation of past experiences and enhances generalization across new tasks. Using a multi-scenario memory buffer, the agent efficiently organizes and replays task memories, which strengthens its ability to adapt quickly to new environments and mitigates catastrophic forgetting. Our work pioneers continual learning in VLN agents, introducing a novel experimental setup and evaluation metrics. We demonstrate the effectiveness of our approach through extensive evaluations and establish a benchmark for the VLNCL paradigm. Comparative experiments with existing continual learning and VLN methods show significant improvements, achieving state-of-the-art performance in continual learning ability and highlighting the potential of our approach for rapid adaptation while preserving prior knowledge.

Real-time Vision-Language-Navigation based on a Lite Pre-training Model

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

Knowledge distilled pre-training model for vision-language-navigation

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

Depth-Aware Vision-and-Language Navigation Using Scene Query Attention Network

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation

VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

3D Scene Graph Guided Vision-Language Pre-training

VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

Vision-and-Language Navigation via Latent Semantic Alignment Learning

Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation

BEVBert: Multimodal Map Pre-training for Language-guided Navigation

Vision-Language Navigation Policy Learning and Adaptation

Vision-Language Navigation with Continual Learning

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation

Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Vision-and-Language Navigation Generative Pretrained Transformer