Abstract: Vision-language navigation (VLN) is a critical domain within embodied intelligence, requiring agents to navigate 3D environments based on natural language instructions. Traditional VLN research has focused on improving environmental understanding and decision accuracy. However, these approaches often exhibit a significant performance gap when agents are deployed in novel environments, mainly due to the limited diversity of training data. Expanding datasets to cover a broader range of environments is impractical and costly. To address this challenge, we propose the Vision-Language Navigation with Continual Learning (VLNCL) paradigm, in which agents incrementally learn new environments while retaining previously acquired knowledge. VLNCL enables agents to maintain an environmental memory and extract relevant knowledge from it, allowing rapid adaptation to new environments while preserving existing information. We introduce a novel dual-loop scenario replay method (Dual-SR), inspired by memory replay mechanisms in the brain and integrated with VLN agents. This method consolidates past experiences and enhances generalization across new tasks. Using a multi-scenario memory buffer, the agent efficiently organizes and replays task memories, which strengthens its ability to adapt quickly to new environments and mitigates catastrophic forgetting. Our work pioneers continual learning in VLN agents, introducing a novel experimental setup and evaluation metrics. We demonstrate the effectiveness of our approach through extensive evaluations and establish a benchmark for the VLNCL paradigm. Comparative experiments with existing continual learning and VLN methods show significant improvements, achieving state-of-the-art performance in continual learning ability and highlighting the potential of our approach for rapid adaptation while preserving prior knowledge.

ESceme: Vision-and-Language Navigation with Episodic Scene Memory

Structured Scene Memory for Vision-Language Navigation

Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation

Reinforced Structured State-Evolution for Vision-Language Navigation

Vision-Dialog Navigation by Exploring Cross-modal Memory

Vision-Language Navigation with Continual Learning

DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects

Bird's-Eye-View Scene Graph for Vision-Language Navigation

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

Continual Vision-and-Language Navigation

A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation

ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

GridMM: Grid Memory Map for Vision-and-Language Navigation

OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation

MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains

Active Perception for Visual-Language Navigation

Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People