A Global-Memory-Aware Transformer for Vision-and-Language Navigation

Le Wang, Xiaofeng Wu
DOI: https://doi.org/10.1109/ainit61980.2024.10581832
2024-01-01
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate 3D environments by following natural language instructions, exploring continuously until it reaches a specified destination. Historical navigation memory plays a crucial role in indicating current navigation progress, yet existing VLN methods often incorporate only the representation of the immediately preceding time step, which is insufficient for capturing long-term temporal context. This paper introduces a Transformer network with a global historical context memory that adaptively selects the most relevant episodes from the navigation history and encodes this global memory information to model long-term dependencies effectively. We evaluate the proposed model on the popular R2R dataset, where it achieves absolute improvements of 3.7% in success rate (SR) and 3.3% in success weighted by path length (SPL) in seen environments. In novel (unseen) environments, SR improves by 2.2% and SPL by 0.83%. These results demonstrate that leveraging historical memory significantly enhances navigation performance.
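The idea of adaptively selecting the most relevant history entries can be illustrated with a small sketch. This is not the authors' architecture, which the abstract does not specify; it assumes a plain top-k scaled dot-product relevance score between the current state and each stored step, and the names `select_global_memory`, `history`, `state`, and `k` are hypothetical.

```python
import numpy as np

def select_global_memory(query, memory, k=2):
    """Pick the k history entries most relevant to the current state and
    return their softmax-weighted summary (illustrative assumption only)."""
    # Scaled dot-product relevance of each stored step to the current state.
    scores = memory @ query / np.sqrt(query.shape[0])
    k = min(k, memory.shape[0])
    # Indices of the k highest-scoring history entries, best first.
    top = np.argsort(scores)[::-1][:k]
    # Softmax over the selected scores only (shifted for numerical stability).
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()
    # The weighted sum of the selected entries forms the memory context vector.
    return top, weights @ memory[top]

# Toy history of four 2-D step embeddings and a current state embedding.
history = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
state = np.array([1.0, 0.0])
indices, context = select_global_memory(state, history, k=2)
```

In this toy case the two steps most aligned with the current state (entries 0 and 2) are selected, and the context vector is a blend of just those two. A learned model would replace the raw dot product with trained projections and feed the context into the Transformer's decision layers.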