Vision-and-Language Navigation Generative Pretrained Transformer

Wen Hanlin
DOI: https://doi.org/10.48550/arXiv.2405.16994
2024-05-27
Abstract: In the Vision-and-Language Navigation (VLN) field, agents are tasked with navigating real-world scenes guided by linguistic instructions. Enabling the agent to adhere to instructions throughout the process of navigation represents a significant challenge within the domain of VLN. To address this challenge, common approaches often rely on encoders to explicitly record past locations and actions, increasing model complexity and resource consumption. Our proposal, the Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT), adopts a transformer decoder model (GPT2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules. This method allows for direct historical information access through trajectory sequence, enhancing efficiency. Furthermore, our model separates the training process into offline pre-training with imitation learning and online fine-tuning with reinforcement learning. This distinction allows for more focused training objectives and improved performance. Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper addresses how to enable an agent to navigate real-world scenes according to natural language instructions in the Vision-and-Language Navigation (VLN) task. Specifically, it focuses on keeping the agent faithful to the instructions throughout the entire navigation process, which is a major challenge in the VLN field. Existing methods usually rely on encoders that explicitly record past locations and actions, which increases model complexity and resource consumption. The proposed solution, the Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT), uses a GPT2-based decoder to model dependencies within the trajectory sequence, removing the need for a separate history encoding module. Because the decoder attends directly to earlier elements of the trajectory sequence, historical information is available without extra machinery, which improves efficiency. The model also divides training into two stages: offline pre-training with Imitation Learning (IL) and online fine-tuning with Reinforcement Learning (RL). This separation gives each stage a more focused training objective and improves performance. In evaluation on the VLN dataset, VLN-GPT outperforms complex encoder-based state-of-the-art models.
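The idea of reading history directly from the trajectory sequence can be illustrated with a causal attention mask, the mechanism a GPT2-style decoder relies on. The following is a minimal sketch and not the paper's implementation: the function names `causal_mask` and `causal_attention`, the single-head formulation without learned projections, and the toy two-dimensional trajectory tokens are all assumptions made for illustration.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Hypothetical simplification: no learned projections, no batching.
    Each argument is a list of equal-length float vectors, one per step.
    """
    n, d = len(queries), len(queries[0])
    mask = causal_mask(n)
    outputs = []
    for i in range(n):
        # Score step i against every step it is allowed to see (0..i).
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in range(n) if mask[i][j]
        ]
        # Numerically stable softmax over the visible steps.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return outputs

# Toy trajectory: three steps, each an assumed 2-d token embedding.
trajectory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(trajectory, trajectory, trajectory)
```

In this sketch the output at each step is a mixture of that step and all earlier steps, so past observations and actions are reachable through the sequence itself rather than through a separate history module. Changing a later token leaves the outputs of earlier steps untouched, which is the property that makes step-by-step action prediction possible.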