PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation

Jialu Li, Mohit Bansal
2023-05-31
Abstract: Vision-and-Language Navigation (VLN) requires the agent to follow language instructions to navigate through 3D environments. One main challenge in VLN is the limited availability of photorealistic training environments, which makes it hard to generalize to new and unseen environments. To address this problem, we propose PanoGen, a generation method that can potentially create an infinite number of diverse panoramic environments conditioned on text. Specifically, we collect room descriptions by captioning the room images in existing Matterport3D environments, and leverage a state-of-the-art text-to-image diffusion model to generate the new panoramic environments. We use recursive outpainting over the generated images to create consistent 360-degree panorama views. Our new panoramic environments share similar semantic information with the original environments by conditioning on text descriptions, which ensures the co-occurrence of objects in the panorama follows human intuition, and creates enough diversity in room appearance and layout with image outpainting. Lastly, we explore two ways of utilizing PanoGen in VLN pre-training and fine-tuning. We generate instructions for paths in our PanoGen environments with a speaker built on a pre-trained vision-and-language model for VLN pre-training, and augment the visual observation with our panoramic environments during agents' fine-tuning to avoid overfitting to seen environments. Empirically, learning with our PanoGen environments achieves the new state-of-the-art on the Room-to-Room, Room-for-Room, and CVDN datasets. Pre-training with our PanoGen speaker data is especially effective for CVDN, which has under-specified instructions and needs commonsense knowledge. Lastly, we show that the agent can benefit from training with more generated panoramic environments, suggesting promising results for scaling up the PanoGen environments.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the scarcity of training environments for the Vision-and-Language Navigation (VLN) task. Most existing VLN datasets are built on Matterport3D environments, which are limited in number and difficult to expand, so models perform poorly when generalizing to new, unseen environments. To address this, the authors propose PanoGen, a text-conditioned generation method that can potentially create an unlimited number of diverse panoramic environments. The generated environments increase the diversity of the training data and improve the model's ability to generalize to new environments. The method has three main parts:

1. **Collect room descriptions**:
   - Use the pre-trained vision-language model BLIP-2 to caption the room images in the Matterport3D dataset, producing detailed room descriptions.
   - Each panorama is discretized into 36 views, and a separate description is generated for each view.
2. **Generate panoramas**:
   - Use a state-of-the-art text-to-image diffusion model (such as Stable Diffusion) to generate a single view from its text description.
   - To keep the generated panorama consistent across views, apply recursive outpainting: gradually extend the boundaries of the generated image until the views form a coherent 360-degree panorama.
3. **Use the generated panoramic environments for training**:
   - **Pre-training stage**: Train a speaker model, built on the pre-trained vision-language model mPLUG, to generate instructions for paths in the generated environments. These instructions augment the pre-training data of the VLN agent.
   - **Fine-tuning stage**: Randomly replace part of the original visual observations with the generated panoramic environments to avoid overfitting to seen environments and to improve generalization.
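The recursive outpainting in step 2 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a flat horizontal shift rather than a true camera rotation, and `inpaint_fn`, `shift`, and `num_steps` are hypothetical names standing in for a text-conditioned diffusion inpainting model and its schedule. The point is only the recursion: each new view reuses part of the previous view as fixed context and generates the rest.

```python
import numpy as np


def recursive_outpaint(first_view, num_steps, shift, inpaint_fn):
    """Extend an initial view to the right by repeated outpainting.

    first_view: (H, W, 3) array, the first text-generated view.
    num_steps:  number of outpainting steps (hypothetical parameter).
    shift:      new pixel columns revealed per step, 0 < shift < W
                (hypothetical parameter).
    inpaint_fn: callable (image, mask) -> image; a stand-in for a
                text-conditioned inpainting model (assumption). The mask
                is True where pixels must be generated.
    Returns the list of views and the concatenated panorama strip.
    """
    h, w, _ = first_view.shape
    if not 0 < shift < w:
        raise ValueError("shift must satisfy 0 < shift < view width")

    views = [first_view]
    strip = first_view.copy()
    for _ in range(num_steps):
        prev = views[-1]
        # Slide the window: the right part of the previous view becomes
        # the known left part of the new view.
        window = np.zeros_like(prev)
        window[:, : w - shift] = prev[:, shift:]
        # Only the newly revealed columns are marked for generation.
        mask = np.zeros((h, w), dtype=bool)
        mask[:, w - shift :] = True
        generated = inpaint_fn(window, mask)
        # Keep known pixels unchanged; accept only the generated region.
        new_view = np.where(mask[..., None], generated, window)
        views.append(new_view)
        strip = np.concatenate([strip, new_view[:, w - shift :]], axis=1)
    return views, strip
```

A real pipeline would also have to handle perspective when the camera rotates and close the loop so that the last view matches the first; both are left out of this sketch.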
The experimental results show that training with PanoGen environments achieves a new state of the art on the Room-to-Room (R2R), Room-for-Room (R4R), and Cooperative Vision-and-Dialog Navigation (CVDN) datasets. Pre-training with the PanoGen speaker data is especially effective on CVDN, whose instructions are under-specified and require commonsense knowledge. The agent also benefits from training with more generated environments, which suggests that scaling up PanoGen is promising. In summary, the paper addresses the shortage of training environments in existing VLN tasks by introducing a method for generating panoramic environments, and thereby improves the agent's generalization to new environments.
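The fine-tuning augmentation described above, where some original observations are randomly swapped for generated ones, might look like the following sketch. The function name, the dictionary layout, and the `replace_prob` hyperparameter are assumptions made for illustration; the summary does not state the replacement rate or the data structures used.

```python
import random


def mix_observations(original, generated, replace_prob, rng=None):
    """Randomly swap original observations for generated ones.

    original, generated: dicts mapping a viewpoint id to its observation
        (for example, image features); layout is an assumption.
    replace_prob: probability of using the generated panorama at a
        viewpoint (hypothetical hyperparameter).
    rng: optional random.Random instance for reproducibility.
    """
    if not 0.0 <= replace_prob <= 1.0:
        raise ValueError("replace_prob must be in [0, 1]")
    if rng is None:
        rng = random.Random()

    mixed = {}
    for viewpoint, observation in original.items():
        # Fall back to the original observation when no generated
        # panorama exists for this viewpoint.
        use_generated = viewpoint in generated and rng.random() < replace_prob
        mixed[viewpoint] = generated[viewpoint] if use_generated else observation
    return mixed
```

Resampling the mix at every training step, rather than once up front, would expose the agent to different combinations of original and generated views along the same path.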