Abstract: Vision-and-Language Navigation (VLN) requires the agent to follow language instructions to navigate through 3D environments. One main challenge in VLN is the limited availability of photorealistic training environments, which makes it hard to generalize to new and unseen environments. To address this problem, we propose PanoGen, a generation method that can potentially create an infinite number of diverse panoramic environments conditioned on text. Specifically, we collect room descriptions by captioning the room images in existing Matterport3D environments, and leverage a state-of-the-art text-to-image diffusion model to generate the new panoramic environments. We use recursive outpainting over the generated images to create consistent 360-degree panorama views. Our new panoramic environments share similar semantic information with the original environments by conditioning on text descriptions, which ensures the co-occurrence of objects in the panorama follows human intuition, and creates enough diversity in room appearance and layout with image outpainting. Lastly, we explore two ways of utilizing PanoGen in VLN pre-training and fine-tuning. We generate instructions for paths in our PanoGen environments with a speaker built on a pre-trained vision-and-language model for VLN pre-training, and augment the visual observation with our panoramic environments during agents' fine-tuning to avoid overfitting to seen environments. Empirically, learning with our PanoGen environments achieves the new state-of-the-art on the Room-to-Room, Room-for-Room, and CVDN datasets. Pre-training with our PanoGen speaker data is especially effective for CVDN, which has under-specified instructions and needs commonsense knowledge. Lastly, we show that the agent can benefit from training with more generated panoramic environments, suggesting promising results for scaling up the PanoGen environments.

What problem does this paper attempt to address?

This paper addresses the limited supply of training environments for Vision-and-Language Navigation (VLN). Most existing VLN datasets are built on Matterport3D environments, which are few in number and hard to expand, so models generalize poorly to new, unseen environments. To address this, the authors propose PanoGen, a text-conditioned generation method that can potentially create an unlimited number of diverse panoramic environments. The generated environments increase the diversity of the training data and improve generalization to new environments. The method has three main parts:

1. **Collect room descriptions**:
   - Use the pre-trained vision-language model BLIP-2 to caption the room images in the Matterport3D dataset, producing detailed room descriptions.
   - Each panorama is discretized into 36 views, and a separate description is generated for each view.
2. **Generate panoramas**:
   - Use a state-of-the-art text-to-image diffusion model (such as Stable Diffusion) to generate a single view from a text description.
   - To keep the generated panorama consistent across views, apply recursive outpainting: gradually extend the boundary of the generated image until the views form a coherent 360-degree panorama.
3. **Use the generated panoramic environments for training**:
   - **Pre-training stage**: Train a speaker model, built on the pre-trained vision-language model mPLUG, to generate instructions for paths in the generated environments. These instructions augment the pre-training data of the VLN agent.
   - **Fine-tuning stage**: Randomly replace part of the original visual observations with the generated panoramic environments to avoid overfitting to seen environments and improve generalization.

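
The recursive outpainting step can be sketched as a sliding window over a growing image strip. This is a minimal illustration, not the authors' implementation: columns are stand-in strings rather than pixels, the window only grows to the right, and the names `recursive_outpaint`, `fill`, `shift`, and `dummy_fill` are hypothetical. In practice `fill` would wrap a diffusion inpainting call conditioned on the caption.

```python
from typing import Callable, List, Optional, Sequence

Column = str                      # stand-in for one vertical strip of pixels
Window = List[Optional[Column]]   # None marks a column still to be generated
FillFn = Callable[[Window, str], List[Column]]


def recursive_outpaint(seed: Sequence[Column], captions: Sequence[str],
                       fill: FillFn, shift: int) -> List[Column]:
    """Extend `seed` to the right, one outpainting round per caption."""
    width = len(seed)
    if not 0 < shift < width:
        raise ValueError("shift must leave some overlap with the previous view")
    strip = list(seed)
    for caption in captions:
        # Slide the window: reuse the newest columns as context, mask the rest.
        context = strip[-(width - shift):]
        window: Window = context + [None] * shift
        filled = fill(window, caption)
        if len(filled) != width or filled[:width - shift] != context:
            raise ValueError("fill must return a full window and keep the context")
        # Only the newly generated columns are appended to the panorama strip.
        strip.extend(filled[-shift:])
    return strip


def dummy_fill(window: Window, caption: str) -> List[Column]:
    """Placeholder for an inpainting model: tags new columns with the caption."""
    return [caption if col is None else col for col in window]


if __name__ == "__main__":
    panorama = recursive_outpaint(["seed"] * 4, ["a bedroom", "a hallway"],
                                  dummy_fill, shift=2)
    print(panorama)
```

Because each round sees part of the previous result as fixed context, neighbouring views share content at their boundary, which is the property the summary attributes to recursive outpainting.
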
According to the abstract, training with PanoGen environments achieves a new state of the art on the Room-to-Room (R2R), Room-for-Room (R4R), and Cooperative Vision-and-Dialog Navigation (CVDN) datasets. Pre-training with the PanoGen speaker data is especially effective on CVDN, whose instructions are under-specified and require commonsense knowledge. The agent also benefits from training with more generated environments, which suggests that scaling up PanoGen is promising. In summary, the paper tackles the shortage of VLN training environments with a new method for generating panoramic environments, and thereby improves generalization to unseen environments.
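
The fine-tuning augmentation described above, randomly replacing part of the original observations, can be sketched as follows. This is an illustrative assumption rather than the paper's exact procedure: the summary does not state the replacement probability or its granularity, so per-viewpoint replacement with a hypothetical `replace_prob` parameter is assumed here, and `mix_observations` is an invented name.

```python
import random
from typing import Dict, List, Optional, Sequence, TypeVar

Feature = TypeVar("Feature")


def mix_observations(path: Sequence[str],
                     original: Dict[str, Feature],
                     generated: Dict[str, Feature],
                     replace_prob: float,
                     rng: Optional[random.Random] = None) -> List[Feature]:
    """Return one observation per viewpoint on `path`.

    Each viewpoint uses its generated panorama with probability
    `replace_prob` (assumed granularity), otherwise the original one.
    """
    rng = rng or random.Random()
    mixed = []
    for viewpoint in path:
        # Fall back to the original if no generated panorama exists.
        use_generated = viewpoint in generated and rng.random() < replace_prob
        mixed.append(generated[viewpoint] if use_generated else original[viewpoint])
    return mixed


if __name__ == "__main__":
    original = {"vp1": "orig-1", "vp2": "orig-2", "vp3": "orig-3"}
    generated = {"vp1": "gen-1", "vp2": "gen-2"}
    print(mix_observations(["vp1", "vp2", "vp3"], original, generated,
                           replace_prob=0.5, rng=random.Random(0)))
```

Sampling the replacement anew for every training episode means the agent sees the same path under different visual appearances, which is how this kind of augmentation discourages overfitting to the seen environments.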

PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
