Abstract: This paper addresses the challenge of fine-grained alignment in Vision-and-Language Navigation (VLN) tasks, where robots navigate realistic 3D environments based on natural language instructions. Current approaches use contrastive learning to align language with visual trajectory sequences, but they struggle with fine-grained vision negatives. To enhance cross-modal embeddings, we introduce a novel Bayesian Optimization-based adversarial optimization framework for creating fine-grained contrastive vision samples. To validate the proposed methodology, we conduct a series of experiments assessing the effectiveness of the enriched embeddings on fine-grained vision negatives. Experiments on two common VLN benchmarks, R2R and REVERIE, demonstrate that these embeddings benefit navigation and can lead to a promising performance enhancement. Our source code and trained models are available at https://anonymous.4open.science/r/FGVLN

What problem does this paper attempt to address?

This paper addresses the challenge of **fine-grained alignment** in the Vision-and-Language Navigation (VLN) task, in which robots navigate realistic 3D environments according to natural language instructions. Current methods mainly use contrastive learning to align language with visual trajectory sequences, but they struggle with fine-grained visual negative samples.

#### Main problems:

1. **Generation of fine-grained visual negative samples**: Existing methods generate only coarse-grained visual negative samples, which limits how much they can improve the model.
2. **Improvement of the quality of cross-modal embeddings**: Enhancing cross-modal embeddings (i.e., visual and linguistic representations) requires higher-quality, fine-grained visual negative samples.
3. **Overall performance improvement of the navigation task**: Better embeddings are expected to translate into better navigation performance.

#### Solutions:

The authors introduce an adversarial optimization framework based on Bayesian Optimization (BO) for generating fine-grained contrastive visual samples. The framework iteratively finds the frames that have the greatest impact on the model's prediction and replaces those frames to form fine-grained visual negative samples. Training against these negatives lets the model capture fine-grained visual information, which in turn improves navigation performance.

#### Experimental verification:

To verify the effectiveness of the proposed method, the authors conducted experiments on two common VLN benchmark datasets, R2R and REVERIE. The results show that the fine-grained embeddings produced by this method improve navigation performance, especially in path selection and instruction following.

### Summary:

The core problem of this paper is to improve the quality of cross-modal embeddings in the Vision-and-Language Navigation task by generating fine-grained visual negative samples, and thereby to improve overall navigation performance. The authors propose an adversarial training framework based on Bayesian Optimization for this purpose and report performance improvements on both benchmark datasets.
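The frame-replacement idea described under Solutions can be illustrated with a small sketch. This is not the authors' implementation: a greedy search stands in for the Bayesian Optimization step, the model is replaced by a toy cosine-similarity scorer, and every name here (`alignment_score`, `frame_impact`, `make_fine_grained_negative`, the `distractor` frame) is a hypothetical introduced only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def alignment_score(instruction: Vector, trajectory: Sequence[Vector]) -> float:
    """Toy stand-in for the model: mean instruction-to-frame similarity."""
    return sum(cosine(instruction, frame) for frame in trajectory) / len(trajectory)


def frame_impact(instruction: Vector, trajectory: Sequence[Vector],
                 index: int, distractor: Vector) -> float:
    """Drop in alignment score when the frame at `index` is swapped out."""
    perturbed = list(trajectory)
    perturbed[index] = distractor
    return (alignment_score(instruction, trajectory)
            - alignment_score(instruction, perturbed))


def make_fine_grained_negative(
    instruction: Vector,
    trajectory: Sequence[Vector],
    distractor: Vector,
    num_replace: int = 1,
) -> Tuple[List[Vector], List[int]]:
    """Greedily replace the most influential frames, one per iteration.

    Assumption: the paper searches with Bayesian Optimization; this sketch
    uses an exhaustive greedy search over frame positions instead.
    """
    negative = list(trajectory)  # copy, so the positive sample is untouched
    replaced: List[int] = []
    for _ in range(min(num_replace, len(negative))):
        candidates = [i for i in range(len(negative)) if i not in replaced]
        best = max(
            candidates,
            key=lambda i: frame_impact(instruction, negative, i, distractor),
        )
        negative[best] = distractor
        replaced.append(best)
    return negative, replaced
```

Because only a few frames differ from the original trajectory, the resulting sample is a hard negative: it stays close to the positive but scores lower against the instruction. In practice the scorer would be the trained cross-modal model and the replacement frames would come from real observations rather than a fixed distractor vector.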

Fine-Grained Alignment in Vision-and-Language Navigation through Bayesian Optimization
