Abstract: Existing work on vision and language navigation mainly relies on navigation-related losses to establish the connection between vision and language modalities, neglecting aspects of helping the navigation agent build a deep understanding of the visual environment. In our work, we provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions. The hint generator assists the navigation agent in developing a global understanding of the visual environment. It directs the agent's attention toward related navigation details, including the relevant sub-instruction, potential challenges in recognition and ambiguities in grounding, and the targeted viewpoint description. To train the hint generator, we construct a synthetic dataset based on landmarks in the instructions and visible and distinctive objects in the visual environment. We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics. The experimental results demonstrate that generating hints not only enhances the navigation performance but also helps improve the interpretability of the agent's actions.

What problem does this paper attempt to address?

This paper addresses a gap in vision-and-language navigation: existing methods rely mainly on navigation-related losses to connect the visual and language modalities, and neglect helping the navigation agent build an in-depth understanding of the visual environment. Specifically, most existing studies supervise the learning of this connection through navigation performance (such as the distance to the destination, direction selection, and the similarity between a given instruction and the trajectory), which does not directly promote comprehensive learning of textual and visual semantics. Such understanding is crucial not only for completing navigation tasks but also for communicating effectively with humans. For example, the agent should be able to locate its navigation progress from the current visual view. It also needs to examine the environment from a global perspective to determine whether the navigable viewpoints contain the relevant landmarks or whether the instruction is ambiguous. In any case, the agent should be able to describe its target viewpoint. Navigation-related signals alone are unlikely to yield this understanding, so intermediate guidance is required.

To this end, the authors introduce a hint generator named NavHint, which generates visual descriptions as indirect supervision to help the navigation agent better understand the visual environment. At each navigation step, the hint generator produces a visual description that is consistent with the agent's action decision. The hints follow the logic behind the navigation process and cover three aspects: sub-instructions, landmark ambiguity, and target unique objects. Specifically:

1. **Sub-instructions**: Encourage the agent to report its navigation progress by specifying which part of the instruction (the sub-instruction) should be executed given the current visual environment.
2. **Landmark ambiguity**: Guide the agent to examine the entire environment from a global perspective and identify the landmarks mentioned in the instruction across all candidate viewpoints. The agent needs to identify potential challenges, evaluate the visibility of the landmarks, and compare the landmarks shared by multiple viewpoints.
3. **Target unique objects**: When such challenges are present, guide the agent to describe the distinctive visual objects that appear only in the target viewpoint. This helps the agent understand the details of the selected viewpoint in depth while comparing it with the other candidate viewpoints from a global perspective.

In this way, the hint generator not only improves navigation performance but also enhances the interpretability of the agent's behavior.
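The three hint aspects above can be illustrated with a small sketch. It is not the authors' implementation: the `Viewpoint` structure, the `build_hint` function, the sentence templates, and the idea of representing each viewpoint as a set of object labels are all assumptions made for illustration. It only shows how a hint could be assembled from a sub-instruction, the instruction's landmarks, and the objects visible in each candidate viewpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewpoint:
    """Hypothetical candidate viewpoint: a name plus the object labels visible from it."""
    name: str
    objects: frozenset


def build_hint(sub_instruction, landmarks, candidates, target_name):
    """Compose a hint with three parts (all templates are illustrative assumptions):
    1) the sub-instruction to execute,
    2) landmark ambiguity across candidate viewpoints,
    3) objects unique to the target viewpoint, only if a challenge was found.
    """
    target = next(v for v in candidates if v.name == target_name)
    others = [v for v in candidates if v.name != target_name]

    parts = [f"Sub-instruction: {sub_instruction}."]

    # Landmark ambiguity: a landmark is a challenge if it is visible from
    # no candidate viewpoint or from more than one.
    challenge = False
    for landmark in landmarks:
        holders = [v.name for v in candidates if landmark in v.objects]
        if not holders:
            parts.append(f"The {landmark} is not visible from any viewpoint.")
            challenge = True
        elif len(holders) > 1:
            parts.append(f"The {landmark} is visible from {len(holders)} viewpoints.")
            challenge = True

    # Target unique objects: labels seen in the target but in no other candidate.
    if challenge:
        seen_elsewhere = set().union(*(v.objects for v in others))
        unique = sorted(target.objects - seen_elsewhere)
        if unique:
            parts.append(f"Only the target viewpoint contains: {', '.join(unique)}.")

    return " ".join(parts)


if __name__ == "__main__":
    candidates = [
        Viewpoint("v1", frozenset({"table", "lamp"})),
        Viewpoint("v2", frozenset({"table", "sofa"})),
        Viewpoint("v3", frozenset({"door"})),
    ]
    print(build_hint("walk past the table", ["table"], candidates, "v2"))
```

In this toy case the landmark "table" appears in two candidate viewpoints, so the hint flags the ambiguity and then names the object that only the target viewpoint contains. When every landmark is visible from exactly one viewpoint, the sketch emits only the sub-instruction part, mirroring the description that unique objects are reported when challenges are present.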

NavHint: Vision and Language Navigation Agent with a Hint Generator

Boosting Vision-and-Language Navigation with Direction Guiding and Backtracing

Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

Diagnosing Vision-and-Language Navigation: What Really Matters

ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

Active Visual Information Gathering for Vision-Language Navigation

Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations

Narrowing the Gap between Vision and Action in Navigation

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

Self-Monitoring Navigation Agent Via Auxiliary Progress Estimation

Enhancing Vision and Language Navigation with Prompt-based Scene Knowledge

Towards Navigation by Reasoning over Spatial Configurations

Active Perception for Visual-Language Navigation

Vision Language Navigation with Multi-granularity Observation and Auxiliary Reasoning Tasks

Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory

VLN-Trans: Translator for the Vision and Language Navigation Agent

VLAI: Exploration and exploitation based on visual-language aligned information for robotic object goal navigation

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

LangNav: Language as a Perceptual Representation for Navigation