NavHint: Vision and Language Navigation Agent with a Hint Generator

Yue Zhang, Quan Guo, Parisa Kordjamshidi
2024-02-05
Abstract: Existing work on vision and language navigation mainly relies on navigation-related losses to establish the connection between vision and language modalities, neglecting aspects of helping the navigation agent build a deep understanding of the visual environment. In our work, we provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions. The hint generator assists the navigation agent in developing a global understanding of the visual environment. It directs the agent's attention toward related navigation details, including the relevant sub-instruction, potential challenges in recognition and ambiguities in grounding, and the targeted viewpoint description. To train the hint generator, we construct a synthetic dataset based on landmarks in the instructions and visible and distinctive objects in the visual environment. We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics. The experimental results demonstrate that generating hints not only enhances the navigation performance but also helps improve the interpretability of the agent's actions.
Computation and Language
What problem does this paper attempt to address?
This paper addresses a gap in vision-and-language navigation: existing methods rely mainly on navigation-related losses to connect the visual and language modalities, and do little to help the navigation agent build an in-depth understanding of the visual environment. Most prior work supervises this connection through navigation performance signals (such as the distance to the destination, direction selection, and the similarity between a given instruction and the trajectory), which do not directly promote comprehensive learning of textual and visual semantics. Such understanding matters both for completing navigation tasks and for communicating effectively with humans. For example, the agent should be able to locate its progress through the instruction from the current visual view. It also needs to examine the environment from a global perspective to determine whether the navigable viewpoints contain the relevant landmarks or whether the instruction is ambiguous. In any case, the agent should be able to describe its target viewpoint. It is difficult for an agent to acquire this understanding from navigation-related signals alone, so intermediate guidance is required.

To this end, the authors introduce NavHint, a navigation agent with a hint generator that produces visual descriptions as indirect supervision to help the agent better understand the visual environment. At each navigation step, the hint generator produces a visual description that is consistent with the agent's action decision. The hints follow the reasoning behind the navigation process and cover three aspects:

1. **Sub-instructions**: Encourage the agent to report its navigation progress by specifying which part of the instruction should be executed given the current visual environment.
2. **Landmark ambiguity**: Guide the agent to examine the entire environment from a global perspective and identify the landmarks mentioned in the instruction across all candidate viewpoints. The agent needs to identify potential challenges, evaluate the visibility of the landmarks, and compare how the landmarks shared by several viewpoints differ.
3. **Target unique objects**: When such challenges are present, guide the agent to describe the distinctive visual objects that appear only in the target viewpoint. This helps the agent gain an in-depth understanding of the selected viewpoint while comparing it with the other candidate viewpoints from a global perspective.

In this way, the hint generator not only improves navigation performance but also makes the agent's behavior more interpretable.
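
The three hint aspects can be illustrated with a small Python sketch. This is not the authors' implementation: the `Hint` class, the `build_hint` and `render_hint` functions, the representation of each candidate viewpoint as a set of object names, and the output wording are all assumptions made for illustration. The sketch flags a landmark as ambiguous when it is visible in more than one candidate viewpoint, flags it as missing when it is visible in none, and collects the objects that appear only in the target viewpoint.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Hint:
    """Hypothetical container for the three hint aspects."""
    sub_instruction: str
    ambiguous_landmarks: Dict[str, List[str]]  # landmark -> viewpoints showing it
    missing_landmarks: List[str]               # landmarks visible in no candidate
    target_unique_objects: List[str]           # objects seen only in the target


def build_hint(sub_instruction: str,
               landmarks: List[str],
               candidates: Dict[str, Set[str]],
               target: str) -> Hint:
    """Assemble a hint from per-viewpoint object sets (an assumed input format)."""
    ambiguous: Dict[str, List[str]] = {}
    missing: List[str] = []
    for landmark in landmarks:
        visible_in = sorted(vp for vp, objs in candidates.items() if landmark in objs)
        if not visible_in:
            missing.append(landmark)          # recognition challenge
        elif len(visible_in) > 1:
            ambiguous[landmark] = visible_in  # grounding ambiguity

    # Objects in the target viewpoint that no other candidate contains.
    others: Set[str] = set()
    for vp, objs in candidates.items():
        if vp != target:
            others |= objs
    unique = sorted(candidates[target] - others)
    return Hint(sub_instruction, ambiguous, missing, unique)


def render_hint(hint: Hint) -> str:
    """Turn a Hint into text; the sentence templates are illustrative only."""
    parts = [f"Sub-instruction: {hint.sub_instruction}."]
    for landmark, viewpoints in hint.ambiguous_landmarks.items():
        parts.append(f"Landmark '{landmark}' appears in several candidates: "
                     f"{', '.join(viewpoints)}.")
    for landmark in hint.missing_landmarks:
        parts.append(f"Landmark '{landmark}' is not visible in any candidate.")
    # Describe target-only objects only when a challenge was found.
    has_challenge = bool(hint.ambiguous_landmarks or hint.missing_landmarks)
    if has_challenge and hint.target_unique_objects:
        parts.append("Only the target viewpoint shows: "
                     f"{', '.join(hint.target_unique_objects)}.")
    return " ".join(parts)
```

In this sketch, the unique-object sentence is emitted only when an ambiguity or a missing landmark was detected, mirroring the description above in which target unique objects are described in the presence of challenges.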