Vision-and-Language Navigation via Latent Semantic Alignment Learning

Siying Wu, Xueyang Fu, Feng Wu, Zheng-Jun Zha
DOI: https://doi.org/10.1109/tmm.2024.3358112
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Vision-and-Language Navigation (VLN) requires an agent to comprehensively understand the given instructions and the immediate visual information obtained from the environment, so that it can take correct actions to reach the navigation goal. Semantic alignment across modalities is therefore crucial for the agent to understand its own state during navigation. However, the potential of semantic alignment has not been systematically explored in current studies, which limits further improvement of navigation performance. To address this issue, we propose a new Latent Semantic Alignment Learning method to develop the semantically aligned relationships contained in the environment. Specifically, we introduce three novel pre-training tasks: Trajectory-conditioned Masked Fragment Modeling, Action Prediction of Masked Observation, and Hierarchical Triple Contrastive Learning. The first two tasks are used to reason about cross-modal dependencies, while the third learns semantically consistent representations across modalities. In this way, the Latent Semantic Alignment Learning method establishes a consistent perception of the environment and makes the agent's actions easier to explain. Experiments on common benchmarks verify the effectiveness of the proposed method. For example, we improve the Success Rate by 1.6% on the R2R validation unseen set and by 4.3% on the R4R validation unseen set over the baseline model.
computer science, information systems, telecommunications, software engineering