Depth-Aware Vision-and-Language Navigation Using Scene Query Attention Network

Sinan Tan, Mengmeng Ge, Di Guo, Huaping Liu, Fuchun Sun
DOI: https://doi.org/10.1109/icra46639.2022.9811921
2022-01-01
Abstract: Vision-and-language navigation (VLN) is an important task in robotics and computer vision. However, most existing VLN models use only features extracted from RGB observations as input, even though robots in the real world can use depth sensors. Existing research has also shown that simply adding a depth stream to neural models provides only a marginal improvement on the VLN task. We therefore develop a novel method for the VLN task that uses semantic map observations built from RGB-D input. We use vision pretraining to efficiently encode the semantic map with a CNN and a scene query attention network by answering queries about the semantic information of specific regions of a scene. The proposed method can be used with a simple model and does not require large-scale vision-language transformer pretraining, and it increases the success rate by more than 10% compared with a baseline model. When used together with the Speaker-Follower training technique, it achieves a success rate of 58% on the test set of the R2R dataset in the single-run setting, outperforming the previous RGB-D method and most existing RGB-only models that do not use large-scale vision-language transformer pretraining.