REVE-CE: Remote Embodied Visual Referring Expression in Continuous Environment

Xinghang Li, Di Guo, Huaping Liu, Fuchun Sun
DOI: https://doi.org/10.1109/lra.2022.3141150
2022-01-01
Abstract: It has always been a great challenge for a robot to navigate in the visual world by following natural language instructions. Recently, several tasks, such as Vision-and-Language Navigation (VLN) and Remote Embodied Visual Referring Expression in Real Indoor Environments (REVERIE), have been proposed to address this challenge. The most significant difference between the VLN and REVERIE tasks is that REVERIE uses instructions at a higher guidance level. However, the navigation process of REVERIE is implemented in a discrete environment, which is unrealistic in real-world scenarios. To make the REVERIE task more consistent with the real physical world, we develop a new task of Remote Embodied Visual Referring Expression in Continuous Environment, namely REVE-CE, in which the agent executes a much longer sequence of low-level actions given language instructions. Furthermore, we propose a multi-branch cross-modal attention (MBCMA) framework to solve the proposed REVE-CE task. Extensive experiments demonstrate that the proposed framework greatly outperforms the state-of-the-art VLN baselines, and a new benchmark for the proposed REVE-CE task is established.