Visionary: vision-aware enhancement with reminding scenes generated by captions via multimodal transformer for embodied referring expression

Zhengwu Yuan, Peixian Tang, Xinguang Sang, Fan Zhang, Zheqi Zhang
DOI: https://doi.org/10.1007/s00371-024-03469-1
IF: 2.835
2024-06-01
The Visual Computer
Abstract: Embodied referring expression (REVERIE) is a challenging task that requires an embodied agent to autonomously navigate in an unseen environment and locate the target object specified in the given instruction. One main challenge in this task is the scarcity of data, which leads to low generalization ability and poor performance of the agent in unseen environments. To address these issues, we propose VISIONARY (VISION-Aware enhancement with Reminding scenes generated by captions), which leverages advanced pre-trained models as a source of common sense and generates additional valuable information from both linguistic and visual aspects based on the embodied agent's visual input to guide the navigation decision-making process. Specifically, the reminding scene generation mechanism is proposed to describe the observed scene in detail and generate corresponding reminding scenes, which can effectively enrich the input and serve as a supplement to the training data. Additionally, the caption-aware module and the adaptive fusion module are proposed to inject the generated scene description and the reminding scene, respectively, into the model, thereby enhancing the navigation efficiency and generalization ability of the agent. Extensive experiments conducted on the REVERIE benchmark demonstrate the effectiveness of our proposed methods, achieving improvements of 2.34% and 2.32% on the key metrics SPL and RGSPL, respectively, in unseen environments compared to the previous state-of-the-art method. The code is available at https://github.com/tpxbps/visionary.
computer science, software engineering