Learning multimodal adaptive relation graph and action boost memory for visual navigation
Jian Luo, Bo Cai, Yaoxiang Yu, Aihua Ke, Kang Zhou, Jian Zhang
DOI: https://doi.org/10.1016/j.aei.2024.102678
IF: 8.8
2024-07-19
Advanced Engineering Informatics
Abstract: The task of visual navigation (VN) is to steer an agent to a target object using only visual perception. Previous works largely exploit multimodal information (e.g., visual input and training memory) to improve environmental perception, while making less effort to leverage interchange information. Moreover, multimodal fusion tends to ignore data dependencies (favoring a subset of the modalities) as well as the supervision of the action. In this work, we present a novel multimodal graph learning (MGL) structure for VN, which consists of three parts. (1) The multimodal fusion exploits rich spatial, RGB, and depth information about objects' locations, as well as semantic information about their categories. (2) The adaptive relation graph (ARG) is dynamically built using object detectors; it encodes the multimodal fusion and adapts to novel environments. It embeds the agent's navigation history and other useful task-oriented structural information, giving the agent the ability to form associations and make informed decisions. (3) The action boost module (ABM) assists the agent in making intelligent decisions by predicting more accurate actions from beneficial training experience. Our agent can foresee what the goal state may look like and how to get closer to that state. This combination of the "what" and the "how" allows the agent to navigate to the target object effectively. We validate our approach on the AI2-THOR dataset, where it achieves 24.2% and 23.7% increases in SPL (Success weighted by Path Length) and SR (Success Rate), respectively, compared with baselines. Code and datasets can be found at https://github.com/luosword/ABM_VN .
engineering, multidisciplinary; computer science, artificial intelligence
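
The abstract reports results in SR and SPL without defining them. The sketch below shows how these two metrics are commonly computed in the navigation literature: SR is the fraction of successful episodes, and SPL weights each success by the ratio of the shortest path length to the length of the path actually taken. The function names and the sample episode values are illustrative assumptions, not taken from the paper or its repository, and the paper's own evaluation code may differ in detail.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    if not successes:
        raise ValueError("at least one episode is required")
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest path length and p is the length of the path the agent took.
    Failed episodes contribute 0. Assumes every shortest length is > 0.
    """
    if not successes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Hypothetical episodes (made-up values for illustration only).
episode_success = [True, False, True, True]
episode_shortest = [4.0, 6.0, 5.0, 3.0]
episode_taken = [8.0, 10.0, 5.0, 6.0]

sr = success_rate(episode_success)
spl_value = spl(episode_success, episode_shortest, episode_taken)
print(f"SR = {sr:.2f}, SPL = {spl_value:.2f}")
```

Because a successful episode scores 1 only when the agent follows a shortest path, SPL never exceeds SR; the gap between them indicates how much longer the agent's paths are than necessary.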