TransNav: spatial sequential transformer network for visual navigation
Kang Zhou, Huyin Zhang, Fei Li
DOI: https://doi.org/10.1093/jcde/qwac084
2022-08-31
Journal of Computational Design and Engineering
Abstract: The visual navigation task is to steer an embodied agent toward a given target based on its observations. An effective transformation from the agent's observations to a visual representation determines the navigation actions and promotes a more informed navigation policy. In this work, we propose a spatial sequential transformer network (SSTNet) for learning informative visual representations in deep reinforcement learning. SSTNet is composed of a spatial attention probability fused model (SAF) and a sequential transformer network (STNet). SAF fuses the cross-modal state into visual clues in reinforcement learning. It encodes semantic information about observed objects, as well as spatial information about their locations, jointly exploiting image inter-relations. STNet generates (imagines) the next observations and infers actions from the aspects most relevant to the target. It decodes the image intra-relations. In this way, the agent learns to understand the causality between navigation actions and dynamic changes in observations. SSTNet conditions an auto-regressive model on the desired reward, past states, actions, and a knowledge graph. The whole navigation framework considers local and global visual information, as well as temporal sequence information. Thus, it allows the agent to navigate towards the sought-after object effectively. We evaluate our model on the AI2THOR framework and show that our method attains at least a $10\%$ improvement in average success rate over most state-of-the-art models. Code and datasets can be found at https://github.com/zhoukang123/SDTNet_2022.
computer science, interdisciplinary applications; engineering, multidisciplinary