Abstract: Despite some successful applications to goal-driven navigation, existing deep reinforcement learning (DRL)-based approaches notoriously suffer from poor data efficiency. One reason is that the goal information is decoupled from the perception module and introduced directly as a condition of decision-making, so that goal-irrelevant features of the scene representation play an adversarial role during learning. In light of this, we present a novel Goal-guided Transformer-enabled reinforcement learning (GTRL) approach that feeds the physical goal states to the scene encoder, guiding the scene representation to couple with the goal information and enabling efficient autonomous navigation. More specifically, we propose a novel variant of the Vision Transformer, the Goal-guided Transformer (GoT), as the backbone of the perception system, and pre-train it with expert priors to boost data efficiency. Subsequently, a reinforcement learning algorithm is instantiated for the decision-making system, taking the goal-oriented scene representation from the GoT as input and generating decision commands. As a result, our approach encourages the scene representation to concentrate mainly on goal-relevant features, which substantially improves the data efficiency of the DRL learning process and leads to superior navigation performance. Both simulation and real-world experimental results demonstrate the superiority of our approach in terms of data efficiency, performance, robustness, and sim-to-real generalization, compared with other state-of-the-art (SOTA) baselines.
The demonstration video (https://www.youtube.com/watch?v=aqJCHcsj4w0) and the source code (https://github.com/OscarHuangWind/DRL-Transformer-SimtoReal-Navigation) are also provided.
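
The core idea in the abstract, feeding the physical goal state into the scene encoder rather than attaching it only at the decision stage, can be illustrated with a small sketch. The code below is not the authors' GoT implementation; it is a minimal single-head, single-layer illustration under stated assumptions: the goal is embedded as one extra token that attends jointly with the image-patch tokens, and that token's output is taken as the goal-oriented scene representation. The names `init_params` and `goal_guided_encode`, the two-dimensional goal vector (for example, distance and heading), and all sizes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(patch_dim, goal_dim, d_model, seed=0):
    # Random projection matrices (hypothetical; a real model would learn these).
    rng = np.random.default_rng(seed)
    shapes = {
        "W_patch": (patch_dim, d_model),  # embeds flattened image patches
        "W_goal": (goal_dim, d_model),    # embeds the physical goal state
        "W_q": (d_model, d_model),
        "W_k": (d_model, d_model),
        "W_v": (d_model, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}


def goal_guided_encode(patches, goal, params):
    """Encode a scene with the goal as an extra token (illustrative assumption).

    patches: (N, patch_dim) flattened image patches
    goal:    (goal_dim,) physical goal state
    Returns the goal token's output feature and the full attention matrix.
    """
    patch_tokens = patches @ params["W_patch"]                    # (N, D)
    goal_token = (goal @ params["W_goal"])[None, :]               # (1, D)
    tokens = np.concatenate([goal_token, patch_tokens], axis=0)   # (N + 1, D)

    q = tokens @ params["W_q"]
    k = tokens @ params["W_k"]
    v = tokens @ params["W_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))                # (N + 1, N + 1)
    out = attn @ v

    # Row 0 belongs to the goal token: a scene summary weighted by goal relevance.
    return out[0], attn


rng = np.random.default_rng(1)
patches = rng.normal(size=(16, 48))  # 16 patches, each 4x4x3 flattened
params = init_params(patch_dim=48, goal_dim=2, d_model=32)

feat_a, attn_a = goal_guided_encode(patches, np.array([1.0, 0.5]), params)
feat_b, _ = goal_guided_encode(patches, np.array([3.0, -1.2]), params)
print(feat_a.shape, attn_a.shape)  # (32,) (17, 17)
```

In this sketch the same image yields different features for different goals, because the goal token takes part in attention inside the encoder. A decision-making policy would then consume the returned feature; the pre-training with expert priors and the reinforcement learning algorithm described in the abstract are outside the scope of this example.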

VME-Transformer: Enhancing Visual Memory Encoding for Navigation in Interactive Environments

Transformer Memory for Interactive Visual Navigation in Cluttered Environments

A Global-Memory-Aware Transformer for Vision-and-Language Navigation

MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation

NavFormer: A Transformer Architecture for Robot Target-Driven Navigation in Unknown and Dynamic Environments

Spatially-Aware Transformer for Embodied Agents

Learning multimodal adaptive relation graph and action boost memory for visual navigation

History Aware Multimodal Transformer for Vision-and-Language Navigation

Vision-and-Language Navigation Generative Pretrained Transformer

TransNav: spatial sequential transformer network for visual navigation

A transformer-based deep reinforcement learning approach to spatial navigation in a partially observable Morris Water Maze

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

Echo-Enhanced Embodied Visual Navigation

Learning Navigational Visual Representations with Semantic Map Supervision

Causality-Aware Transformer Networks for Robotic Navigation

MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains

Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

Goal-Guided Transformer-Enabled Reinforcement Learning for Efficient Autonomous Navigation

Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

Visionary: vision-aware enhancement with reminding scenes generated by captions via multimodal transformer for embodied referring expression