Abstract: Visual perception and navigation have emerged as major focus areas in the field of embodied artificial intelligence. We consider the task of image-goal navigation, where an agent is tasked to navigate to a goal specified by an image, relying only on images from an onboard camera. This task is particularly challenging since it demands robust scene understanding, goal-oriented planning and long-horizon navigation. Most existing approaches learn navigation policies reliant on recurrent neural networks trained via online reinforcement learning. However, training such policies requires substantial computational resources and time, and the performance of these models is not reliable on long-horizon navigation. In this work, we present a generative Transformer-based model that jointly models image goals, camera observations and the robot's past actions to predict future actions. We use state-of-the-art perception models and navigation policies to learn robust goal-conditioned policies without the need for real-time interaction with the environment. Our model demonstrates capability in capturing and associating visual information across long time horizons, helping in effective navigation.

What problem does this paper attempt to address?

The paper addresses how to enable a robot to navigate to a target specified by an image, without a map. Specifically, it focuses on the image-goal navigation task: the robot must reach the target location shown in a given goal image (an RGB image) while relying only on images from its onboard camera. The task is particularly challenging because it requires strong scene understanding, goal-oriented planning and long-horizon navigation.

### Main contributions of the paper

1. **Proposing a Transformer-based model**: The paper proposes a generative Transformer model that jointly models the target image, camera observations and the robot's past actions to predict future actions. Compared with traditional methods based on recurrent neural networks (RNNs), the Transformer handles long-range sequential dependencies better and can capture and associate visual information across long time horizons, which improves navigation.
2. **Behavior cloning method**: To avoid the substantial computational resources and time required by online reinforcement learning, the paper adopts behavior cloning (BC). Using trajectory data generated by a pre-trained expert agent, the model learns a robust goal-conditioned policy without real-time interaction with the environment.
3. **Experimental verification**: The paper conducts extensive experiments in the Habitat simulator to evaluate the proposed model on test sets of different difficulty levels. The results show that the model outperforms other behavior cloning models on most data splits and can also achieve good performance with smaller input image sizes.

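The behavior cloning contribution above can be illustrated with a minimal sketch. It assumes the training error is a cross-entropy (negative log-likelihood) between the model's per-step action scores and the expert's actions; the summary only says an "error" is minimized, so the loss choice, the four-action space and all function names here are illustrative assumptions rather than the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bc_loss(logits_per_step, expert_actions):
    """Mean negative log-likelihood of the expert's action at each step.

    Assumption: cross-entropy is used as the imitation error.
    """
    if len(logits_per_step) != len(expert_actions):
        raise ValueError("one expert action is needed per step")
    nll = 0.0
    for logits, action in zip(logits_per_step, expert_actions):
        nll -= math.log(softmax(logits)[action])
    return nll / len(expert_actions)

def action_accuracy(logits_per_step, expert_actions):
    """Fraction of steps where the highest-scoring action matches the expert."""
    hits = 0
    for logits, action in zip(logits_per_step, expert_actions):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        hits += int(predicted == action)
    return hits / len(expert_actions)

# Hypothetical discrete action space: 0=stop, 1=forward, 2=turn_left, 3=turn_right
example_logits = [
    [0.1, 2.0, 0.3, 0.2],  # model prefers "forward"
    [0.0, 0.5, 1.5, 0.1],  # model prefers "turn_left"
    [1.0, 0.9, 0.2, 0.1],  # model prefers "stop", expert went "forward"
]
example_expert = [1, 2, 1]

loss = bc_loss(example_logits, example_expert)
accuracy = action_accuracy(example_logits, example_expert)
```

Because the loss depends only on logged expert trajectories, it can be computed entirely offline, which is the property the summary highlights as the advantage over online reinforcement learning.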
### Key technical points

- **Input representation**: The target image and the observed images are encoded into embedding vectors by the pre-trained DINOv2 model, and the actions are encoded into embedding vectors through a lookup table. These embedding vectors are interleaved, and position embeddings are added to form the input sequence.
- **Transformer decoder**: The model adopts a 12-layer Transformer decoder; each layer contains 8 multi-head self-attention (MHSA) blocks and a feed-forward network. The decoder uses a causal self-attention mask to prevent the model from "peeking" at future actions.
- **Behavior cloning training**: The model learns the policy by minimizing the error between the generated actions and the expert actions.

### Experimental results

- **Accuracy**: Across the different test sets, the model's action accuracy is between 75% and 80%, and it performs especially well on the "Hard" difficulty trajectories.
- **Navigation performance**: In unseen environments, both the navigation success rate and the success weighted by path length (SPL) of the model are better than those of other behavior cloning models.

### Conclusion

The paper proposes a Transformer-based behavior cloning method for the image-goal navigation task. The method can effectively learn a robust navigation policy without real-time interaction with the environment, significantly improving the accuracy and efficiency of navigation.
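The input representation and causal masking described under the key technical points can be sketched as follows. This is a minimal illustration in plain Python, not the paper's code: the exact token order (goal first, then alternating observation and action), the helper names and the toy 2-dimensional vectors are all assumptions, and real embeddings would come from DINOv2 and a learned action lookup table.

```python
def interleave_tokens(goal, observations, actions):
    """Build the sequence [goal, o_1, a_1, o_2, a_2, ..., o_T].

    Assumed layout: the action at the final step is what the model
    must predict, so there is one fewer action than observations.
    """
    if len(actions) != len(observations) - 1:
        raise ValueError("expected exactly one fewer action than observations")
    sequence = [("goal", goal)]
    for t, obs in enumerate(observations):
        sequence.append(("obs", obs))
        if t < len(actions):
            sequence.append(("act", actions[t]))
    return sequence

def add_position_embeddings(vectors, position_table):
    """Add the position embedding for index i to the i-th token vector."""
    return [
        [v + p for v, p in zip(vec, position_table[i])]
        for i, vec in enumerate(vectors)
    ]

def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# Toy 2-dimensional embeddings standing in for DINOv2 / lookup-table outputs.
goal_vec = [1.0, 0.0]
obs_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
act_vecs = [[0.0, 1.0], [1.0, 1.0]]

tokens = interleave_tokens(goal_vec, obs_vecs, act_vecs)
kinds = [kind for kind, _ in tokens]

# Hypothetical position table: one small offset vector per sequence index.
positions = [[0.01 * i, 0.01 * i] for i in range(len(tokens))]
inputs = add_position_embeddings([vec for _, vec in tokens], positions)

mask = causal_mask(len(tokens))
```

With this mask, the token at index 3 (the second observation) can attend to the goal, the first observation and the first action, but not to the action that follows it, which is what stops the decoder from seeing future actions during training.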

Transformers for Image-Goal Navigation

Goal-Oriented Visual Semantic Navigation Using Semantic Knowledge Graph and Transformer

Image-Goal Navigation in Complex Environments via Modular Learning

Goal-Guided Transformer-Enabled Reinforcement Learning for Efficient Autonomous Navigation

Transformer Memory for Interactive Visual Navigation in Cluttered Environments

NavFormer: A Transformer Architecture for Robot Target-Driven Navigation in Unknown and Dynamic Environments

One-4-All: Neural Potential Fields for Embodied Navigation

Towards Target-Driven Visual Navigation in Indoor Scenes via Generative Imitation Learning

TransNav: spatial sequential transformer network for visual navigation

Cognitive Planning for Object Goal Navigation using Generative AI Models

MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation

Navigating to Objects Specified by Images

Transformers for One-Shot Visual Imitation

Last-Mile Embodied Visual Navigation

Multigoal Visual Navigation With Collision Avoidance via Deep Reinforcement Learning

Learning Deployable Navigation Policies at Kilometer Scale from a Single Traversal

Causality-Aware Transformer Networks for Robotic Navigation

Visual Representations for Semantic Target Driven Navigation

Building Intelligent Autonomous Navigation Agents

VME-Transformer: Enhancing Visual Memory Encoding for Navigation in Interactive Environments

Relation-wise transformer network and reinforcement learning for visual navigation