Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks, e.g., long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of Transformer models in computer vision.

Transformers in Unsupervised Structure-from-Motion

Unsupervised Full Transformer for Pose, Depth and Optical Flow Joint Learning

Transformer-Based Self-Supervised Monocular Depth and Visual Odometry

Transformer Transforms Salient Object Detection and Camouflaged Object Detection

3D Vision with Transformers: A Survey

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

Transformers in Vision: A Survey

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Transformers in computational visual media: A survey

On Moving Object Segmentation from Monocular Video with Transformers

Can Transformers Capture Spatial Relations between Objects?

Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for Monocular 3D Object Detection

Transformers in Single Object Tracking: An Experimental Survey

Transformer-based stereo-aware 3D object detection from binocular images

Complete contextual information extraction for self-supervised monocular depth estimation

Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Transformer Meets Remote Sensing Video Detection and Tracking: A Comprehensive Survey

The Applications of 3D Input Data and Scalability Element by Transformer Based Methods: A Review

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured Traffic Scenarios