Abstract: Astounding results from Transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks, e.g., long short-term memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases in their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks, and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision, including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition and video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques in terms of both architectural design and experimental value. Finally, we provide an analysis of open research directions and possible future work.
We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
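
As an illustration of the set-function remark above, the following is a minimal sketch of scaled dot-product self-attention, not the formulation of any particular model covered by the survey. It assumes identity query, key, and value projections and no positional encoding; practical Transformers use learned projections and usually add positional information. The names `self_attention`, `tokens`, and `perm` are hypothetical and chosen only for this example. Under these assumptions, permuting the input tokens permutes the outputs in the same way, and each output attends to all tokens at once rather than step by step as in a recurrent network.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention.

    Assumption for illustration: Q = K = V = tokens (identity projections),
    and no positional encoding is added.
    """
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Similarity of this query to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]  # hypothetical reordering of the input tokens

out = self_attention(tokens)
out_perm = self_attention([tokens[i] for i in perm])

# Without positional encodings, reordering the inputs reorders the outputs
# identically (up to floating-point rounding).
equivariant = all(
    math.isclose(out_perm[k][j], out[perm[k]][j], abs_tol=1e-12)
    for k in range(len(perm))
    for j in range(len(tokens[0]))
)
print(equivariant)
```

Strictly speaking, the property shown is permutation equivariance; following the layer with an order-independent pooling step (such as a mean over tokens) would give a permutation-invariant function of the input set.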
