A Survey on Visual Transformer

Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao
DOI: https://doi.org/10.1109/TPAMI.2022.3152247
2023-07-10
Abstract: Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similar to or better than other types of networks such as convolutional and recurrent neural networks. Given its high performance and less need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
Following the success of Transformers in natural language processing (NLP), researchers have begun to explore how to apply them to computer vision (CV) tasks. This paper addresses the need for a comprehensive review of that work: it surveys Vision Transformers across different visual tasks, including backbone networks, high-level/intermediate-level vision, low-level vision, and video processing, and analyzes the advantages and disadvantages of these models. It also discusses the challenges Vision Transformers currently face and proposes future research directions.

### The main contents of the paper include:

1. **Background Introduction**:
   - Deep neural networks (DNNs) have become the basic architecture of today's artificial intelligence systems.
   - Different types of networks suit different tasks, such as multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
   - Transformers are a newer type of neural network, built mainly on the self-attention mechanism, and have achieved remarkable breakthroughs in natural language processing tasks.
2. **Development of Vision Transformers**:
   - Inspired by the success of Transformers in NLP, researchers began to apply them to computer vision tasks.
   - Early works such as ViT (Vision Transformer) applied a pure Transformer directly to image classification and achieved performance comparable to, or even better than, that of CNNs.
   - Many variants and improved Vision Transformers have since been proposed to enhance local feature extraction, optimize the self-attention mechanism, and design new network architectures.
3. **Applications of Vision Transformers**:
   - **Backbone Networks**: Models such as ViT, TNT, and Swin are used for image classification.
   - **High-level/Intermediate-level Vision**: Tasks including object detection (such as DETR) and semantic segmentation (such as Max-DeepLab).
   - **Low-level Vision**: Tasks such as super-resolution, image denoising, and style transfer.
   - **Video Processing**: Tasks such as video inpainting and video caption generation.
4. **Technical Details**:
   - **Self-Attention Mechanism**: Computes scores between input vectors, normalizes the scores, converts them to probability distributions, and takes a weighted sum, which gives every position in the input sequence a global interaction with every other position.
   - **Multi-Head Attention Mechanism**: Uses multiple different representation subspaces to improve the performance of the self-attention layer.
   - **Feed-Forward Network**: Applied after the self-attention layer in each encoder and decoder; it contains two linear transformation layers and a nonlinear activation function.
   - **Residual Connections and Layer Normalization**: Strengthen information flow and improve model performance.
5. **Future Research Directions**:
   - **Model Compression and Efficiency**: Reduce the computational complexity of Transformers to make them more suitable for practical devices.
   - **Local Feature Extraction**: Further enhance the ability of Transformers to extract local features.
   - **New Network Architectures**: Explore Transformer architectures better suited to visual tasks.

In general, through a comprehensive review of Vision Transformers, this paper summarizes current research progress and points out possible future research directions, providing a valuable reference for researchers in computer vision.
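The four self-attention steps listed under Technical Details (score, normalize, convert to probabilities, weighted sum) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. It assumes the common scaled dot-product form, in which scores are dot products and normalization divides by the square root of the key dimension. The names `softmax` and `self_attention` and the toy `tokens` input are hypothetical, and the learned query/key/value projections of a real Transformer layer are omitted, so the same vectors serve as queries, keys, and values.

```python
import math


def softmax(row):
    """Turn a list of scores into a probability distribution."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Assumption: learned projection matrices are left out, so the caller
    passes the query, key, and value vectors directly.
    """
    d_k = len(keys[0])
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Step 1: score the query against every key (dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Step 2: normalize the scores by sqrt(d_k).
        scaled = [s / math.sqrt(d_k) for s in scores]
        # Step 3: convert the scores to a probability distribution.
        probs = softmax(scaled)
        # Step 4: weighted sum of the value vectors.
        outputs.append(
            [sum(p * v[j] for p, v in zip(probs, values)) for j in range(value_dim)]
        )
    return outputs


# Toy input: three 2-dimensional "tokens" attending to one another.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each output row mixes all three value vectors, which is the global interaction the summary describes. Multi-head attention would run several such functions in parallel on different projections of the input and concatenate the results.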