You Only Need Less Attention at Each Stage in Vision Transformers

Shuoxi Zhang, Hanpeng Liu, Stephen Lin, Kun He
2024-06-01
Abstract: The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among patchified image tokens. While self-attention modules empower ViTs to capture long-range dependencies, the computational complexity grows quadratically with the number of tokens, which is a major hindrance to the practical application of ViTs. Moreover, the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly, we argue against the necessity of computing the attention scores in every layer, and we propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that leverage the previously calculated attention scores. This novel approach can mitigate two primary issues plaguing traditional self-attention modules: the heavy computational burden and attention saturation. Our proposed architecture offers superior efficiency and ease of implementation, merely requiring matrix multiplications that are highly optimized in contemporary deep learning frameworks. Moreover, our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two problems that Vision Transformers (ViTs) face when processing images: high computational complexity and attention saturation. Although ViTs capture global information through self-attention modules, the computational cost grows quadratically with the number of tokens, and in deep networks the attention matrices change little from one layer to the next (attention saturation). The authors propose a new architecture, the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage and reuses the previously computed attention scores to generate the feature alignments for subsequent layers. This reduces both the computational burden and the attention saturation of the traditional self-attention mechanism.

To address these issues, LaViT introduces two innovations:
1. Less-Attention layers: Compute traditional self-attention in the early layers of each stage, then efficiently generate attention scores for the subsequent layers by transforming the computed results, avoiding redundant and expensive computations.
2. Attention Residual connections: Carry the attention relationships learned in the previous stage across the downsampling operation between stages, using an attention downsampling module, so that important semantic information is preserved while the computational burden is reduced.

In addition, the paper proposes a new loss function, the Diagonality Preserving loss, which maintains the basic characteristics of the attention matrix during the transformation and ensures that the matrix still reflects the relative importance between input tokens. Experimental results show that LaViT performs well in tasks such as image classification, detection, and segmentation.
Compared to existing state-of-the-art ViT variants, LaViT achieves higher efficiency while maintaining or improving performance, reducing both floating-point operations (FLOPs) and memory consumption.
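
The idea of reusing attention scores can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the two mixing matrices `theta` and `psi`, and the exact form of the transformation (a left and right matrix product followed by a row-wise softmax) are assumptions chosen only to show that a later layer can obtain attention scores from earlier ones using matrix multiplications alone, without recomputing the query-key dot products.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, so every row of the result sums to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def full_attention(q, k, v):
    """Standard scaled dot-product attention: cost is quadratic in token count."""
    scale = 1.0 / math.sqrt(len(q[0]))
    logits = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    scores = softmax_rows(logits)
    return scores, matmul(scores, v)

def less_attention(prev_scores, v, theta, psi):
    """Derive new scores from prev_scores instead of computing Q K^T again.

    ASSUMPTION: theta and psi are hypothetical learned N x N matrices, and
    softmax(theta @ prev_scores @ psi) is an illustrative transformation,
    not necessarily the one used in the paper.
    """
    mixed = matmul(matmul(theta, prev_scores), psi)
    scores = softmax_rows(mixed)
    return scores, matmul(scores, v)

def random_matrix(rows, cols):
    """Small random matrix standing in for real token features."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def identity(n):
    """Identity matrix, used here as a placeholder for learned weights."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Usage: one full attention layer, followed by one layer that reuses its scores.
random.seed(0)
n_tokens, dim = 4, 3
q, k, v = (random_matrix(n_tokens, dim) for _ in range(3))
scores, _ = full_attention(q, k, v)
reused_scores, reused_out = less_attention(scores, v, identity(n_tokens), identity(n_tokens))
```

In this sketch only the first layer touches the queries and keys; the second layer needs just the stored score matrix and the values, which is the saving the summary describes.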