Abstract: Vision Transformers have made remarkable progress in recent years, achieving state-of-the-art performance in most vision tasks. A key component of this success is the Multi-Head Self-Attention (MHSA) module, which enables each head to learn different representations by applying the attention mechanism independently. In this paper, we empirically demonstrate that Vision Transformers can be further enhanced by overlapping the heads in MHSA. We introduce Multi-Overlapped-Head Self-Attention (MOHSA), where heads are overlapped with their two adjacent heads for queries, keys, and values, while zero-padding is employed for the first and last heads, which have only one neighboring head. Various paradigms for overlapping ratios are proposed to fully investigate the optimal performance of our approach. The proposed approach is evaluated using five Transformer models on four benchmark datasets and yields a significant performance boost. The source code will be made publicly available upon publication.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to improve Vision Transformers by introducing overlap between heads in the Multi-Head Self-Attention (MHSA) mechanism. Specifically, the authors propose a Multi-Overlapped-Head Self-Attention (MOHSA) mechanism, which enhances information exchange and improves the performance of Vision Transformers by allowing each head to overlap with its adjacent heads in queries, keys, and values.

### Main Contributions

1. **Proposing MOHSA**: The authors propose a new multi-head self-attention mechanism, MOHSA, which improves the performance of Vision Transformers by allowing the queries, keys, and values of the current head to overlap with those of adjacent heads during attention computation.
2. **Various Overlap Ratio Variants**: They designed multiple overlap ratio schemes to thoroughly investigate the optimal performance of MOHSA. These schemes include fixed overlap dimensions, increasing or decreasing overlap dimensions by layer, etc.
3. **Extensive Experimental Validation**: Extensive experiments were conducted on multiple Vision Transformer models (e.g., ViT, CaiT, Swin-Transformer) to validate the effectiveness of MOHSA. The experimental results show that MOHSA significantly improves the performance of models on multiple benchmark datasets (e.g., CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1k).

### Method Overview

1. **Multi-Head Self-Attention Mechanism (MHSA)**:
   - MHSA is the core module of the Transformer model. It divides queries, keys, and values into heads, and each head independently computes self-attention to learn a different representation.
   - In traditional MHSA, the queries, keys, and values of each head are hard-partitioned, with no information exchange between heads.
2. **Multi-Overlapped-Head Self-Attention Mechanism (MOHSA)**:
   - In MOHSA, the queries, keys, and values of each head overlap with those of adjacent heads.
   - In this way, the current head can utilize information from other heads when computing attention, thereby enhancing information exchange.
   - Since the overlap slightly increases the dimension of tokens, a linear projection is needed after concatenation to restore the dimension to its original size.
3. **Overlap Ratio**:
   - The authors designed multiple overlap ratio schemes, including fixed overlap dimensions, increasing or decreasing overlap dimensions by layer, etc.
   - Experimental results show that different overlap ratio schemes yield different performance improvements on different models and datasets.

### Experimental Results

- **CIFAR-10**: On ViT-Tiny and ViT-Small, MOHSA significantly improved the model accuracy, reaching 85.65% and 87.30%, respectively.
- **CIFAR-100**: On ViT-Tiny and ViT-Small, MOHSA also significantly improved the model accuracy, reaching 63.01% and 64.97%, respectively.
- **Tiny-ImageNet**: On ViT-Tiny and ViT-Small, MOHSA improved the accuracy by 1.61% and 1.27%, respectively.
- **ImageNet**: On multiple models, MOHSA significantly improved the model performance, especially on CaiT-xxs24, where the "inc-0 (1)" scheme improved the accuracy by 3.70%.

### Conclusion

By introducing overlap between heads, MOHSA significantly enhances the information exchange capability of Vision Transformers, thereby improving model performance on multiple datasets. The experimental results show that MOHSA is a simple and effective method that can significantly improve the performance of Vision Transformers without adding significant computational overhead.
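The overlapped partition described in the Method Overview can be illustrated with a small sketch. This is not the authors' implementation (the source code is not part of this summary). It is a minimal pure-Python illustration under stated assumptions: the function name `split_overlapped_heads` and the parameter `overlap` (channels borrowed from each adjacent head) are hypothetical, and the sketch shows only how one token's query, key, or value vector might be sliced, not the attention computation or the final linear projection.

```python
def split_overlapped_heads(x, num_heads, overlap):
    """Split one token's feature vector into heads that overlap their neighbours.

    x         : list of floats of length D (D must be divisible by num_heads)
    num_heads : number of heads H
    overlap   : channels borrowed from EACH adjacent head (hypothetical parameter)
    Returns a list of H lists, each of length D // H + 2 * overlap.
    """
    dim = len(x)
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    # Assumption of this sketch: a head borrows at most one full head width.
    if not 0 <= overlap <= head_dim:
        raise ValueError("overlap must be between 0 and head_dim")

    # Zero-pad both ends so the first and last heads, which have only one
    # neighbour, still receive a slice of the same width as the others.
    padded = [0.0] * overlap + list(x) + [0.0] * overlap

    heads = []
    for h in range(num_heads):
        start = h * head_dim  # offset into the padded vector
        heads.append(padded[start:start + head_dim + 2 * overlap])
    return heads


x = [float(i) for i in range(1, 9)]  # D = 8
heads = split_overlapped_heads(x, num_heads=4, overlap=1)
print(heads)
# [[0.0, 1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0], [4.0, 5.0, 6.0, 7.0], [6.0, 7.0, 8.0, 0.0]]

concat_dim = sum(len(h) for h in heads)
print(concat_dim)  # 16, wider than D = 8
```

With `overlap=0` the function reduces to the hard partition of standard MHSA. With a positive overlap, the concatenated width grows from D to H * (D / H + 2 * overlap), which is why the summary notes that a linear projection is needed after concatenation to restore the original dimension.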

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention
