Abstract: In this paper, we propose a new type of vision transformer (ViT) based on graph head attention (GHA). Because the multi-head attention (MHA) of a pure ViT requires a large number of parameters and tends to lose the locality of an image, we replaced MHA with GHA by applying a graph to the attention head of the transformer. Consequently, the proposed GHA maintains both the locality and globality of the input patches and guarantees the diversity of the attention. The proposed GHA-ViT generally outperforms pure ViT-based models on the small-sized CIFAR-10/100, MNIST, and MNIST-F datasets and the medium-sized ImageNet-1K dataset when trained from scratch. A Top-1 accuracy of 81.7% was achieved for ImageNet-1K using GHA-B, which is a base model with approximately 29 M parameters. In addition, on CIFAR-10/100, the number of parameters is reduced 17-fold relative to the existing ViT and the performance is increased by 0.4/4.3%, respectively. The proposed GHA-ViT shows promising results in terms of the number of parameters and operations and the level of accuracy in comparison with other state-of-the-art lightweight ViT models.

What problem does this paper attempt to address?

The paper targets several limitations of existing Vision Transformers (ViTs) on image classification tasks:

1. **Parameter redundancy in multi-head attention (MHA)**: A conventional ViT relies on MHA, which requires a large number of parameters and tends to ignore the local structure of the image, which can degrade performance.
2. **Balance between image locality and globality**: Existing ViT models often fail to preserve local and global features at the same time. For high-resolution images in particular, the large number of self-attention operations drives up the computational cost.
3. **Performance on small-scale datasets**: ViT models usually need a large amount of training data to perform well, and their accuracy drops significantly when they are trained on small-scale datasets.

To address these problems, the authors propose GHA-ViT, a Vision Transformer based on graph head attention (GHA). The main contributions of the paper are:

- **Graph head attention (GHA)**: A graph structure is introduced into the attention heads of the Transformer and replaces conventional MHA, which reduces the number of parameters while maintaining both the locality and globality of the image.
- **Graph generation and graph attention network (GAT)**: In the graph generation step, the adjacency matrix of the graph is extracted from the attention matrix. A GAT is then applied to update the node features, which enhances the diversity and locality of the attention.
- **Lower computational cost with improved accuracy**: GHA-ViT is trained from scratch, without pre-training, on small-scale and medium-scale datasets and still achieves performance comparable to or better than that of existing state-of-the-art lightweight ViT models.

Together, these changes reduce the number of parameters and the computational complexity while delivering strong classification performance on multiple benchmark datasets.
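The graph generation and GAT steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not say how the adjacency matrix is extracted from the attention matrix, so the top-k rule with self-loops used here is an assumption, as are all function names, the single-head setup, and the toy sizes. The GAT update follows the standard single-head formulation (LeakyReLU scoring of concatenated node features, softmax restricted to graph neighbors).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise; -inf entries become 0 after exp
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(x, w_q, w_k):
    """Scaled dot-product attention scores between N patch tokens."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def adjacency_from_attention(attn, k):
    """ASSUMPTION: keep each token's k strongest attention links plus a self-loop.
    The paper may use a different extraction rule."""
    n = attn.shape[0]
    adj = np.zeros((n, n), dtype=bool)
    top = np.argsort(-attn, axis=-1)[:, :k]      # indices of the k largest scores per row
    adj[np.arange(n)[:, None], top] = True
    np.fill_diagonal(adj, True)
    return adj

def gat_layer(h, adj, w, a, slope=0.2):
    """Single-head GAT update: attention is computed only over graph neighbors."""
    z = h @ w                                    # projected node features, shape (N, F)
    f = z.shape[1]
    # e_ij = LeakyReLU(a^T [z_i || z_j]) splits into a source term and a target term
    e = (z @ a[:f])[:, None] + (z @ a[f:])[None, :]
    e = np.where(e > 0, e, slope * e)            # LeakyReLU
    e = np.where(adj, e, -np.inf)                # mask out non-neighbors
    alpha = softmax(e, axis=-1)                  # each row sums to 1 over its neighbors
    return alpha @ z, alpha

# Toy usage: 16 patch tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 16, 8
x = rng.normal(size=(n, d))
attn = attention_matrix(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
adj = adjacency_from_attention(attn, k=4)
out, alpha = gat_layer(x, adj, rng.normal(size=(d, d)), rng.normal(size=2 * d))
```

Under this assumed rule, each token aggregates features from at most five neighbors (its four strongest links plus itself) rather than from all sixteen tokens. That is one way a graph can bias attention toward a sparse, token-specific neighborhood while the dense attention matrix still determines which links are kept.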

Rethinking Attention Mechanisms in Vision Transformers with Graph Structures

Constituent Attention for Vision Transformers

MaxViT: Multi-Axis Vision Transformer

Global Context Vision Transformers

Vision Transformer with Attention Map Hallucination and FFN Compaction

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Fusion of regional and sparse attention in Vision Transformers

HydraViT: Stacking Heads for a Scalable ViT

ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

Rethinking Vision Transformers for MobileNet Size and Speed

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets

RegionViT: Regional-to-Local Attention for Vision Transformers

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights

BViT: Broad Attention based Vision Transformer

Improving Vision Transformers by Revisiting High-Frequency Components

DctViT: Discrete Cosine Transform Meet Vision Transformers

ViT-LSLA: Vision Transformer with Light Self-Limited-Attention