Rethinking Attention Mechanisms in Vision Transformers with Graph Structures

Hyeongjin Kim, Byoung Chul Ko
DOI: https://doi.org/10.3390/s24041111
IF: 3.9
2024-02-09
Sensors
Abstract: In this paper, we propose a new type of vision transformer (ViT) based on graph head attention (GHA). Because the multi-head attention (MHA) of a pure ViT requires many parameters and tends to lose the locality of an image, we replace MHA with GHA by applying a graph to the attention head of the transformer. Consequently, the proposed GHA maintains both the locality and globality of the input patches and guarantees the diversity of the attention. The proposed GHA-ViT generally outperforms pure ViT-based models when trained from scratch on the small-sized CIFAR-10/100, MNIST, and MNIST-F datasets and on the medium-sized ImageNet-1K dataset. A Top-1 accuracy of 81.7% was achieved on ImageNet-1K using GHA-B, a base model with approximately 29 M parameters. In addition, on CIFAR-10/100, the number of parameters is reduced 17-fold relative to the existing ViT, while the performance increases by 0.4/4.3%, respectively. The proposed GHA-ViT shows promising results in terms of the number of parameters, the number of operations, and the level of accuracy in comparison with other state-of-the-art lightweight ViT models.
engineering, electrical & electronic; chemistry, analytical; instruments & instrumentation
What problem does this paper attempt to address?
This paper addresses several key issues that existing Vision Transformers (ViTs) face in image classification:

1. **Parameter redundancy in multi-head attention (MHA)**: The traditional ViT uses MHA, which requires a large number of parameters and tends to ignore the local structure of the image, which degrades performance.
2. **Balance between image locality and globality**: Existing ViT models often fail to maintain local and global features simultaneously. In addition, for high-resolution images, the large number of self-attention operations increases the computational complexity.
3. **Performance on small-scale datasets**: ViT models usually need a large amount of training data to perform well, and their performance drops significantly when they are trained on small-scale datasets.

To solve these problems, the authors propose a new Vision Transformer model (GHA-ViT) based on graph head attention (GHA). The main contributions of the paper are:

- **Introduction of graph head attention (GHA)**: A graph structure is introduced into the attention heads of the Transformer, replacing the traditional MHA. This reduces the number of parameters while maintaining the locality and globality of the image.
- **Graph generation and graph attention network (GAT)**: In the graph generation step, the adjacency matrix of the graph is extracted from the attention matrix. A GAT is then applied to update the node features, which enhances the diversity and locality of the attention.
- **Lower computational cost and improved performance**: GHA-ViT is trained from scratch on small-scale and medium-scale datasets without pre-training, and it still achieves performance comparable to, or better than, existing state-of-the-art lightweight ViT models.

With these improvements, GHA-ViT reduces the number of parameters and the computational complexity while achieving strong classification performance on multiple benchmark datasets.
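
The graph generation and GAT update described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: this summary does not state how the adjacency matrix is extracted from the attention matrix, so the top-k rule used here is an assumption, as are the single-head setup, the function names (`attention_matrix`, `adjacency_from_attention`, `gat_layer`), and all shapes. The update follows the standard GAT formulation, in which each edge is scored with a LeakyReLU and the scores are normalized by a softmax over each node's neighbours.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_matrix(x, w_q, w_k):
    # Scaled dot-product attention weights between all patch tokens.
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)


def adjacency_from_attention(attn, top_k):
    # ASSUMPTION: keep the top_k strongest links per token, plus a self-loop.
    # The paper's actual graph-generation rule may differ.
    n = attn.shape[0]
    idx = np.argsort(-attn, axis=-1)[:, :top_k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.arange(n)[:, None], idx] = True
    adj |= np.eye(n, dtype=bool)
    return adj


def gat_layer(x, adj, w, a_src, a_dst, slope=0.2):
    # Standard single-head GAT update restricted to the edges in adj.
    h = x @ w                                        # projected node features
    e = (h @ a_src)[:, None] + (h @ a_dst)[None, :]  # raw score for edge i -> j
    e = np.where(e > 0, e, slope * e)                # LeakyReLU
    e = np.where(adj, e, -np.inf)                    # mask out non-edges
    alpha = softmax(e, axis=-1)                      # normalize over neighbours
    return alpha @ h, alpha


# Toy run on 16 random "patch tokens" of dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 16, 8
x = rng.normal(size=(n, d))
attn = attention_matrix(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
adj = adjacency_from_attention(attn, top_k=4)
out, alpha = gat_layer(x, adj, rng.normal(size=(d, d)),
                       rng.normal(size=d), rng.normal(size=d))
print(out.shape, int(adj.sum()))
```

In this sketch, each token aggregates features only from the tokens it attends to most strongly, so the sparsity of `adj` controls how much of the global context each update sees. The `top_k` value is a free choice here and is not taken from the paper.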