LMSA: Low-Relation Multi-head Self-attention Mechanism in Visual Transformer

Jingjie Wang, Xiang Wei, Xiaoyu Liu, Siyang Lu
DOI: https://doi.org/10.1007/978-981-99-0923-0_61
2023-01-01
Abstract: Transformer backbone networks built around the self-attention mechanism have achieved great success in natural language processing and computer vision. However, compared with classical visual feature extraction methods, self-attention needs more training data to capture the relationships between tokens, which makes it challenging to train a Transformer effectively on small datasets. We design a novel lightweight self-attention mechanism, Low-relation Multi-head Self-Attention (LMSA), which outperforms recent self-attention variants and can fully explore the relationships between rare tokens. Specifically, LMSA breaks the dimensional consistency of the traditional self-attention mechanism so that the feature relationships focus on a small number of dimensions, thereby reducing computational complexity and storage requirements. Experimental results show that the dimensional consistency inside the traditional self-attention mechanism is unnecessary. In particular, with Swin as the backbone model, accuracy on the CIFAR-10 image classification task improves by 0.43%, while the resource consumption of a single self-attention module is reduced by 64.58% and both the number of model parameters and the model size are reduced by more than 15%. By appropriately condensing the self-attention relationship variables, the Transformer network can be more efficient and even perform better.
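The abstract does not spell out which dimensions are reduced. One plausible reading, assumed here and not confirmed by the text, is that the query and key projections map to fewer dimensions than the value projection. The minimal pure-Python sketch below illustrates that idea only; the function names, weights, and sizes are hypothetical and are not taken from the paper.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head attention; w_q and w_k may project to fewer dims than w_v.

    Assumption: the query/key width (d_qk) is decoupled from the value width.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_qk = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                       # (tokens x tokens)
    weights = [softmax([s / math.sqrt(d_qk) for s in row]) for row in scores]
    return matmul(weights, v)

def qk_param_count(d_model, d_qk):
    """Weights in the query and key projections (biases ignored)."""
    return 2 * d_model * d_qk

# Three tokens with 4 features each (illustrative values).
x = [[1.0, 0.0, 2.0, 1.0], [0.5, 1.0, 0.0, 1.0], [0.0, 2.0, 1.0, 0.5]]
w_q = [[0.1], [0.2], [0.3], [0.4]]                # 4 -> 1 dimension
w_k = [[0.4], [0.3], [0.2], [0.1]]                # 4 -> 1 dimension
w_v = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # 4 -> 4

out = attention(x, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 3 4
```

The output keeps the full value width even though the token-to-token scores are computed from a single query/key dimension. Under this assumption, `qk_param_count(96, 24)` is a quarter of `qk_param_count(96, 96)`; these sizes are illustrative and do not reproduce the reductions reported in the abstract.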