TripleFormer: improving transformer-based image classification method using multiple self-attention inputs

Yu Gong, Peng Wu, Renjie Xu, Xiaoming Zhang, Tao Wang, Xuan Li
DOI: https://doi.org/10.1007/s00371-024-03294-6
IF: 2.835
2024-03-03
The Visual Computer
Abstract: Transformer network structures have significantly improved performance on computer vision (CV) tasks. However, because Transformers take high-dimensional tokens as input and rely on a single tokenization method, Transformer-based image recognition is computationally intensive and tends to degrade local fine-grained features. In this paper, we incorporate a novel triple self-attention mechanism into a single encoder block and integrate it with the Transformer structure to introduce a new model, TripleFormer. First, modeling information inside a window rather than across the entire image, we present two sequence inputs from orthogonal perspectives, named Patch Attention In Spatial (PAIS) and Patch Attention In Channel (PAIC), to capture more detailed features. We further partition the channels of a feature map along the spatial dimensions and compute attention within the same channel. In this way, the proposed encoder block combines local feature extraction with long-range visual dependencies to boost feature learning capability. Finally, experiments on the ImageNet-1K and CIFAR100 datasets show that our models achieve lower FLOPs and complexity than other methods while maintaining similar accuracy. In addition, our models demonstrate competitive performance on small-scale datasets in comparison with other pure Transformer models.
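The abstract describes window-local attention computed over two orthogonal token views, one spatial and one channel-wise. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: the abstract does not give the exact formulation of PAIS or PAIC, so the function names (`window_partition`, `spatial_window_attention`, `channel_window_attention`), the use of identity query/key/value projections, and the non-overlapping window layout are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_partition(x, ws):
    # Split an (H, W, C) feature map into non-overlapping ws x ws windows.
    # Returns (num_windows, ws * ws, C). Assumes H and W are divisible by ws.
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, ws * ws, C)


def self_attention(tokens):
    # Scaled dot-product self-attention on (B, N, D) tokens.
    # Assumption: identity Q/K/V projections, to keep the sketch short.
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def spatial_window_attention(x, ws):
    # Spatial view (illustrative): each token is one position in the window,
    # so the attention map per window is (ws*ws) x (ws*ws).
    return self_attention(window_partition(x, ws))


def channel_window_attention(x, ws):
    # Channel view (illustrative): each token is one channel of the window,
    # so the attention map per window is C x C.
    wins = window_partition(x, ws)            # (nW, ws*ws, C)
    out = self_attention(wins.transpose(0, 2, 1))  # (nW, C, ws*ws)
    return out.transpose(0, 2, 1)             # back to (nW, ws*ws, C)


if __name__ == "__main__":
    feat = np.random.default_rng(0).normal(size=(8, 8, 6))
    print(spatial_window_attention(feat, 4).shape)
    print(channel_window_attention(feat, 4).shape)
```

In this sketch, restricting attention to a window of size ws bounds the spatial attention map at (ws*ws) x (ws*ws) per window rather than (H*W) x (H*W) for the whole image, which is the kind of cost reduction window-based attention is generally used for.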
computer science, software engineering