Enhanced Vision Transformer with Dual-Dimensional Self-Attention for Image Recognition

Zhenxiong Chang, Qingyu Cai
DOI: https://doi.org/10.1109/prai59366.2023.10332027
2023-01-01
Abstract: This paper presents an improved model based on the Vision Transformer that integrates additional self-attention mechanisms and one-dimensional convolution to strengthen the Vision Transformer block. The image is first divided into patches, and positional encoding is applied. Attention is then computed over the hidden variables, computed again over the patch dimension, and the result is mapped through a one-dimensional convolution. This mechanism captures more feature correlations and thereby increases the model’s expressive capacity. The approach yields significant improvements in image recognition performance, surpassing both traditional Vision Transformer models and conventional convolutional neural networks at comparable parameter counts and computational complexity. It is particularly effective on relatively small datasets, which supports the feasibility and efficiency of the proposed method and makes it a promising solution for practical applications across diverse domains.
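The block described in the abstract can be sketched in a few lines of NumPy. This is an illustrative reading, not the authors' implementation: the abstract does not give layer shapes, head counts, residual connections, or normalization, so the sketch assumes single-head scaled dot-product attention, reads "attention for the hidden variables" as attention applied to the transposed token matrix, and uses a same-padded convolution along the patch axis. All names (`self_attention`, `conv1d_same`, `init_params`, `dual_attention_block`) and the kernel size are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over the rows of x.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def conv1d_same(x, kernel):
    # 1D convolution along the patch axis with zero "same" padding.
    # x: (N, D), kernel: (K, D, D_out) with odd K; returns (N, D_out).
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        np.einsum("kd,kde->e", xp[i:i + k], kernel)
        for i in range(x.shape[0])
    ])

def init_params(n_patches, dim, kernel_size=3, rng=None):
    # Hypothetical random initialization, for illustration only.
    rng = np.random.default_rng() if rng is None else rng
    def mats(size):
        return [rng.normal(scale=size ** -0.5, size=(size, size)) for _ in range(3)]
    return {
        "hidden": mats(n_patches),  # projections used when hidden units act as tokens
        "patch": mats(dim),         # projections used when patches act as tokens
        "conv": rng.normal(scale=(kernel_size * dim) ** -0.5,
                           size=(kernel_size, dim, dim)),
    }

def dual_attention_block(x, params):
    # x: (N, D) patch embeddings with positional encoding already added.
    # Step 1 (assumed reading): attention over the hidden dimension.
    # Transposing makes each hidden unit a token, so the attention map is (D, D).
    h = self_attention(x.T, *params["hidden"]).T
    # Step 2: attention over the patch dimension; the attention map is (N, N).
    p = self_attention(h, *params["patch"])
    # Step 3: map the result with a one-dimensional convolution.
    return conv1d_same(p, params["conv"])

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))             # 16 patches, 8 hidden units (toy sizes)
params = init_params(16, 8, rng=rng)
out = dual_attention_block(x, params)    # shape (16, 8)
```

Under this reading the first attention map relates hidden units to each other and the second relates patches to each other, which is one way to interpret the claim that the block captures correlations along both dimensions. Note that the hidden-dimension projections here depend on the number of patches, so the sketch assumes a fixed patch count.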