Multistage attention region supplement transformer for fine-grained visual categorization

Aokun Mei, Hua Huo, Jiaxin Xu, Ningya Xu
DOI: https://doi.org/10.1007/s00371-024-03502-3
IF: 2.835
2024-06-19
The Visual Computer
Abstract: Fine-grained image classification uses neural network models to distinguish between instances of different classes that share very similar visual content. Learning to extract nuanced representations of selected object details is therefore essential. This paper introduces a novel fine-grained visual categorization model, the Multistage Attention Region Supplement Transformer (MARS-Trans), which is based on the Vision Transformer (ViT). Our main contributions are threefold. First, we observed that in the ViT's multi-head attention module, the softmaxed feature results of each attention head are directly concatenated and multiplied by weights. Consequently, we propose a Multistage Attention Module (MAM) to grade the attention heads based on their importance. Second, we introduce a Region Supplement Module (RSM) to suppress non-critical regions and enhance edge information in key areas, thereby highlighting the discriminative features. Finally, we employ our proposed Approximate Adjust Method (AAM) to refine the final features and improve categorization results. We conducted comprehensive experiments with MARS-Trans on five popular public fine-grained image datasets, validating the effectiveness of these modules. Our model achieved state-of-the-art (SOTA) accuracy on one dataset and SOTA average precision on three datasets. The code is available at https://github.com/ArrikenMei/MARS-Trans.
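For readers unfamiliar with the observation that motivates the MAM, the following is a minimal NumPy sketch of standard multi-head self-attention, in which the per-head outputs are concatenated and multiplied by a single output weight matrix. It illustrates only the baseline ViT behaviour described in the abstract and is not the authors' MAM, RSM, or AAM. The function name `multi_head_attention` and the parameter names `w_q`, `w_k`, `w_v`, `w_o` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Standard multi-head self-attention (illustrative sketch).

    x: (n_tokens, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    Returns the output (n_tokens, d_model) and the per-head
    attention maps (num_heads, n_tokens, n_tokens).
    """
    n, d = x.shape
    d_head = d // num_heads

    def split(t):
        # (n, d) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention, softmaxed over the key axis per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = attn @ v  # (num_heads, n, d_head)

    # The step the abstract refers to: head outputs are concatenated
    # directly, then multiplied by one output weight matrix.
    concat = heads.transpose(1, 0, 2).reshape(n, d)
    return concat @ w_o, attn
```

A quick check with random inputs: for `x` of shape `(5, 8)` and `num_heads=2`, the output has shape `(5, 8)` and each row of every attention map sums to 1.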
computer science, software engineering