MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Qi Hang, Xuefeng Yan, Lina Gong
DOI: https://doi.org/10.1007/978-981-97-2387-4_15
2024-01-01
Abstract: In fine-grained visual classification, fusing local and global information is crucial. However, current methods based on the vision transformer (ViT) tend to focus only on selecting discriminative patch tokens, ignoring how the rich global and semantic information in classification tokens varies across layers. To address this limitation, we propose a novel framework dubbed MFF-Trans that considers the mutual relationships between all tokens. Specifically, we put forward the important token election module (ITEM), which uses the multi-head self-attention mechanism of the vision transformer to evaluate the importance of all tokens. This module guides the model to select, at each ViT layer, tokens that contain discriminative local information and global information with different semantics. Meanwhile, to enhance the model’s perception of the semantic connections between selected patch tokens, we further introduce the semantic connection enhancing module (SCEM), which uses a graph convolutional network to mine the structural information between them in the deep layers of the vision transformer. Extensive experiments on three benchmark datasets indicate that MFF-Trans achieves satisfactory performance compared with other methods, reaching 92.1% on CUB, 95.4% on Stanford Cars, and 92.3% on Stanford Dogs.
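
The abstract does not say how ITEM turns attention into an importance score, so the following is only a minimal sketch of one plausible reading: score each token by the attention it receives, averaged over heads and query tokens, and keep the top k. The function name `select_important_tokens`, the averaging rule, and the parameter `k` are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def select_important_tokens(attn, k):
    """Pick the k tokens that receive the most attention in one layer.

    attn: array of shape (heads, tokens, tokens); each row is one query
          token's attention distribution over all tokens.
    Returns (indices, scores): the selected token indices in ascending
    order, and the per-token score used for ranking.
    NOTE: the scoring rule is an assumption, not taken from the paper.
    """
    attn = np.asarray(attn, dtype=float)
    # Average over heads, then over query tokens: attention received per token.
    scores = attn.mean(axis=0).mean(axis=0)
    # Stable sort on the negated scores ranks tokens from most to least attended.
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top), scores


# Toy layer: 2 heads, 4 tokens, every query attends mostly to token 2.
row = np.array([0.1, 0.2, 0.6, 0.1])
attn = np.tile(row, (2, 4, 1))
indices, scores = select_important_tokens(attn, k=2)
print(indices.tolist())  # [1, 2]
```

In a multi-layer setting, a function like this would be called once per ViT layer on that layer's attention weights, so that the selected set can differ from layer to layer.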