TS-ViT: feature-enhanced transformer via token selection for fine-grained image recognition
Yingge Wang, Hu Liang, Changchun Wen, Shengrong Zhao
DOI: https://doi.org/10.1007/s11760-024-03693-1
IF: 1.583
2024-12-16
Signal Image and Video Processing
Abstract: Fine-grained image recognition is a challenging task that focuses on distinguishing images from similar subordinate categories. Recently, methods based on the vision transformer (ViT) have achieved remarkable results in fine-grained image recognition, because their inherent multi-head self-attention (MHSA) can effectively capture the discriminative regions in images. However, most of these ViT-based methods ignore the channel relationships of image features, and they suffer from inconsistent learning performance across the heads in MHSA and across the layers in ViT. To address these issues, a feature-enhanced transformer named TS-ViT is proposed. TS-ViT includes three key modules: soft channel attention (Soft-CA), multi-head token selection (MHTS), and multi-level feature enhancement (MLFE). The Soft-CA enables the model to concentrate on the relationships among the channels of image features. The MHTS addresses the inconsistent learning performance of the attention heads: it selects tokens at discriminative region positions based on attention maps to form the multi-level feature. The MLFE employs contrastive learning and enhanced feature extraction to utilize the multi-level features effectively while mitigating background noise. Extensive experiments demonstrate that TS-ViT achieves superior performance compared to popular methods, with average accuracies of 91.8%, 91.2%, 99.5%, and 93.9% on the four experimental data sets, respectively. Furthermore, TS-ViT performs well in terms of computational complexity and efficiency, with an average parameter count of 93.5M, FLOPs of 73.2G, a training time of 7.8 h, and an inference time of 4.1 ms.
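The token-selection idea described for MHTS can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the selection rule, so the sketch assumes that each head picks the patch tokens that receive the most attention from the class token, and that the picks of all heads are merged. The function name `select_tokens_per_head`, the parameter `k`, and the convention that the class token sits at index 0 are all assumptions made for illustration.

```python
import numpy as np


def select_tokens_per_head(attn, tokens, k=1):
    """Pick the k most-attended patch tokens for each attention head.

    Hypothetical helper, not taken from the paper.
    attn:   array of shape (heads, N + 1, N + 1), attention weights of one
            layer; index 0 is assumed to be the class token.
    tokens: array of shape (N + 1, D), hidden states of the same layer.
    Returns (indices, selected_tokens), where indices are sorted, unique
    positions into `tokens`.
    """
    # Attention paid by the class token to every patch token: (heads, N).
    cls_to_patch = attn[:, 0, 1:]
    # Positions of the k largest scores in each head: (heads, k).
    topk = np.argsort(-cls_to_patch, axis=1)[:, :k]
    # Shift by 1 to skip the class token, then merge heads without duplicates.
    indices = np.unique(topk.ravel() + 1)
    return indices, tokens[indices]


if __name__ == "__main__":
    # Toy input: 2 heads, 1 class token + 3 patch tokens, feature size 2.
    attn = np.zeros((2, 4, 4))
    attn[0, 0] = [0.1, 0.2, 0.6, 0.1]  # head 0 favours patch token 2
    attn[1, 0] = [0.1, 0.1, 0.1, 0.7]  # head 1 favours patch token 3
    tokens = np.arange(8, dtype=float).reshape(4, 2)
    idx, selected = select_tokens_per_head(attn, tokens, k=1)
    print(idx)       # positions chosen across heads
    print(selected)  # the corresponding token features
```

Selecting per head rather than from a head-averaged map means a weak head cannot suppress a position that another head considers discriminative, which is one plausible reading of how token selection could address inconsistent multi-head performance.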
engineering, electrical & electronic; imaging science & photographic technology