PS-DeiT: A Part-Selection Based DeiT for Fine-Grained Classification

Huan Gao, Yu Guo, Tingting Zhao, Zhiqiang Hu, Yarui Chen, Ning Xie
DOI: https://doi.org/10.1007/978-981-97-5612-4_19
2024-01-01
Abstract: Fine-grained visual classification (FGVC) is a highly challenging task because inter-class differences are inherently subtle while intra-class differences are large. Researchers have attempted to address this challenge with Convolutional Neural Network (CNN)-based and Transformer-based methods, each of which has its own advantages. To combine the merits of both families and learn the latent features of fine-grained images efficiently, we introduce knowledge distillation to the field of FGVC for the first time and propose a novel method, the Part-Selection based Data-efficient image Transformer (PS-DeiT), which incorporates the strengths of both CNN and Transformer models. More specifically, we propose a Part Selection Module that selects the most discriminative image regions and excludes irrelevant ones, and we employ a contrastive loss function that measures the similarity between images to distinguish easily confused classes. Finally, we demonstrate the effectiveness of PS-DeiT on four popular fine-grained datasets, CUB-200-2011, Stanford Cars, Stanford Dogs and NABirds, on which it achieves accuracies of 90.8%, 95.0%, 95.1% and 90.8%, respectively. Furthermore, we illustrate the effect of the Part Selection Module through visualization; the results show that the attention of the proposed method focuses more on the object to be recognized.
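The abstract does not give the form of the contrastive loss, so the following is only a minimal sketch of one common pairwise formulation, not the authors' definition. It assumes cosine similarity between image feature vectors, pulls same-class pairs toward similarity 1, and penalizes different-class pairs only when their similarity exceeds a margin. The function names, the `margin` parameter, and its default value of 0.4 are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(features, labels, margin=0.4):
    """Pairwise contrastive loss over a batch (illustrative assumption).

    Same-class pairs contribute (1 - similarity); different-class pairs
    contribute max(similarity - margin, 0). The sum over all ordered
    pairs, including each sample with itself, is divided by batch_size**2.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = cosine_similarity(features[i], features[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim  # pull same-class features together
            else:
                total += max(sim - margin, 0.0)  # push confusable classes apart
    return total / (n * n)


# Two classes with orthogonal features: nothing to penalize.
separated = contrastive_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0, 0, 1])

# Two different classes with identical features: the confusable case.
confused = contrastive_loss([[1.0, 0.0], [1.0, 0.0]], [0, 1])
```

In this sketch the well-separated batch yields zero loss, while the batch in which two different classes share the same feature vector is penalized, which is the behavior the abstract attributes to the contrastive term.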