Attention-based Multi-scale ViT Fine-grained Visual Classification

Junya Liu, Zhen Yang, Rujia Li, Xin Zhou, Zhijian Yin
DOI: https://doi.org/10.1145/3577530.3577586
2022-12-09
Abstract: Fine-grained visual classification (FGVC) is a challenging image classification task because the differences between classes are small while the differences within each subclass are large. Early methods relied mainly on bounding box annotations and on attention mechanisms integrated into CNNs. In recent years, the Vision Transformer (ViT) has begun to show better performance in image classification, object detection, and object tracking. To further investigate the performance of ViT in FGVC, this paper combines a CNN with ViT and introduces a dual-path hierarchy into the pyramid structure: a top-down feature path and a bottom-up channel-spatial attention path. DropBlock is used to localize discriminative regions accurately, and SENet and global covariance pooling (GCP) are used to further strengthen the network's ability to extract information from feature maps. The proposed model, Attention-based Multi-scale ViT Fine-grained Visual Classification (AMViT-CNN), achieves good classification results on the public fine-grained datasets CUB-200-2011 and Stanford-Cars.
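As an illustration of the global covariance pooling (GCP) step named in the abstract, the following is a minimal sketch in plain Python. It assumes the feature map has been flattened to N spatial positions with C channels each, and it computes only the plain C x C channel covariance. The abstract does not say which GCP variant or normalization the authors use, so the function name `global_covariance_pooling` and the toy input are hypothetical and are not the paper's implementation.

```python
from typing import List


def global_covariance_pooling(features: List[List[float]]) -> List[List[float]]:
    """Pool an N x C feature map into a C x C channel covariance matrix.

    `features` holds one row per spatial position and one column per channel.
    This is a sketch of plain covariance pooling; any normalization the paper
    may apply afterwards is not modeled here (assumption).
    """
    n = len(features)
    if n == 0:
        raise ValueError("features must contain at least one spatial position")
    c = len(features[0])

    # Per-channel mean over all spatial positions.
    means = [sum(row[j] for row in features) / n for j in range(c)]

    # Subtract the channel means so the second-order statistics are centered.
    centered = [[row[j] - means[j] for j in range(c)] for row in features]

    # Covariance between every pair of channels, averaged over positions.
    return [
        [sum(r[i] * r[j] for r in centered) / n for j in range(c)]
        for i in range(c)
    ]


# Toy feature map (hypothetical values): 4 spatial positions, 2 channels.
feature_map = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0], [7.0, 14.0]]
cov = global_covariance_pooling(feature_map)
print(cov)
```

Unlike global average pooling, which keeps one mean per channel, the covariance matrix also records how channels vary together, which is the second-order information GCP is meant to expose to the classifier.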
Computer Science