A Sequence-selective Fine-grained Image Recognition Strategy Using Vision Transformer

Yulin Cai, Haoqian Wang, Xingzheng Wang
DOI: https://doi.org/10.1109/ist55454.2022.9827667
2022-01-01
Abstract: Aiming at precise sub-category classification of images, fine-grained image recognition requires algorithms with a strong ability to extract subtle features. Recently, the Transformer architecture has been successfully applied to vision tasks, offering a new way to improve the feature extraction performance of fine-grained image recognition algorithms. However, fine-grained image datasets are usually quite limited in size, which is unfavorable for the data-hungry training process of Transformers. To increase the amount of data available for training, we first introduce a stochastic image data augmentation method for Vision Transformer (ViT), which uses a Dense-DETR model to extract feature regions and performs random insertion and removal on the transformed patch sequence. To select the most informative sequence elements during forward propagation, we implement a feature patch selection strategy by adding a convolutional network structure to the ViT encoders. Inspired by active learning, a contrastive loss that uses the posterior information of paired images is also introduced as a penalty term in ViT's cross-entropy loss objective. These strategies help the ViT extract the most discriminative feature information from its input. Extensive experiments show that the proposed sequence-selective Vision Transformer reaches the highest recognition accuracies on several frequently used fine-grained image datasets.
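
The random insertion and removal step on the patch sequence can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment_patch_sequence`, the parameters `drop_prob` and `n_insert`, and the array shapes are all hypothetical, and the Dense-DETR region extraction is not modeled. The `region_patches` array simply stands in for embeddings of whatever feature regions a detector would return.

```python
import numpy as np

def augment_patch_sequence(patches, region_patches, drop_prob=0.1, n_insert=2, rng=None):
    """Randomly remove patches, then insert region patches (illustrative only).

    patches:        (N, D) array of patch embeddings (assumed layout).
    region_patches: (M, D) array standing in for detector-selected regions.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Random removal: keep each patch with probability 1 - drop_prob.
    keep = rng.random(len(patches)) >= drop_prob
    if not keep.any():
        # Never return an empty sequence: keep one patch at random.
        keep[rng.integers(len(patches))] = True
    seq = list(patches[keep])

    # Random insertion: place sampled region patches at random positions.
    for _ in range(n_insert):
        src = region_patches[rng.integers(len(region_patches))]
        pos = int(rng.integers(len(seq) + 1))
        seq.insert(pos, src)

    return np.stack(seq)

# Example call with made-up sizes (196 patches of dimension 768, 8 region patches).
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))
regions = rng.normal(size=(8, 768))
augmented = augment_patch_sequence(patches, regions, rng=rng)
```

Because the output length varies from call to call, a real training pipeline would also need padding or masking before batching; the abstract does not say how this is handled.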