Learning More Discriminative Clues with Gradual Attention for Fine-Grained Visual Categorization.

Qin Xu,Mengquan Zhang,Yun Li,Zhifu Tao
DOI: https://doi.org/10.1016/j.imavis.2023.104753
2022-01-01
SSRN Electronic Journal
Abstract:Fine-grained visual categorization, which aims to identify the different subcategories of images within the same category, is a very challenging task due to the large intra-class differences and subtle inter-class variances. The existing methods mostly focus on the salient local regions and ignore other features which probably help to recognize the images more precisely. To address this issue, in this paper, we propose a novel end-to-end network composed of the self-calibrated convolution, gradual attention module and feature inverse module for fine-grained visual categorization. To extract the salient features, the self-calibrated convolution is exploited which can avoid the influence of irrelevant information and locate salient regions more accurately. In aiming to extract the discriminative features, we propose the gradual attention module which consists of alternate channel-spatial attention and hierarchical feature grouping. The gradual attention module can extract the subtle discriminative features gradually even when the semantic information of shallow stages is not rich. Moreover, we design the feature inverse module which forces the next stage of network to search for other different useful features by feature inverse. The gradual attention module combined with the feature inverse module is capable of finding more detailed regions and of benefit to improving classification performance. Finally, the stage features and fused features are jointly used for classification. The proposed method is evaluated on three classical fine-grained image datasets and compared with a number of state-of-the-art methods. Our method achieves 89.5%, 95.2% and 93.9% accuracies on CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets respectively. The experimental results demonstrate the effectiveness and superiority of the proposed method.
What problem does this paper attempt to address?