Abstract: Locating distinct part regions and extracting their distinguishing features has brought significant improvements to fine-grained visual classification (FGVC). Attention mechanisms have become one of the mainstream approaches to this feature extraction in computer vision, but they have limitations. They typically focus on the most discriminative regions and directly combine the features of those parts, neglecting less prominent yet still discriminative regions. They may also fail to fully exploit the intrinsic connections between higher-order and lower-order features to improve classification performance. Considering the potential relationships between different higher-order feature representations of the object image allows the integrated higher-order features to contribute more to the model's classification decisions. To this end, we propose a saliency feature suppression and cross-feature fusion network (SFSCF-Net) that explores interaction learning between different higher-order feature representations. The model comprises three components. (1) An object-level image generator (OIG): the intersection of the output feature maps of the last two convolutional blocks of the backbone network is used as an object mask and mapped onto the original image, which is then cropped to obtain an object-level image; this effectively reduces the interference caused by complex backgrounds. (2) A saliency feature suppression module (SFSM): a feature extractor locates the most distinguishing part of the object image, and that part is masked by a two-dimensional suppression method, which improves the accuracy of feature suppression. (3) A cross-feature fusion method (CFM) based on inter-layer interaction: the output feature maps of different network layers are interactively integrated to obtain high-dimensional features, which are then channel-compressed to obtain an inter-layer interaction feature representation that enriches the semantic information of the output features. The proposed SFSCF-Net can be trained end-to-end and achieves state-of-the-art or competitive results on four FGVC benchmark datasets.
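The three components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the object mask thresholds each channel-summed activation map at its mean, "two-dimensional suppression" is read as zeroing a rectangular window along both spatial axes, and the inter-layer interaction is a per-location outer product followed by a linear channel projection. All function and parameter names (`object_box`, `crop_object`, `suppress_peak`, `cross_layer_fuse`, `half`, `w`) are hypothetical.

```python
import numpy as np

def object_box(feat_a, feat_b):
    """OIG sketch: inclusive box (top, bottom, left, right) in feature-map cells.

    feat_a, feat_b: (C, H, W) outputs of two conv blocks, assumed to share H and W.
    """
    act_a, act_b = feat_a.sum(axis=0), feat_b.sum(axis=0)
    # Assumption: a cell belongs to the object if it exceeds the map's mean.
    mask = (act_a > act_a.mean()) & (act_b > act_b.mean())
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:  # empty intersection: fall back to the full map
        return 0, mask.shape[0] - 1, 0, mask.shape[1] - 1
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

def crop_object(image, box, feat_hw):
    """Map a feature-map box to pixel coordinates and crop. image: (H_img, W_img, 3)."""
    top, bottom, left, right = box
    sy = image.shape[0] / feat_hw[0]
    sx = image.shape[1] / feat_hw[1]
    return image[int(top * sy):int((bottom + 1) * sy),
                 int(left * sx):int((right + 1) * sx)]

def suppress_peak(image, act, half=1):
    """SFSM sketch: zero the image window around the strongest cell of act (H, W)."""
    r, c = np.unravel_index(np.argmax(act), act.shape)
    sy = image.shape[0] / act.shape[0]
    sx = image.shape[1] / act.shape[1]
    # Window is clipped to the map borders along both spatial axes.
    r0, r1 = max(r - half, 0), min(r + half + 1, act.shape[0])
    c0, c1 = max(c - half, 0), min(c + half + 1, act.shape[1])
    out = image.copy()
    out[int(r0 * sy):int(r1 * sy), int(c0 * sx):int(c1 * sx)] = 0
    return out

def cross_layer_fuse(x, y, w):
    """CFM sketch: x (C1, H, W), y (C2, H, W), w (D, C1*C2) -> (D, H, W)."""
    c1, h, wd = x.shape
    c2 = y.shape[0]
    # Per-location outer product yields C1*C2 interaction channels.
    inter = np.einsum("ihw,jhw->ijhw", x, y).reshape(c1 * c2, h, wd)
    # Channel compression, equivalent to a 1x1 convolution with weights w.
    return np.einsum("dk,khw->dhw", w, inter)
```

In a trained network the projection `w` would be learned and the two feature maps would first be resized to a common spatial resolution; both steps are omitted here for brevity.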
