Abstract: Locating distinct part regions and extracting their discriminative features has brought significant improvements to fine-grained visual classification (FGVC). Using attention mechanisms for feature extraction has become one of the mainstream approaches in computer vision, but attention-based methods have limitations. They typically focus on the most discriminative regions and directly combine the features of those parts, neglecting less prominent yet still discriminative regions. They may also fail to fully explore the intrinsic connections between higher-order and lower-order features that could improve classification performance. Considering the potential relationships between different higher-order feature representations of the object image allows the integrated higher-order features to contribute more to the model's classification decisions. To this end, we propose a saliency feature suppression and cross-feature fusion network (SFSCF-Net) that explores interaction learning between different higher-order feature representations. The model comprises three components. (1) An object-level image generator (OIG): the intersection of the output feature maps of the last two convolutional blocks of the backbone network is used as an object mask, which is mapped to the original image and used to crop an object-level image, effectively reducing the interference caused by complex backgrounds. (2) A saliency feature suppression module (SFSM): a feature extractor locates the most discriminative part of the object image, and that part is masked by a two-dimensional suppression method, which improves the accuracy of feature suppression. (3) A cross-feature fusion method (CFM) based on inter-layer interaction: the output feature maps of different network layers are interactively integrated to obtain high-dimensional features, which are then channel-compressed to obtain an inter-layer interaction feature representation that enriches the semantic information of the output features. The proposed SFSCF-Net can be trained end-to-end and achieves state-of-the-art or competitive results on four FGVC benchmark datasets.
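
The three components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the mask is thresholded, what shape the suppressed region has, or how the layers interact, so every such detail is an assumption. Specifically, the sketch assumes that each mask keeps positions whose channel-summed activation exceeds its mean, that the two feature maps share one spatial size, that suppression zeroes a rectangular window (bounded in both height and width) around the activation peak, and that inter-layer interaction is an element-wise product of linearly projected features. All function and parameter names (`object_mask`, `crop_object`, `suppress_salient`, `cross_layer_fusion`, `frac`, `proj_a`, `proj_b`, `compress`) are hypothetical.

```python
import numpy as np


def object_mask(feat):
    """Binary mask from a (C, H, W) feature map.

    Assumption: a position belongs to the object if its channel-summed
    activation is above the spatial mean.
    """
    act = feat.sum(axis=0)
    return act > act.mean()


def crop_object(image, feat_a, feat_b):
    """OIG-style crop: intersect the two masks and crop their bounding box.

    image: (H_img, W_img, 3); feat_a, feat_b: (C, H, W) maps from the last
    two blocks, assumed here to have the same spatial size.
    """
    mask = object_mask(feat_a) & object_mask(feat_b)
    if not mask.any():
        return image  # no overlap: fall back to the full image
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    # Scale the bounding box from feature-grid cells to image pixels.
    sy = image.shape[0] / mask.shape[0]
    sx = image.shape[1] / mask.shape[1]
    y0, y1 = int(rows[0] * sy), int(np.ceil((rows[-1] + 1) * sy))
    x0, x1 = int(cols[0] * sx), int(np.ceil((cols[-1] + 1) * sx))
    return image[y0:y1, x0:x1]


def suppress_salient(image, act, frac=0.25):
    """SFSM-style suppression: zero a window around the activation peak.

    act: (h, w) activation map; frac: window side as a fraction of the
    image side (an illustrative choice, not a value from the paper).
    """
    h, w = act.shape
    H, W = image.shape[:2]
    r, c = np.unravel_index(np.argmax(act), act.shape)
    # Centre of the peak cell, expressed in image pixels.
    cy, cx = int((r + 0.5) * H / h), int((c + 0.5) * W / w)
    hh, hw = int(H * frac / 2), int(W * frac / 2)
    y0, y1 = max(0, cy - hh), min(H, cy + hh)
    x0, x1 = max(0, cx - hw), min(W, cx + hw)
    out = image.copy()
    out[y0:y1, x0:x1] = 0  # mask the region in both spatial dimensions
    return out


def cross_layer_fusion(feat_a, feat_b, proj_a, proj_b, compress):
    """CFM-style fusion of two layers' feature maps.

    feat_a: (Ca, H, W), feat_b: (Cb, H, W); proj_a: (D, Ca), proj_b: (D, Cb)
    lift both to a shared high dimension D; compress: (K, D) reduces the
    pooled interaction to K channels.
    """
    xa = proj_a @ feat_a.reshape(feat_a.shape[0], -1)  # (D, H*W)
    xb = proj_b @ feat_b.reshape(feat_b.shape[0], -1)  # (D, H*W)
    interaction = xa * xb              # element-wise inter-layer interaction
    pooled = interaction.sum(axis=1)   # sum-pool over spatial positions
    return compress @ pooled           # channel compression to (K,)
```

In a real model the feature maps would come from the backbone and the projection and compression matrices would be learned (for example as 1x1 convolutions); plain arrays are used here only to keep the sketch self-contained.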

Integrating Foreground–background Feature Distillation and Contrastive Feature Learning for Ultra-Fine-grained Visual Classification

Fine-Grained Visual Categorization With Fine-Tuned Segmentation

Learning Enhanced Features and Inferring Twice for Fine-Grained Image Classification

FET-FGVC: Feature-enhanced transformer for fine-grained visual classification

Multi-Granularity Feature Distillation Learning Network for Fine-Grained Visual Classification

Feature Re-Attention and Multi-Layer Feature Fusion for Fine-Grained Visual Classification

Cross-layer Progressive Attention Bilinear Fusion Method for Fine-Grained Visual Classification

Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples

Leveraging Fine-Grained Labels to Regularize Fine-Grained Visual Classification

CLE-ViT: Contrastive Learning Encoded Transformer for Ultra-Fine-Grained Visual Categorization

Channel Boosting, Cross-Layer Feature Integration, and Multi-Scale Classification for Fine-Grained Visual Classification

Bridge the gap between supervised and unsupervised learning for fine-grained classification

Fine-Grained Visual Classification with Aggregated Object Localization and Salient Feature Suppression

Exploiting Category Similarity-Based Distributed Labeling for Fine-Grained Visual Classification

Fine-Grained Visual Classification Via Simultaneously Learning of Multi-regional Multi-grained Features

Granularity-aware Distillation and Structure Modeling Region Proposal Network for Fine-Grained Image Classification

Dual Transformer with Multi-Grained Assembly for Fine-Grained Visual Classification

Data-free Knowledge Distillation for Fine-grained Visual Categorization

Diving into Continual Ultra-fine-grained Visual Categorization

Progressive Self-Guided Hardness Distillation for Fine-Grained Visual Classification

Significant feature suppression and cross-feature fusion networks for fine-grained visual classification