Abstract: Fine-grained visual classification (FGVC) has improved significantly through techniques that locate different part regions and extract distinguishing features from them. Attention mechanisms have become one of the mainstream approaches to feature extraction in computer vision, but these methods have certain limitations. They typically focus on the most discriminative regions and directly combine the features of those parts, neglecting other less prominent yet still discriminative regions. They may also fail to fully explore the intrinsic connections between higher-order and lower-order features that could improve classification performance. By considering the potential relationships between different higher-order feature representations of the object image, the integrated higher-order features can contribute more to the model's classification decisions. To this end, we propose a saliency feature suppression and cross-feature fusion network (SFSCF-Net) that explores interaction learning between different higher-order feature representations. The model comprises three components. (1) An object-level image generator (OIG): the intersection of the output feature maps of the last two convolutional blocks of the backbone network is used as an object mask and mapped onto the original image, which is then cropped to obtain an object-level image; this effectively reduces the interference caused by complex backgrounds. (2) A saliency feature suppression module (SFSM): a feature extractor locates the most distinguishing part of the object image, and that part is masked by a two-dimensional suppression method, which improves the accuracy of feature suppression. (3) A cross-feature fusion method (CFM) based on inter-layer interaction: the output feature maps of different network layers are interactively integrated to obtain high-dimensional features, which are then compressed along the channel dimension to obtain an inter-layer interaction feature representation that enriches the semantic information of the output features. The proposed SFSCF-Net can be trained end-to-end and achieves state-of-the-art or competitive results on four FGVC benchmark datasets.
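
The OIG step described in the abstract (intersect two activation maps, then crop the image to the resulting mask) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes that each feature map has already been reduced to a single 2D activation map of the same spatial size, and that each map is binarized by thresholding at its own mean; the abstract specifies neither detail. The function names `binary_mask`, `intersect_masks`, `bounding_box`, and `scale_box` are hypothetical.

```python
def binary_mask(activation):
    """Binarize a 2D activation map by thresholding at its mean.

    The mean threshold is an assumption made for illustration only.
    """
    values = [v for row in activation for v in row]
    mean = sum(values) / len(values)
    return [[v > mean for v in row] for row in activation]


def intersect_masks(mask_a, mask_b):
    """Keep only the positions that are active in both masks."""
    return [[a and b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, or None if the mask is empty."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, on in enumerate(row) if on]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


def scale_box(box, map_size, image_size):
    """Map a feature-map box to image pixels as (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    scale_y = image_size[0] / map_size[0]
    scale_x = image_size[1] / map_size[1]
    return (int(top * scale_y), int(left * scale_x),
            int((bottom + 1) * scale_y), int((right + 1) * scale_x))


# Toy 4x4 activation maps standing in for the last two convolutional blocks.
map_a = [[0, 0, 0, 0],
         [0, 5, 6, 0],
         [0, 7, 8, 1],
         [0, 0, 0, 0]]
map_b = [[0, 0, 0, 0],
         [3, 4, 5, 0],
         [0, 6, 7, 0],
         [0, 0, 0, 0]]

object_mask = intersect_masks(binary_mask(map_a), binary_mask(map_b))
box = bounding_box(object_mask)
crop = scale_box(box, map_size=(4, 4), image_size=(8, 8))
```

In this toy case the intersection discards the positions that only one map marks as active, so the crop follows the region on which both maps agree.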

FET-FGVC: Feature-enhanced transformer for fine-grained visual classification

Fine-Grained Visual Categorization With Fine-Tuned Segmentation

TransFG: A Transformer Architecture for Fine-Grained Recognition

AA-Trans: Core Attention Aggregating Transformer with Information Entropy Selector for Fine-grained Visual Classification

MFF-Trans: Multi-level Feature Fusion Transformer for Fine-Grained Visual Classification

Attention-based Multi-scale ViT Fine-grained Visual Classification

Dual Transformer with Multi-Grained Assembly for Fine-Grained Visual Classification

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

Significant feature suppression and cross-feature fusion networks for fine-grained visual classification

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Cross-layer Navigation Convolutional Neural Network for Fine-grained Visual Classification

Hybrid ViT-CNN Network for Fine-Grained Image Classification

CLE-ViT: Contrastive Learning Encoded Transformer for Ultra-Fine-Grained Visual Categorization

PEDTrans: A Fine-Grained Visual Classification Model for Self-attention Patch Enhancement and Dropout

Multi-level information fusion Transformer with background filter for fine-grained image recognition

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

On the Imaginary Wings: Text-Assisted Complex-Valued Fusion Network for Fine-Grained Visual Classification

Graph-in-graph Discriminative Feature Enhancement Network for Fine-Grained Visual Classification

A multimodal hyper-fusion transformer for remote sensing image classification