Abstract: Self-supervised learning is emerging in fine-grained visual recognition with promising results. However, existing self-supervised learning methods are often susceptible to irrelevant patterns in self-supervised tasks and lack the capability to represent the subtle differences inherent in fine-grained visual recognition (FGVR), resulting in generally poorer performance. To address this, we propose a novel Priority-Perception Self-Supervised Learning framework, denoted as PP-SSL, which can effectively filter out irrelevant feature interference and extract more subtle discriminative features throughout the training process. Specifically, it comprises two main parts: the Anti-Interference Strategy (AIS) and the Image-Aided Distinction Module (IADM). In AIS, a fine-grained textual description corpus is established, and a knowledge distillation strategy is devised to guide the model in eliminating irrelevant features while enhancing the learning of more discriminative and high-quality features. IADM builds on the observation that extracting GradCAM from the original image effectively reveals subtle differences between fine-grained categories. Compared to features extracted from intermediate or output layers, the original image retains more detail, allowing for a deeper exploration of the subtle distinctions among fine-grained classes. Extensive experimental results indicate that PP-SSL significantly outperforms existing methods across various datasets, highlighting its effectiveness in fine-grained recognition tasks. Our code will be made publicly available upon publication.

### What problems does this paper attempt to solve?

This paper aims to solve two main problems existing in the current self-supervised learning (SSL) methods in fine-grained visual recognition (FGVR) tasks:

1. **Interference from irrelevant features**:
   - In self-supervised learning tasks, the model is easily affected by patterns irrelevant to the task (such as background noise). These irrelevant features can lead to feature entanglement and affect the discrimination between fine-grained categories.
   - When dealing with FGVR tasks, the existing SSL methods often fail to effectively filter out these irrelevant features, resulting in a decline in performance.
2. **Insufficient representation of fine-grained features**:
   - FGVR tasks require the model to be able to capture subtle visual differences, such as the subtle differences between different bird species, aircraft models or vehicle types.
   - Existing methods have difficulty accurately representing these subtle features, especially when dealing with cases where the inter-class differences are small but the intra-class differences are large.

To solve these problems, the authors propose a new priority-perception self-supervised learning framework (PP-SSL). This framework improves the effect of fine-grained visual recognition through the following two key components:

- **Anti-Interference Strategy (AIS)**: Utilize the fine-grained text corpus and knowledge distillation strategy to guide the model to eliminate the interference of irrelevant features and enhance the learning of high-quality features.
- **Image-Aided Distinction Module (IADM)**: Extract GradCAM from the original image, focus on subtle category differences, reduce the impact of inter-class differences and improve intra-class consistency.

Through these improvements, PP-SSL can significantly improve the performance of fine-grained visual recognition tasks on multiple benchmark datasets, especially in retrieval and classification tasks.

### Formula presentation

The formulas involved in the paper are as follows:

1. **Contrastive learning loss function**:

   \[ L_{CL}(q, k) = -\log\frac{\exp(q\cdot k / \tau)}{\sum_{i = 1}^{K}\exp(q\cdot k_i / \tau)} \]

   where \( q \) and \( k \) form a positive sample pair, \( k_i \) is a negative sample, and \( \tau \) is a temperature parameter.

2. **Knowledge distillation loss function of AIS**:

   \[ L_{AIS}(l_t, l_s, \tau) = \tau^2\cdot KL(\sigma(l_t / \tau), \sigma(l_s / \tau)) \]

   where \( l_t \) and \( l_s \) are the predicted logits of the teacher model and the student model respectively, \( \sigma \) is the softmax function, and \( KL \) is the Kullback-Leibler divergence.

3. **Optimization objective of IADM**:

   \[ L_{IADM}(\text{Grad-Img}\,\|\,w) = \text{Grad-Img}\cdot\log\left(\frac{\text{Grad-Img}}{w}\right) \]

4. **Total loss function**:

   \[ L_{total} = L_{CL} + \alpha L_{AIS} + \beta L_{IADM} \]

   where \( \alpha = 1.2 \) and \( \beta = 0.01 \) are hyperparameters that control the weights of each loss term.

Together, these loss terms are intended to let the model filter out irrelevant features and extract subtle discriminative features during the training process.
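The four loss terms above can be sketched in plain Python to make the arithmetic concrete. This is a minimal illustration on small lists of floats, not the authors' implementation. All function names are hypothetical, and several details are assumptions: that \( KL(a, b) \) means \( KL(a \,\|\, b) \) with the teacher distribution first, that Grad-Img and \( w \) are non-negative maps normalized to sum to 1 before the divergence is taken, and that the default temperature is only a placeholder. The contrastive term follows the formula as written, with the denominator summing over the \( K \) keys \( k_i \) only. MoCo-style InfoNCE also includes the positive pair in the denominator, which the optional `include_positive` flag reproduces.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def contrastive_loss(q, k, negatives, tau=0.2, include_positive=False):
    """L_CL as written: -log( exp(q.k/tau) / sum_i exp(q.k_i/tau) ).

    tau=0.2 is an illustrative default, not a value taken from the paper.
    """
    pos = dot(q, k) / tau
    terms = [dot(q, k_i) / tau for k_i in negatives]
    if include_positive:  # MoCo-style denominator (assumption, not in the summary)
        terms = [pos] + terms
    m = max(terms)
    log_denom = m + math.log(sum(math.exp(t - m) for t in terms))  # log-sum-exp
    return -(pos - log_denom)


def ais_loss(teacher_logits, student_logits, tau):
    """L_AIS = tau^2 * KL(softmax(l_t / tau) || softmax(l_s / tau))."""
    p_teacher = softmax(teacher_logits, tau)
    p_student = softmax(student_logits, tau)
    return tau ** 2 * kl_divergence(p_teacher, p_student)


def normalize(values):
    """Scale non-negative values so they sum to 1 (assumes a positive sum)."""
    total = sum(values)
    return [v / total for v in values]


def iadm_loss(grad_img, w):
    """L_IADM read as a KL divergence between two flattened, normalized maps."""
    return kl_divergence(normalize(grad_img), normalize(w))


def total_loss(l_cl, l_ais, l_iadm, alpha=1.2, beta=0.01):
    """L_total = L_CL + alpha * L_AIS + beta * L_IADM, with the reported weights."""
    return l_cl + alpha * l_ais + beta * l_iadm


if __name__ == "__main__":
    q, k = [1.0, 0.0], [1.0, 0.0]
    negatives = [[0.0, 1.0], [0.6, 0.8]]
    l_cl = contrastive_loss(q, k, negatives, include_positive=True)
    l_ais = ais_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5], tau=4.0)
    l_iadm = iadm_loss([0.1, 0.7, 0.2], [0.2, 0.5, 0.3])
    print(total_loss(l_cl, l_ais, l_iadm))
```

With \( \beta = 0.01 \), the IADM term contributes two orders of magnitude less per unit of loss than the AIS term, so in this weighting it acts as a light regularizer on top of the contrastive and distillation objectives.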

PP-SSL : Priority-Perception Self-Supervised Learning for Fine-Grained Recognition
