Abstract: Deep neural networks have demonstrated remarkable advancements in various fields using large, well-annotated datasets. However, real-world data often exhibit long-tailed distributions and label noise, significantly degrading generalization performance. Recent studies addressing these issues have focused on noisy sample selection methods that estimate the centroid of each class based on high-confidence samples within each target class. The performance of these methods is limited because they use only the training samples within each class for class centroid estimation, making the quality of centroids susceptible to long-tailed distributions and noisy labels. In this study, we present a robust training framework called Distribution-aware Sample Selection and Contrastive Learning (DaSC). Specifically, DaSC introduces a Distribution-aware Class Centroid Estimation (DaCC) to generate enhanced class centroids. DaCC performs weighted averaging of the features from all samples, with weights determined based on model predictions. Additionally, we propose a confidence-aware contrastive learning strategy to obtain balanced and robust representations. The training samples are categorized into high-confidence and low-confidence samples. Our method then applies Semi-supervised Balanced Contrastive Loss (SBCL) using high-confidence samples, leveraging reliable label information to mitigate class bias. For the low-confidence samples, our method computes Mixup-enhanced Instance Discrimination Loss (MIDL) to improve their representations in a self-supervised manner. Our experimental results on CIFAR and real-world noisy-label datasets demonstrate the superior performance of the proposed DaSC compared to previous approaches.
### What problem does this paper attempt to solve?
This paper addresses the significant degradation in the generalization performance of deep neural networks when the training data are both long-tailed and noisily labeled. Specifically:
1. **Long-tailed distribution**: Real-world data usually follow a long-tailed distribution, in which a few categories contain far more samples than the rest. This imbalance leads to poor prediction performance on minority categories.
2. **Noisy labels**: Human error during annotation or the limitations of automated annotation tools can leave a large number of incorrect labels in the training data. These noisy labels interfere with learning and reduce model performance.
Although existing methods address these problems to some extent, they have the following deficiencies:
- **Only using high-confidence samples**: Existing methods estimate the center (centroid) of each category from only the high-confidence samples within that category. This ignores the information carried by the remaining samples and makes the class center estimates inaccurate.
- **Uniform weighting**: When calculating the class center, existing methods assign the same weight to all samples, which may magnify the influence of mislabeled samples in the presence of label noise.
- **Lack of representation enhancement**: Existing methods do not actively improve the quality of feature representations, which matters most for the tail categories.
For this reason, the paper proposes a new training framework, Distribution-aware Sample Selection and Contrastive Learning (DaSC), to address these problems. The main contributions of DaSC include:
1. **Distribution-aware Class Centroid Estimation (DaCC)**: Using temperature scaling, DaCC weights each sample by the model's predicted probability and averages the features accordingly to estimate the class centers. This approach uses all samples and improves the reliability of the class center estimates.
2. **Confidence-aware Contrastive Learning**: The samples are divided into a high-confidence group and a low-confidence group, and a different contrastive loss is applied to each (Semi-supervised Balanced Contrastive Loss, SBCL, and Mixup-enhanced Instance Discrimination Loss, MIDL) to obtain balanced and robust feature representations. SBCL uses reliable label information to alleviate category bias, while MIDL augments the instance discrimination loss with mixup to improve the representations of low-confidence samples in a self-supervised manner.
3. **Experimental results**: Experiments on CIFAR and real-world noisy-label datasets show that DaSC outperforms existing methods in multiple configurations, especially in scenarios where long-tailed distributions and noisy labels co-exist.
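The weighted averaging described for DaCC in item 1 can be illustrated with a minimal NumPy sketch. The text above does not give the exact weighting formula, so the normalization over samples, the function names, and the default temperature are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the class axis (rows are samples)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def weighted_class_centroids(features, logits, temperature=0.5):
    """Estimate one centroid per class as a weighted average of ALL features.

    features: (N, D) array of sample embeddings.
    logits:   (N, C) array of model outputs.
    Assumption: the weight of sample i for class c is its temperature-scaled
    predicted probability, normalized over samples so weights sum to 1 per class.
    """
    probs = softmax(logits, temperature)                # (N, C)
    weights = probs / probs.sum(axis=0, keepdims=True)  # each column sums to 1
    return weights.T @ features                         # (C, D)

# Toy usage: two well-separated groups of 2-D features.
features = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
logits = np.array([[10.0, 0.0], [10.0, 0.0], [0.0, 10.0], [0.0, 10.0]])
centroids = weighted_class_centroids(features, logits, temperature=1.0)
```

In this toy case each centroid lands close to the mean of the samples the model confidently assigns to that class. A lower temperature sharpens the weights toward confident predictions, while a higher one moves every centroid toward the global feature mean.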
Through these improvements, DaSC copes with the challenges posed by long-tailed distributions and noisy labels and improves the generalization performance of the model.
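The grouping step of the confidence-aware strategy, and the mixup operation that MIDL builds on, can be sketched as follows. The text above does not specify how confidence is measured or how mixup enters the loss. The sketch assumes, purely for illustration, that confidence is the predicted probability of the given label compared against a fixed threshold. The names `split_by_confidence` and `mixup`, the threshold, and the Beta parameter `alpha` are hypothetical. The contrastive losses themselves are not shown.

```python
import numpy as np

def split_by_confidence(probs, labels, threshold=0.5):
    """Return boolean masks (high, low) over the samples.

    Assumption: a sample is high-confidence when the model's predicted
    probability for its given (possibly noisy) label reaches the threshold.
    """
    conf = probs[np.arange(len(labels)), labels]
    high = conf >= threshold
    return high, ~high

def mixup(x, alpha=1.0, rng=None):
    """Mix each input with a randomly chosen partner from the same batch.

    Returns the mixed batch, the partner permutation, and the coefficient lam,
    which is drawn once per batch from Beta(alpha, alpha).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1.0 - lam) * x[perm], perm, lam

# Toy usage: three samples, two classes.
probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([0, 0, 1])
high, low = split_by_confidence(probs, labels)

x = np.arange(6, dtype=float).reshape(3, 2)
mixed, perm, lam = mixup(x[low], rng=np.random.default_rng(0))
```

Under these assumptions, the samples selected by `high` would feed a label-aware loss such as SBCL, and the mixed low-confidence inputs would feed a self-supervised loss such as MIDL.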