An accurate detection is not all you need to combat label noise in web-noisy datasets

Paul Albert, Jack Valmadre, Eric Arazo, Tarun Krishna, Noel E. O'Connor, Kevin McGuinness
2024-07-08
Abstract: Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean example that is valuable for supervised learning. These examples are often visually simple images, which are relatively easy to identify as clean using standard loss- or distance-based methods despite being poorly separated from the OOD distribution using unsupervised learning. Because we further observe that the linear-separation detection correlates only weakly with state-of-the-art (SOTA) noise-detection metrics, we propose a hybrid solution that alternates between noise detection using linear separation and a SOTA small-loss approach. When combined with the SOTA algorithm PLS, we substantially improve SOTA results for real-world image classification in the presence of web noise (http://github.com/PaulAlbert31/LSA).
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses label noise in datasets crawled from the web, particularly in image classification tasks. Specifically, it focuses on making classifiers more robust when the training set contains a large number of out-of-distribution (OOD) noisy samples. The main contributions of the paper are as follows:

1. **A new noise detection method**: The paper extends the work of SNCF by explicitly estimating the linear separation between in-distribution (ID) samples and OOD samples to improve the detection of OOD samples. This method performs well on real-world web noise datasets and is only weakly correlated with existing small-loss and distance-based methods.
2. **The gap between noise retrieval performance and classification accuracy**: The paper finds that a noise detection method can retrieve noisy samples accurately and still fail to improve classification accuracy.
3. **The Linear Separation Alternating (LSA) strategy**: The paper combines linear separation with a weakly correlated state-of-the-art noise detection method, alternating between the two to improve noise detection. This strategy significantly improves the performance of existing noise-robust algorithms on various classification tasks.
4. **Experiments and ablation studies**: These include a voting co-training strategy (PLS-LSA+) and validate the proposed algorithm on controlled and real-world web noise datasets.

In summary, the paper aims to improve classification performance on datasets with label noise by combining different types of noise detection methods.
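
The alternation between a linear-separation detector and a small-loss detector can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the use of logistic regression to estimate the hyperplane, the availability of seed ID/OOD labels, the quantile-based small-loss rule (a stand-in, not PLS), and the switch on epoch parity are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_separator(features, seed_is_id, lr=0.5, steps=500):
    """Fit a hyperplane (w, b) between ID and OOD features with logistic regression.

    Assumption: some seed ID/OOD assignment is available to fit against.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(seed_is_id, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(sample is ID)
        grad = p - y                            # gradient of the log-loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def linear_separation_mask(features, w, b):
    """True where a sample falls on the ID side of the hyperplane."""
    return np.asarray(features, dtype=float) @ w + b > 0.0

def small_loss_mask(losses, keep_fraction=0.5):
    """True for samples whose loss is at or below the keep_fraction quantile."""
    losses = np.asarray(losses, dtype=float)
    return losses <= np.quantile(losses, keep_fraction)

def alternating_clean_mask(epoch, features, losses, w, b, keep_fraction=0.5):
    """Hypothetical schedule: even epochs use the hyperplane, odd epochs use small loss."""
    if epoch % 2 == 0:
        return linear_separation_mask(features, w, b)
    return small_loss_mask(losses, keep_fraction)

# Toy data: two well-separated feature clusters and matching per-sample losses.
rng = np.random.default_rng(0)
id_feats = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
ood_feats = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
features = np.vstack([id_feats, ood_feats])
is_id = np.concatenate([np.ones(50, dtype=bool), np.zeros(50, dtype=bool)])
losses = np.where(is_id, 0.1, 2.0) + rng.uniform(0.0, 0.05, size=100)

w, b = fit_linear_separator(features, is_id)
mask_even = alternating_clean_mask(0, features, losses, w, b)  # hyperplane detector
mask_odd = alternating_clean_mask(1, features, losses, w, b)   # small-loss detector
```

On this toy data both detectors agree, because the clusters and losses are cleanly separated. The setting the paper describes is the one where they disagree: visually simple clean images that sit close to the OOD features can be rejected by the hyperplane yet kept by the loss-based rule, which is the motivation for alternating between the two rather than relying on either alone.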