An accurate detection is not all you need to combat label noise in web-noisy datasets

Paul Albert, Jack Valmadre, Eric Arazo, Tarun Krishna, Noel E. O'Connor, Kevin McGuinness

2024-07-08

Abstract: Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean example that is valuable for supervised learning. These examples often represent visually simple images, which are relatively easy to identify as clean using standard loss- or distance-based methods despite being poorly separated from the OOD distribution using unsupervised learning. Because we further observe a low correlation with state-of-the-art (SOTA) metrics, we propose a hybrid solution that alternates between noise detection using linear separation and a SOTA small-loss approach. When combined with the SOTA algorithm PLS, we substantially improve SOTA results for real-world image classification in the presence of web noise. Code: http://github.com/PaulAlbert31/LSA

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses label noise in datasets crawled from the web, particularly in image classification tasks. Specifically, it focuses on keeping classifiers robust when a large share of the training samples are out-of-distribution (OOD) noise. The main contributions of the paper are as follows:

1. **A new noise detection method**: Extends SNCF by explicitly estimating the linear separation between in-distribution (ID) and OOD samples to improve OOD detection. The method performs well on real-world web-noisy datasets and is only weakly correlated with existing small-loss and distance-based methods.
2. **An analysis of the gap between noise retrieval and classification accuracy**: Shows that a noise detection method can retrieve noisy samples accurately and still fail to improve classification accuracy.
3. **The Linear Separation Alternating (LSA) strategy**: Combines linear separation with an uncorrelated state-of-the-art noise detection method and alternates between the two. This strategy improves the performance of existing noise-robust algorithms across several classification tasks.
4. **Experiments and ablation studies**: Includes a voting co-training strategy (PLS-LSA+) and validates the proposed algorithm on controlled and real-world web-noisy datasets.

In summary, the paper aims to improve classification performance on datasets with label noise by combining different types of noise detection methods.
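
The alternation described in contribution 3 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of plain gradient-descent logistic regression to estimate the separating hyperplane, the seed ID/OOD labels, the 50% small-loss keep ratio, and the even/odd epoch schedule are all assumptions made for the example.

```python
import numpy as np


def fit_linear_separator(features, ood_seed, epochs=200, lr=0.1):
    """Estimate a hyperplane (w, b) with logistic regression.

    `features` are assumed to be precomputed unsupervised (contrastive)
    embeddings; `ood_seed` is an assumed initial 0/1 guess of which
    samples are OOD (1) or ID (0).
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - ood_seed          # gradient of the log-loss w.r.t. logits
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def detect_linear(features, w, b):
    """Flag samples on the OOD side of the hyperplane (True = noisy)."""
    return features @ w + b > 0


def detect_small_loss(losses, keep_ratio=0.5):
    """Flag samples whose training loss exceeds the keep_ratio quantile."""
    threshold = np.quantile(losses, keep_ratio)
    return losses > threshold


def noisy_mask_for_epoch(epoch, features, w, b, losses):
    """Alternate detectors: linear separation on even epochs,
    small-loss on odd epochs (hypothetical schedule)."""
    if epoch % 2 == 0:
        return detect_linear(features, w, b)
    return detect_small_loss(losses)
```

In a training loop, the mask returned by `noisy_mask_for_epoch` would decide which samples are treated as clean in that epoch, so that visually simple clean images missed by the linear detector can still be recovered on the small-loss epochs.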


Prototype-Based Supervised Contrastive Learning Method for Noisy Label Correction in Tire Defect Detection

A Label Noise Robust Stacked Auto-Encoder Algorithm for Inaccurate Supervised Classification Problems

Embedding contrastive unsupervised features to cluster in- and out-of-distribution noise in corrupted image datasets

A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?

Noise-Aware Fully Webly Supervised Object Detection.

Learning With Non-Uniform Label Noise: A Cluster-Dependent Weakly Supervised Approach.

Exploiting Web Images for Fine-Grained Visual Recognition by Eliminating Open-Set Noise and Utilizing Hard Examples

Combating Label Noise With A General Surrogate Model For Sample Selection

OT Cleaner: Label Correction As Optimal Transport

Towards Noise-resistant Object Detection with Noisy Annotations

Training CNN Classifiers Solely on Webly Data

Learning Sound Event Classifiers from Web Audio with Noisy Labels

Noisy Label Processing for Classification: A Survey

On Better Detecting and Leveraging Noisy Samples for Learning with Severe Label Noise

Robust Image Classification with Noisy Labels by Negative Learning and Feature Space Renormalization

HCL: Hierarchical Consistency Learning for Webly Supervised Fine-Grained Recognition

Model-agnostic Approaches to Handling Noisy Labels When Training Sound Event Classifiers

Decoding class dynamics in learning with noisy labels

LDAAD: an Effective Label De-noising Approach for Anomaly Detection

Web Image Annotation Based On Automatically Obtained Noisy Training Set