Pairwise Similarity Distribution Clustering for Noisy Label Learning

Sihan Bai
2024-04-02
Abstract: Noisy label learning aims to train deep neural networks using a large number of samples with noisy labels, whose main challenge comes from how to deal with the inaccurate supervision caused by wrong labels. Existing works take either the label correction or the sample selection paradigm to involve more samples with accurate labels in the training process. In this paper, we propose a simple yet effective sample selection algorithm, termed Pairwise Similarity Distribution Clustering (PSDC), to divide the training samples into one clean set and another noisy set, which can power any of the off-the-shelf semi-supervised learning regimes to further train networks for different downstream tasks. Specifically, we take the pairwise similarity between sample pairs to represent the sample structure, and the Gaussian Mixture Model (GMM) to model the similarity distribution between sample pairs belonging to the same noisy cluster, so that each sample can be confidently assigned to the clean set or the noisy set. Even under severe label noise rates, the resulting data partition mechanism has been proved to be more robust in judging the label confidence in both theory and practice. Experimental results on various benchmark datasets, such as CIFAR-10, CIFAR-100 and Clothing1M, demonstrate significant improvements over state-of-the-art methods.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper primarily addresses the issue of how to train deep neural networks in the presence of noisy labels. Specifically:

1. **Objective of Noisy Label Learning**:
   - Utilize a large amount of data with noisy labels to train deep neural networks.
   - The main challenge is how to handle inaccurate supervision caused by incorrect labels.
2. **Problems with Existing Methods**:
   - Existing methods adopt either label correction or sample selection to include more samples with accurate labels in the training process.
   - These methods still struggle to improve supervision quality when the label noise rate is severe.
3. **Proposed Method**:
   - A simple and effective sample selection algorithm called **Pairwise Similarity Distribution Clustering (PSDC)** is proposed.
   - The training samples are divided into a clean set and a noisy set; this partition can then feed any off-the-shelf semi-supervised learning method to train networks for different downstream tasks.
   - The sample structure is represented by the pairwise similarity between sample pairs, and a Gaussian Mixture Model (GMM) is used to model the similarity distribution between sample pairs belonging to the same noisy cluster.
   - According to the abstract, this data partitioning mechanism is more robust in judging label confidence, both in theory and in practice, even under severe label noise rates.
4. **Main Contributions**:
   - A new PSDC method is proposed to improve the accuracy of data partitioning through pairwise sample structure and a Gaussian Mixture Model.
   - A theoretical analysis of Jensen-Shannon divergence, the cross-entropy criterion, and the Gaussian Mixture Model is provided, demonstrating the method's broad noise tolerance range.
   - Extensive experiments on CIFAR-10, CIFAR-100, and Clothing1M were conducted, achieving state-of-the-art results.
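
To make the clean/noisy partition idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not give the exact procedure, so several choices are assumptions made for illustration. In particular, the function names (`cosine`, `fit_gmm_1d`, `partition`), the use of cosine similarity, the reduction of pairwise similarities to one mean score per sample, and the 0.5 posterior threshold are all hypothetical. The sketch groups samples by their noisy label, scores each sample by its mean similarity to the others in the group, fits a two-component 1D Gaussian mixture to those scores with EM, and treats the higher-mean component as clean.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def fit_gmm_1d(xs, iters=50):
    """Fit a two-component 1D Gaussian mixture with EM.

    Returns, for each x, the posterior of the component with the higher mean.
    """
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]                          # initialise means at the extremes
    spread = max((hi - lo) ** 2 / 4, 1e-6)
    var = [spread, spread]
    pi = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in xs]
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        for i, x in enumerate(xs):
            p = [
                pi[k]
                * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = p[0] + p[1]
            resp[i] = [p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5]
        # M-step: update weights, means, and (floored) variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6
            )
            pi[k] = nk / len(xs)
    high = 0 if mu[0] > mu[1] else 1
    return [r[high] for r in resp]


def partition(features, noisy_labels, threshold=0.5):
    """Split sample indices into (clean, noisy) lists, one noisy-label group at a time."""
    by_label = {}
    for idx, y in enumerate(noisy_labels):
        by_label.setdefault(y, []).append(idx)
    clean, noisy = [], []
    for members in by_label.values():
        if len(members) < 3:
            noisy.extend(members)  # too few samples to model; assumption: treat as noisy
            continue
        # Per-sample score: mean similarity to the other samples with the same noisy label
        scores = [
            sum(cosine(features[i], features[j]) for j in members if j != i)
            / (len(members) - 1)
            for i in members
        ]
        for i, p in zip(members, fit_gmm_1d(scores)):
            (clean if p > threshold else noisy).append(i)
    return sorted(clean), sorted(noisy)


# Toy data: two classes in 2D, with two mislabeled samples in each class.
features, labels = [], []
for k in range(8):
    features.append([1.0, 0.05 * k]); labels.append(0)   # indices 0-7, truly class 0
for k in range(2):
    features.append([0.05 * k, 1.0]); labels.append(0)   # indices 8-9, mislabeled
for k in range(8):
    features.append([0.05 * k, 1.0]); labels.append(1)   # indices 10-17, truly class 1
for k in range(2):
    features.append([1.0, 0.05 * k]); labels.append(1)   # indices 18-19, mislabeled

clean, noisy = partition(features, labels)
print("clean:", clean)
print("noisy:", noisy)
```

In a real pipeline the feature vectors would come from a network's embedding layer rather than hand-built 2D points, and the resulting clean set would be passed to a semi-supervised learner as labeled data with the noisy set as unlabeled data. Note that this sketch always fits two components, so it will split a group even when that group contains no mislabeled samples.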