CoLafier: Collaborative Noisy Label Purifier With Local Intrinsic Dimensionality Guidance

Dongyu Zhang, Ruofan Hu, Elke Rundensteiner
2024-01-10
Abstract: Deep neural networks (DNNs) have advanced many machine learning tasks, but their performance is often harmed by noisy labels in real-world data. Addressing this, we introduce CoLafier, a novel approach that uses Local Intrinsic Dimensionality (LID) for learning with noisy labels. CoLafier consists of two subnets: LID-dis and LID-gen. LID-dis is a specialized classifier. Trained with our uniquely crafted scheme, LID-dis consumes both a sample's features and its label to predict the label, which allows it to produce an enhanced internal representation. We observe that LID scores computed from this representation effectively distinguish between correct and incorrect labels across various noise scenarios. In contrast to LID-dis, LID-gen, functioning as a regular classifier, operates solely on the sample's features. During training, CoLafier utilizes two augmented views per instance to feed both subnets. CoLafier considers the LID scores from the two views as produced by LID-dis to assign weights in an adapted loss function for both subnets. Concurrently, LID-gen, serving as a classifier, suggests pseudo-labels. LID-dis then processes these pseudo-labels along with the two views to derive LID scores. Finally, these LID scores, along with the differences in predictions from the two subnets, guide the label update decisions. This dual-view and dual-subnet approach enhances the overall reliability of the framework. Upon completion of the training, we deploy the LID-gen subnet of CoLafier as the final classification model. CoLafier demonstrates improved prediction accuracy, surpassing existing methods, particularly under severe label noise. For more details, see the code at
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses accurate classification in the presence of noisy labels. Deep neural networks (DNNs) often encounter noisy labels in real-world data, and these labels can impair both the performance and the generalization ability of the model. The paper proposes CoLafier, a method that uses Local Intrinsic Dimensionality (LID) to learn from noisily labeled data.

### Problem Definition

The input is a training set $\tilde{D}=\{(x_i,\tilde{y}_i)\}_{i=1}^N$ that contains noisy labels, where each $\tilde{y}_i$ is a one-hot vector representing the noisy label of instance $x_i$. The goal is to train a robust classification model $f(x;\Theta)\to\hat{y}$ that accurately predicts the true label $y_i$ of each instance without prior knowledge of the quality or correctness of the labels.

### Challenges

1. **Lack of knowledge about noise proportion and pattern**: Without knowing the proportion and pattern of noisy labels in the dataset, it is difficult to develop a general method that collects enough clean labels to train a strong model.
2. **Accumulation of errors during training**: Mistakes made in early sample selection or label correction can accumulate, producing larger errors and pulling the model away from the intended result.

### Proposed Method

To address these challenges, the paper introduces the CoLafier framework, which uses LID scores to distinguish correctly labeled samples from incorrectly labeled ones. It has three components:

1. **LID-dis subnet**: A specialized classifier that takes both the features of a sample and its label as input. Through its training scheme, LID-dis produces an enhanced internal representation whose LID scores effectively separate correct labels from incorrect ones.
2. **LID-gen subnet**: A regular classification model that operates only on the features of the sample.
3. **Dual-view and dual-subnet design**: During training, CoLafier feeds two augmented views of each instance to both subnets. The LID scores that LID-dis produces for the two views determine the weight of each instance in the loss and guide the label update decisions. This design makes the overall framework more reliable.

### Main Contributions

1. **Novel use of LID scores**: The LID-dis subnet processes the features and labels of samples together, generates an enhanced representation, and effectively distinguishes correct from incorrect labels under different noise conditions.
2. **The CoLafier framework**: The framework combines the LID-dis and LID-gen subnets, uses the LID scores from the two augmented views to weight the loss function, and guides label update decisions using the LID scores and the differences between the two subnets' predictions.
3. **Experimental evidence of effectiveness**: Without prior knowledge of the noise characteristics, CoLafier outperforms existing methods under various noise conditions, particularly under severe label noise.
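
The summary relies on LID scores throughout but does not state how they are computed. As an illustration only, a widely used choice is the maximum-likelihood estimator $\widehat{\mathrm{LID}}(x) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i(x)}{r_k(x)}\right)^{-1}$, where $r_i(x)$ is the distance from $x$ to its $i$-th nearest neighbor. The sketch below applies it to a batch of representation vectors. The function name `lid_mle`, the neighbor count `k`, the Euclidean metric, and the choice to draw neighbors from within the batch are all assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np


def lid_mle(batch: np.ndarray, k: int = 20) -> np.ndarray:
    """Estimate one LID score per row of `batch` (shape: n_samples x n_dims).

    Hypothetical helper: uses the standard maximum-likelihood LID estimator
    with Euclidean distances to the k nearest neighbours inside the batch.
    """
    n = batch.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of rows in batch")

    # Pairwise Euclidean distances between all rows, shape (n, n).
    diffs = batch[:, None, :] - batch[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    # After sorting each row, column 0 is the self-distance (zero),
    # so columns 1..k hold the distances r_1 <= ... <= r_k.
    knn = np.sort(dists, axis=1)[:, 1:k + 1]

    # log(r_i / r_k) is <= 0 for every neighbour; eps guards against log(0)
    # and against a zero denominator when all k distances are equal.
    eps = 1e-12
    log_ratios = np.log((knn + eps) / (knn[:, -1:] + eps))
    return -1.0 / (log_ratios.mean(axis=1) - eps)
```

Calling `lid_mle(representations, k=20)` on an array of shape `(batch_size, dim)` returns one positive score per sample. Points lying near a low-dimensional manifold tend to receive lower scores than points that spread across many directions, which is the property that makes LID usable as a signal for separating samples.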