The fuzzy support vector data description based on tightness for noisy label detection

Xiaoying Wu, Sanyang Liu, Yiguang Bai
DOI: https://doi.org/10.1007/s40747-024-01356-9
IF: 6.7
2024-03-04
Complex & Intelligent Systems
Abstract: Machine learning (ML) is a data-driven approach, and as research in the field progresses, the issue of noisy labels has garnered widespread attention. Noisy labels can significantly reduce the accuracy of supervised classification models, so detecting as many noisy labels as possible in large datasets is an important task. In this study, a new method is proposed for detecting noisy labels in datasets. The method first uses a deep pre-trained network to extract a feature set from the image data, which yields more accurate data features. Then, a membership degree based on tightness is introduced into the support vector data description (SVDD) model, giving a model named TF-SVDD, to detect noisy data in the dataset. To simulate different types of label noise more accurately, the labels of the datasets used are first assumed to be all correct, and noise sets are then constructed using two methods: the density peak noise set and the random noise set. Experimental results demonstrate that TF-SVDD can effectively detect noisy label data, surpassing traditional support vector data description algorithms and other methods in outlier detection accuracy, with the average accuracy mostly exceeding 50% and even reaching 80%. Furthermore, a novel measure called ‘confidence’ is employed to rectify noisy labels in the data. After the noisy labels are corrected, image classification accuracy improves significantly, with the average promotion ratio mostly exceeding 10% and reaching 30%.
computer science, artificial intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address the issue of noisy labels in machine learning. Specifically, noisy labels can significantly reduce the accuracy of supervised classification models, so detecting and correcting these noisy labels is crucial for improving model performance. The paper proposes a Tightness-based Fuzzy Support Vector Data Description (TF-SVDD) method for detecting noisy labels from large datasets.

### Main Contributions

1. **Construction of Initial Noise Set**:
   - Using the traditional density peak clustering algorithm to construct the initial noise set.
2. **Tightness-based Fuzzy SVDD Model**:
   - Introducing a new method to more accurately distinguish noisy samples through a tightness-based fuzzy SVDD model.
3. **New Confidence Metric**:
   - Proposing a new confidence metric to correct noisy labels.

### Method Overview

1. **Feature Extraction**:
   - Using a pre-trained ResNet-18 network to extract features from image data.
2. **Generation of Initial Noise Set**:
   - Constructing the noise set using two methods: random selection and density peak algorithm.
3. **Fuzzy Membership Function**:
   - Designing a tightness-based fuzzy membership function that considers the distance between samples and class centers as well as the compactness of intra-class samples.
4. **TF-SVDD Model**:
   - Integrating the fuzzy membership function into the SVDD model and optimizing the objective function to detect noisy labels.
5. **Noise Label Correction**:
   - Using the confidence metric to correct detected noisy labels and evaluating classification accuracy through SVM.

### Experimental Results

The paper conducted experiments on three color image datasets (cats and dogs, fruits, utensils) with 20%, 40%, and 60% random noise and density noise added. The experimental results show that the TF-SVDD method outperforms traditional SVDD and other methods in both noisy label detection and classification accuracy.
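
The random-noise setting used in the experiments (20%, 40%, and 60% of labels corrupted) can be illustrated with a minimal sketch. This is not the paper's code: the function name `inject_random_label_noise`, its parameters, and the rule of flipping each selected label to a different class chosen uniformly at random are all assumptions made for illustration. The density peak noise set is not covered here.

```python
import numpy as np


def inject_random_label_noise(labels, noise_ratio, num_classes, seed=0):
    """Flip a fixed fraction of labels to a different class (hypothetical helper).

    Returns the corrupted label array and the sorted indices that were flipped,
    so a detector's output can later be compared against the known noise set.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()

    # Choose which samples receive a wrong label, without repetition.
    n_noisy = int(round(noise_ratio * len(labels)))
    noisy_idx = rng.choice(len(labels), size=n_noisy, replace=False)

    # Adding an offset in [1, num_classes - 1] modulo num_classes guarantees
    # that every selected label actually changes.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    noisy[noisy_idx] = (labels[noisy_idx] + offsets) % num_classes

    return noisy, np.sort(noisy_idx)
```

Calling the function with `noise_ratio` set to 0.2, 0.4, or 0.6 on an array of integer class labels reproduces the three noise levels. Because the flipped indices are returned, detection accuracy can be measured against a known ground truth, which matches the summary's premise that the original labels are assumed correct.
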
### Conclusion

The TF-SVDD method proposed in this study can effectively detect and correct noisy labels, significantly improving the accuracy of image classification. By introducing a tightness-based fuzzy membership function and confidence metric, this method excels in handling the noisy label problem.
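
For readers unfamiliar with how a fuzzy membership enters SVDD, the generic membership-weighted form of the SVDD objective is sketched below. The symbols are assumptions, since none of them appear in this summary: $R$ is the sphere radius, $a$ its center, $\xi_i$ the slack variables, $C$ the trade-off parameter, $\phi$ the kernel feature map, and $s_i \in (0, 1]$ the membership degree of sample $x_i$. The paper's exact objective and its tightness-based definition of $s_i$ may differ from this form.

```latex
\begin{aligned}
\min_{R,\,a,\,\xi}\quad & R^2 + C \sum_{i=1}^{n} s_i\,\xi_i \\
\text{s.t.}\quad & \lVert \phi(x_i) - a \rVert^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}
```

In this form, a sample with a small $s_i$ incurs a smaller penalty for lying outside the sphere, so likely-noisy samples have less influence on the boundary and are more readily left outside it.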