Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

Yiheng Chen, Jinbai Zou, Lihai Liu, Chuanbo Hu
DOI: https://doi.org/10.3390/sym16030273
2024-02-27
Symmetry
Abstract: The problems of imbalanced datasets are generally considered asymmetric issues. In asymmetric problems, artificial intelligence models may exhibit different biases or preferences when dealing with different classes. In the process of addressing class imbalance learning problems, the classification model will pay too much attention to the majority class samples and cannot guarantee the classification performance of the minority class samples, which might be more valuable. By synthesizing the minority class samples and changing the data distribution, unbalanced datasets can be optimized. Traditional oversampling algorithms have problems of blindness and boundary ambiguity when synthesizing new samples. A modified reclassification algorithm based on Gaussian distribution is put forward. First, the minority class samples are reclassified by the KNN algorithm. Then, different synthesis strategies are selected according to the combination of the minority class samples, and the Gaussian distribution is used to replace the uniform random distribution for interpolation operation under certain classification conditions to reduce the possibility of generating noise samples. The experimental results indicate that the proposed oversampling algorithm can achieve a performance improvement of 2∼8% in evaluation metrics, including G-mean, F-measure, and AUC, compared to traditional oversampling algorithms.
multidisciplinary sciences
What problem does this paper attempt to address?
This paper addresses the blindness and boundary ambiguity that traditional oversampling algorithms exhibit when synthesizing new samples for imbalanced datasets. Specifically:

1. **Blindness**: When generating new minority-class samples, traditional oversampling algorithms do not fully consider the distribution characteristics of minority-class samples and may generate noise samples. These samples provide no useful information and may interfere with the performance of the classifier.
2. **Boundary ambiguity**: Traditional oversampling algorithms ignore the boundary characteristics in the sample set, resulting in the generation of too many new samples in areas far from the class boundaries. This may significantly affect the performance of the classifier and blur the boundaries between the two classes.

To solve these problems, the paper proposes GI-SMOTE (Gaussian Interpolation SMOTE), an improved oversampling algorithm based on K-nearest neighbors (KNN) and Gaussian distribution interpolation optimization. The algorithm addresses the deficiencies of traditional oversampling algorithms in two stages:

1. **Classification stage**:
   - Use the KNN algorithm to reclassify minority-class samples, dividing them into three categories ("noise", "dangerous", and "safe") according to the neighborhood type of each minority-class sample in the overall dataset.
   - "Noise" samples are completely surrounded by majority-class samples and are considered invalid; "safe" samples are mainly surrounded by minority-class samples and are considered valid; "dangerous" samples are surrounded by both minority-class and majority-class samples and require special treatment.
2. **Data synthesis stage**:
   - "Noise" samples are not used; only "dangerous" and "safe" samples take part in interpolation.
   - For "dangerous" samples, different synthesis strategies are selected according to the type of their nearest neighbor sample:
     - If the nearest neighbor is a "safe" sample, a Gaussian distribution is used to generate new samples, emphasizing the attribute information of the "dangerous" sample.
     - If the nearest neighbor is also a "dangerous" sample, two Gaussian distributions with different means are used to generate new samples, reducing the interference of the generated samples with classification performance.

Through these improvements, the GI-SMOTE algorithm can more accurately preserve the characteristics of minority-class samples when generating new samples while avoiding noise samples and blurred class boundaries. Experimental results show that, compared with the traditional SMOTE, Borderline-SMOTE, and ADASYN algorithms, GI-SMOTE achieves better classification performance on multiple imbalanced datasets, especially in evaluation metrics such as G-mean, F-measure, and AUC.
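The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`label_minority`, `synthesize`), the thresholds separating "dangerous" from "safe" (borrowed from the usual Borderline-SMOTE convention), and the Gaussian parameters `mean` and `std` are all assumptions made for illustration; the summary does not state the paper's exact values. Only the single-Gaussian case is shown, and the two-Gaussian strategy for "dangerous"/"dangerous" pairs is omitted.

```python
import numpy as np


def label_minority(X, y, minority=1, k=5):
    """Label each minority sample by the class mix of its k nearest neighbors.

    Assumed thresholds (Borderline-SMOTE style, not confirmed by the source):
      all k neighbors are majority      -> "noise"
      at least half are majority        -> "dangerous"
      fewer than half are majority      -> "safe"
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = {}
    for i in np.flatnonzero(y == minority):
        dist = np.linalg.norm(X - X[i], axis=1)  # brute-force Euclidean KNN
        dist[i] = np.inf                          # exclude the sample itself
        nearest = np.argsort(dist)[:k]
        n_major = int(np.sum(y[nearest] != minority))
        if n_major == k:
            labels[int(i)] = "noise"
        elif n_major >= k / 2:
            labels[int(i)] = "dangerous"
        else:
            labels[int(i)] = "safe"
    return labels


def synthesize(x, neighbor, rng, mean=0.5, std=0.15):
    """Interpolate between x and neighbor with a Gaussian coefficient.

    Classic SMOTE draws the coefficient uniformly from [0, 1]; here it is
    drawn from N(mean, std) and clipped to [0, 1]. The defaults for mean
    and std are hypothetical placeholders.
    """
    x = np.asarray(x, dtype=float)
    neighbor = np.asarray(neighbor, dtype=float)
    gap = float(np.clip(rng.normal(mean, std), 0.0, 1.0))
    return x + gap * (neighbor - x)


if __name__ == "__main__":
    # Toy data: four majority points, one isolated minority point among them,
    # one minority point near the boundary, and a compact minority cluster.
    X = [[0, 0], [1, 0], [2, 0], [3, 0], [1.5, 0],
         [5, 0], [10, 0], [11, 0], [12, 0], [13, 0]]
    y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    labels = label_minority(X, y, minority=1, k=3)
    rng = np.random.default_rng(0)
    for i, kind in labels.items():
        if kind == "noise":
            continue  # noise samples are never used as seeds
        print(i, kind, synthesize(X[i], X[6], rng))
```

Clipping the coefficient keeps every synthetic point on the segment between the seed and its neighbor, as in SMOTE. Moving `mean` toward 0 concentrates new points near the seed sample, which is one plausible reading of "emphasizing the attribute information" of a "dangerous" sample.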