Imbalanced Data Classification Based on Improved Random-SMOTE and Feature Standard Deviation

Ying Zhang, Li Deng, Bo Wei
DOI: https://doi.org/10.3390/math12111709
IF: 2.4
2024-05-31
Mathematics
Abstract: Oversampling techniques are widely used to rebalance imbalanced datasets. However, most oversampling methods may introduce noise and fuzzy class boundaries, leading to overfitting. To solve this problem, we propose a new method (FSDR-SMOTE) based on Random-SMOTE and Feature Standard Deviation for rebalancing imbalanced datasets. The method first removes noisy samples based on the Tukey criterion, then calculates the feature standard deviation, which reflects the degree of data discretization, to detect sample location, and classifies the samples into boundary samples and safety samples. Next, the K-means clustering algorithm is employed to partition the minority class samples into several sub-clusters. Within each sub-cluster, new samples are generated based on random samples, boundary samples, and the corresponding sub-cluster center. The experimental results show that FSDR-SMOTE achieves average values of 93.31% for the F-measure, 93.16% for the G-mean, and 86.53% for MCC on the 20 benchmark datasets selected from the UCI Machine Learning Repository.
What problem does this paper attempt to address?
This paper addresses a weakness of existing oversampling methods for imbalanced data classification: they may introduce noise and blur class boundaries, which leads to overfitting. Traditional classification algorithms usually assume that the classes are roughly balanced during training, so they perform poorly on imbalanced data. Because the majority class contributes many more samples, the model mostly learns majority-class characteristics and becomes biased toward that class at prediction time. As a result, minority-class samples are recognized less reliably or ignored entirely, which causes misclassification in practical applications.

To overcome these problems, the authors propose a new method (FSDR-SMOTE), which rebalances imbalanced datasets based on an improved Random-SMOTE and the feature standard deviation. The main steps of FSDR-SMOTE are:

1. **Data preprocessing**: Use the Tukey criterion to remove noisy samples, and use the K-means clustering algorithm to cluster the minority-class samples.
2. **Boundary sample screening**: Detect sample positions by calculating the feature standard deviation, and divide the minority-class samples into boundary samples and safe samples.
3. **New sample synthesis**: Generate new samples within each sub-cluster based on random samples, boundary samples, and the corresponding sub-cluster center.

Experimental results show that FSDR-SMOTE performs better than other oversampling methods on multiple benchmark datasets, with notable improvements in the F-measure, G-mean, and MCC metrics.
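The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not give the exact formulas, so the boundary rule (a sample counts as boundary when any feature deviates from its sub-cluster center by more than that feature's standard deviation), the interpolation scheme, the small NumPy-only K-means, and all function names (`tukey_filter`, `kmeans`, `synthesize`) are assumptions made for illustration only.

```python
import numpy as np

def tukey_filter(X, k=1.5):
    """Keep rows whose every feature lies inside the Tukey fences
    [Q1 - k*IQR, Q3 + k*IQR]; rows outside are treated as noise."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    inside = (X >= q1 - k * iqr) & (X <= q3 + k * iqr)
    return X[np.all(inside, axis=1)]

def kmeans(X, n_clusters, rng, n_iter=20):
    """Plain Lloyd's K-means (illustrative stand-in for a library version)."""
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def synthesize(X_min, n_new, n_clusters=2, seed=0):
    """Generate n_new synthetic minority samples (hypothetical scheme)."""
    rng = np.random.default_rng(seed)
    X = tukey_filter(np.asarray(X_min, dtype=float))  # step 1: denoise
    labels, centers = kmeans(X, n_clusters, rng)      # step 1: sub-clusters
    valid = [j for j in range(n_clusters) if np.sum(labels == j) >= 2]
    if not valid:
        raise ValueError("no sub-cluster has at least two samples")
    new = []
    for _ in range(n_new):
        j = valid[rng.integers(len(valid))]
        members = X[labels == j]
        # Step 2 (assumed rule): boundary if any feature is farther from
        # the sub-cluster center than that feature's standard deviation.
        std = members.std(axis=0)
        boundary = np.any(np.abs(members - centers[j]) > std, axis=1)
        pool = members[boundary] if boundary.any() else members
        # Step 3 (assumed scheme): interpolate a random member toward the
        # center, then interpolate a boundary sample toward that point.
        seed_pt = pool[rng.integers(len(pool))]
        rand_pt = members[rng.integers(len(members))]
        tmp = rand_pt + rng.random() * (centers[j] - rand_pt)
        new.append(seed_pt + rng.random() * (tmp - seed_pt))
    return np.array(new)
```

Because every synthetic point in this sketch is a convex combination of two sub-cluster members and the sub-cluster center, it stays inside the region spanned by the denoised minority samples, which is one plausible way to limit the noise and boundary blurring the paper is concerned with.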