Imbalanced Data Classification Based on Improved Random-SMOTE and Feature Standard Deviation

Ying Zhang, Li Deng, Bo Wei

DOI: https://doi.org/10.3390/math12111709

IF: 2.4

2024-05-31

Mathematics

Abstract: Oversampling techniques are widely used to rebalance imbalanced datasets. However, most oversampling methods may introduce noise and fuzzy class boundaries, leading to overfitting. To solve this problem, we propose a new method (FSDR-SMOTE) based on Random-SMOTE and Feature Standard Deviation for rebalancing imbalanced datasets. The method first removes noisy samples based on the Tukey criterion. It then calculates the feature standard deviation, which reflects the degree of data dispersion, to detect each sample's location and classify the samples into boundary samples and safety samples. Next, the K-means clustering algorithm is employed to partition the minority class samples into several sub-clusters. Within each sub-cluster, new samples are generated based on random samples, boundary samples, and the corresponding sub-cluster center. The experimental results show that, on the 20 benchmark datasets selected from the UCI machine learning library, FSDR-SMOTE obtains average values of 93.31% for F-measure, 93.16% for G-mean, and 86.53% for MCC.
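The three scores reported in the abstract can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions, treating the minority class as the positive class and taking the F-measure to be the balanced F1 score. Both choices are assumptions, since this page does not give the paper's exact formulas, and the function name `imbalance_metrics` is hypothetical.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Return (F-measure, G-mean, MCC) from binary confusion-matrix counts.

    Assumption: the minority class is the positive class and the
    F-measure is F1 (beta = 1); the paper may define these differently.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F-measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)

    # G-mean: geometric mean of the per-class recalls.
    g_mean = math.sqrt(recall * specificity)

    # Matthews correlation coefficient; 0.0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return f_measure, g_mean, mcc
```

G-mean and MCC are commonly preferred over plain accuracy on imbalanced data because a classifier that ignores the minority class scores zero on both, while its accuracy can still look high.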


What problem does this paper attempt to address?

This paper addresses a weakness of existing oversampling methods in imbalanced data classification: they may introduce noise and blur class boundaries, which leads to overfitting. Traditional classification algorithms usually assume that the classes are roughly balanced during training, so they perform poorly on imbalanced data. Because the majority class has far more samples, the model mainly learns majority-class characteristics and is biased towards that class at prediction time. Its ability to recognize minority-class samples declines, it may even ignore the minority class entirely, and the result is misclassification in practical applications.

To overcome these problems, the authors propose a new method, FSDR-SMOTE, which rebalances imbalanced datasets based on an improved Random-SMOTE and the feature standard deviation. The main steps of FSDR-SMOTE are:

1. **Data preprocessing**: Use the Tukey criterion to remove noise samples, and use the K-means clustering algorithm to cluster the minority-class samples.
2. **Boundary sample screening**: Detect sample positions by calculating the feature standard deviation, and divide the minority-class samples into boundary samples and safe samples.
3. **New sample synthesis**: Generate new samples within each sub-cluster based on random samples, boundary samples, and the corresponding sub-cluster center.

Experimental results show that FSDR-SMOTE outperforms other oversampling methods on multiple benchmark datasets, with significant improvements in F-measure, G-mean, and MCC.
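The noise-removal and synthesis steps summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`tukey_filter`, `centroid`, `synthesize`) are hypothetical, the Tukey fences use the conventional multiplier of 1.5 applied per feature, K-means is skipped so the input is treated as a single sub-cluster, and the boundary/safe screening is omitted because this page does not give its formula. New points are simply interpolated between a randomly chosen sample and the cluster center.

```python
import random
import statistics

def tukey_filter(samples, k=1.5):
    """Drop samples with any feature outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: fences are computed independently for each feature.
    """
    fences = []
    for column in zip(*samples):
        q1, _, q3 = statistics.quantiles(column, n=4)
        iqr = q3 - q1
        fences.append((q1 - k * iqr, q3 + k * iqr))
    return [s for s in samples
            if all(lo <= x <= hi for x, (lo, hi) in zip(s, fences))]

def centroid(samples):
    """Per-feature mean of a list of samples."""
    return [statistics.fmean(column) for column in zip(*samples)]

def synthesize(cluster, n_new, rng):
    """Create n_new points on segments from random samples to the center."""
    center = centroid(cluster)
    new_samples = []
    for _ in range(n_new):
        base = rng.choice(cluster)
        gap = rng.random()  # interpolation factor in [0, 1)
        new_samples.append([b + gap * (c - b) for b, c in zip(base, center)])
    return new_samples

# Usage: eight plausible minority samples plus one obvious outlier.
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.0],
            [1.0, 2.2], [1.3, 1.8], [0.8, 2.0], [1.1, 2.1],
            [9.0, 9.0]]
clean = tukey_filter(minority)
synthetic = synthesize(clean, n_new=5, rng=random.Random(0))
```

Pulling new points towards the sub-cluster center keeps them inside the region already occupied by the minority class, which is one plausible way to limit the boundary blurring that the summary attributes to other oversampling methods.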

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

Imbalanced Data Classification Algorithm Based on Integrated Sampling and Ensemble Learning

A Classification Method for Imbalanced Data Sets Based on Kernel SMOTE

An oversampling FCM-KSMOTE algorithm for imbalanced data classification

DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering

Over-sampling algorithm for imbalanced data classification

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

An Over Sampling Method of Unbalanced Data Based on Ant Colony Clustering

Resampling approach for imbalanced data classification based on class instance density per feature value intervals

Improved SVM algorithm for imbalanced dataset classification

An Empirical Study on the Joint Impact of Feature Selection and Data Resampling on Imbalance Classification

Hybrid SVM algorithm oriented to classifying imbalanced datasets

A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

A Novel Adaptive Minority Oversampling Technique for Improved Classification in Data Imbalanced Scenarios

A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data

An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

An Improved Algorithm for Imbalanced Data and Small Sample Size Classification

Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering

An Improving Majority Weighted Minority Oversampling Technique for Imbalanced Classification Problem