Abstract: Imbalanced data poses a significant challenge in machine learning: conventional classification algorithms tend to favor majority class samples, even though correctly classifying minority class samples is often more important. The synthetic minority oversampling technique (SMOTE) is one of the best-known methods for handling imbalanced data. However, SMOTE and its variants give insufficient consideration to the data distribution, which leads them to generate incorrect and unnecessary samples. This paper therefore introduces a novel oversampling algorithm called data distribution and spectral clustering-based SMOTE (DDSC-SMOTE). The algorithm addresses the shortcomings of SMOTE with three data distribution-based improvement strategies: adaptive allocation of the number of synthetic samples, adaptive selection of seed samples, and an improved sample synthesis step. First, we use the k-nearest neighbor sample labels and the local outlier factor algorithm to remove noisy and outlier samples. Next, we apply spectral clustering to identify clusters within the minority class and propose a dual-weight factor that considers inter-cluster and intra-cluster distances when allocating the number of synthetic samples, thereby addressing both interclass and intraclass imbalance. Furthermore, we introduce a relative position weight coefficient that determines the probability of selecting each seed sample within a subcluster, so that important minority samples have a higher chance of being sampled. Finally, we improve the SMOTE sample synthesis formula for safer generation. Extensive comparisons on real datasets from the UCI repository demonstrate that DDSC-SMOTE significantly outperforms seven state-of-the-art oversampling algorithms in terms of G-mean and F1-score, offering a data distribution-focused solution to the challenges of imbalanced data.
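To make the baseline concrete, the following is a minimal sketch of the two generic building blocks the abstract starts from: a k-nearest neighbor label filter and the standard SMOTE interpolation x_new = x + λ(x_nn − x), with λ drawn uniformly from [0, 1). It is not the DDSC-SMOTE procedure itself. The local outlier factor step, the spectral clustering, the dual-weight factor, the relative position weight coefficient, and the improved synthesis formula are not specified in the abstract and are omitted. The function names, the parameter `k`, and the rule "drop a minority sample whose k nearest neighbors are all majority" are assumptions made for illustration.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]


def filter_noise(minority, majority, k=3):
    """Keep minority samples that have at least one minority neighbour.

    Assumed rule for illustration only: a minority sample whose k nearest
    neighbours all belong to the majority class is treated as noise.
    """
    kept = []
    for i, p in enumerate(minority):
        # Label 1 = minority, 0 = majority; the sample itself is excluded.
        others = [(q, 1) for j, q in enumerate(minority) if j != i]
        others += [(q, 0) for q in majority]
        nearest = sorted(others, key=lambda qy: math.dist(p, qy[0]))[:k]
        if any(label == 1 for _, label in nearest):
            kept.append(p)
    return kept


def smote(minority, n_new, k=3, rng=None):
    """Standard SMOTE: interpolate between a seed and one of its neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))  # uniform seed selection (baseline)
        seed = minority[i]
        pool = [q for j, q in enumerate(minority) if j != i]
        neighbour = rng.choice(k_nearest(seed, pool, k))
        lam = rng.random()  # lambda in [0, 1)
        synthetic.append(
            tuple(s + lam * (n - s) for s, n in zip(seed, neighbour))
        )
    return synthetic


# Toy data: four minority points in a unit square, plus one minority point
# surrounded by majority samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
majority = [(10.0, 11.0), (11.0, 10.0), (9.0, 10.0), (10.0, 9.0), (11.0, 11.0)]

clean = filter_noise(minority, majority, k=3)
new_samples = smote(clean, n_new=6, k=3)
```

In this toy example the isolated minority point at (10, 10) is filtered out, so every synthetic sample lies on a segment between two corners of the unit square. The baseline selects seeds uniformly and spreads new samples without regard to cluster structure, which is the behavior the allocation and seed selection strategies described above are intended to replace.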

DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering
