Abstract: Imbalanced data poses a significant challenge in machine learning: conventional classification algorithms tend to favor majority class samples, even though accurately classifying minority class samples is usually more important. The synthetic minority oversampling technique (SMOTE) is one of the best-known methods for handling imbalanced data. However, SMOTE and its variants give insufficient consideration to the data distribution, which leads them to generate incorrect and unnecessary samples. This paper therefore introduces a novel oversampling algorithm called data distribution and spectral clustering-based SMOTE (DDSC-SMOTE). The algorithm addresses the shortcomings of SMOTE with three data distribution-based improvement strategies: adaptive allocation of synthetic sample quantities, adaptive selection of seed samples, and improved sample synthesis. First, we use the k-nearest neighbor sample labels and the local outlier factor algorithm to remove noisy and outlier samples. Next, we use spectral clustering to identify clusters within the minority class and propose a dual-weight factor, based on inter-cluster and intra-cluster distances, that allocates the number of synthetic samples to each cluster, addressing both interclass and intraclass imbalance. Furthermore, we introduce a relative position weight coefficient that determines the probability of selecting each seed sample within a subcluster, so that important minority samples are more likely to be sampled. Finally, we improve the SMOTE sample synthesis formula for safer generation. Extensive comparisons on real datasets from the UCI repository demonstrate that DDSC-SMOTE significantly outperforms seven state-of-the-art oversampling algorithms in terms of G-mean and F1-score, presenting a data distribution-focused solution to imbalanced data challenges.
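The abstract gives no formulas, so the following is only a minimal sketch of the two ingredients it builds on: a k-nearest-neighbor label check for noisy minority samples, and the baseline SMOTE interpolation that DDSC-SMOTE modifies. The function names (`knn_indices`, `filter_noisy_minority`, `smote`) and the rule "a minority sample is noise if none of its k neighbors is a minority sample" are assumptions made for illustration. The local outlier factor step, spectral clustering, the dual-weight factor, the relative position weight coefficient, and the improved synthesis formula are not reproduced here.

```python
import math
import random

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def filter_noisy_minority(X, y, minority=1, k=3):
    """Keep minority samples that have at least one minority neighbour.

    Assumed noise rule (not taken from the paper): a minority sample whose
    k nearest neighbours all belong to other classes is discarded.
    """
    kept = []
    for i, label in enumerate(y):
        if label != minority:
            continue
        if any(y[j] == minority for j in knn_indices(X, i, k)):
            kept.append(X[i])
    return kept

def smote(minority_points, n_new, k=3, rng=None):
    """Baseline SMOTE: x_new = x_seed + u * (x_neighbour - x_seed), u ~ U[0, 1)."""
    if len(minority_points) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = rng or random.Random(0)
    k = min(k, len(minority_points) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority_points))           # uniform seed choice
        j = rng.choice(knn_indices(minority_points, i, k))
        u = rng.random()
        seed, neigh = minority_points[i], minority_points[j]
        synthetic.append(tuple(s + u * (n - s) for s, n in zip(seed, neigh)))
    return synthetic

# Toy data: three clustered minority points, one minority point isolated
# inside the majority region, and four majority points.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0),
     (4.8, 5.1), (5.2, 4.9), (5.1, 5.2), (4.9, 4.8)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

clean = filter_noisy_minority(X, y, minority=1, k=3)  # drops (5.0, 5.0)
new = smote(clean, n_new=4, k=2)                      # 4 synthetic points
```

In this sketch every seed is chosen uniformly and every synthetic point lies on the segment between two retained minority samples. According to the abstract, DDSC-SMOTE changes exactly these choices: how many samples each cluster receives, which seeds are preferred, and how the new point is placed.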

TDMO: Dynamic Multi-Dimensional Oversampling for Exploring Data Distribution Based on Extreme Gradient Boosting Learning

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering

Over-sampling algorithm for imbalanced data classification

A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning

A Classification Method For Imbalance Data Set Based on Kernel SMOTE

SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling

SP-SMOTE: A novel space partitioning based synthetic minority oversampling technique

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

WOTBoost: Weighted Oversampling Technique in Boosting for imbalanced learning

A Novel Adaptive Minority Oversampling Technique for Improved Classification in Data Imbalanced Scenarios

Weighted Oversampling Algorithms for Imbalanced Problems and Application in Prediction of Streamflow

ExNN-SMOTE - Extended Natural Neighbors Based SMOTE to Deal with Imbalanced Data

Research on Datamining Method for Imbalanced Dataset Based on Improved SMOTE

A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data

Binary imbalanced data classification based on diversity oversampling by generative models

Global Data Distribution Weighted Synthetic Oversampling Technique for Imbalanced Learning

ASN-SMOTE: a Synthetic Minority Oversampling Method with Adaptive Qualified Synthesizer Selection

Increasing Oversampling Diversity for Long-Tailed Visual Recognition

Oversampling With Reliably Expanding Minority Class Regions for Imbalanced Data Learning

Augmenting the diversity of imbalanced datasets via multi-vector stochastic exploration oversampling