Abstract: Imbalanced data poses a significant challenge in machine learning because conventional classification algorithms tend to favor majority-class samples, even though correctly classifying minority-class samples is often more important. The synthetic minority oversampling technique (SMOTE) is one of the best-known methods for handling imbalanced data. However, SMOTE and its variants give insufficient consideration to the data distribution, which leads them to generate incorrect and unnecessary samples. This paper therefore introduces a novel oversampling algorithm called data distribution and spectral clustering-based SMOTE (DDSC-SMOTE). The algorithm addresses the shortcomings of SMOTE through three data distribution-based improvement strategies: adaptive allocation of synthetic sample quantities, adaptive seed sample selection, and improved sample synthesis. First, we use the k-nearest neighbor sample labels and the local outlier factor algorithm to remove noisy and outlier samples. Next, we use spectral clustering to identify clusters within the minority class and propose a dual-weight factor that considers inter-cluster and intra-cluster distances to allocate the number of synthetic samples, addressing both interclass and intraclass imbalance. Furthermore, we introduce a relative position weight coefficient that determines the probability of selecting each seed sample within a subcluster, so that important minority samples have a higher chance of being sampled. Finally, we improve the SMOTE sample synthesis formula for safer generation. Extensive comparisons on real datasets from the UCI repository demonstrate that DDSC-SMOTE significantly outperforms seven state-of-the-art oversampling algorithms in terms of G-mean and F1-score, presenting a data distribution-focused solution to the challenges of imbalanced data.
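
For orientation, the baseline SMOTE synthesis step that the abstract says DDSC-SMOTE improves on can be sketched as follows. This is a minimal pure-Python illustration of classic SMOTE interpolation, x_new = x + λ(x_nn − x) with λ drawn uniformly from [0, 1), and not the authors' algorithm. The function name `smote_baseline` and its parameters are assumptions made for illustration. The noise filtering, spectral clustering, dual-weight allocation, weighted seed selection, and modified synthesis formula are not reproduced, because the abstract does not specify them.

```python
import math
import random


def smote_baseline(minority, n_new, k=5, seed=0):
    """Classic SMOTE interpolation (the baseline, not DDSC-SMOTE itself).

    minority: list of equal-length numeric feature vectors (minority class only).
    n_new:    number of synthetic samples to generate.
    k:        number of nearest minority neighbours considered per seed.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Baseline SMOTE picks the seed sample uniformly at random; the
        # abstract describes DDSC-SMOTE as weighting this choice instead.
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of the seed, excluding the seed itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        lam = rng.random()  # interpolation coefficient in [0, 1)
        # The new point lies on the segment between the seed and its neighbour.
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nn)])
    return synthetic
```

Calling `smote_baseline(minority, n_new=10, k=2)` on a small list of minority vectors returns ten new vectors, each lying on a line segment between two existing minority samples. Because seeds and neighbours are chosen without regard to noise, cluster structure, or class overlap, this baseline illustrates the kind of distribution-blind sampling the abstract identifies as a limitation.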

DDSC-SMOTE: an imbalanced data oversampling algorithm based on data distribution and spectral clustering