Abstract: Imbalanced learning is a common problem in data mining. In imbalanced datasets, the data samples are distributed unevenly across classes, which poses a challenge for standard algorithms designed for balanced class distributions. Although various strategies address this problem, generating artificial data to achieve a relatively balanced class distribution is more general than directly modifying specific classification algorithms, because the oversampled data can be combined with any user-specified algorithm without restriction. In this paper, we present a novel oversampling method, the Global Data Distribution Weighted Synthetic Oversampling Technique (GDDSYN). By applying clustering, GDDSYN optimizes the selection criteria for the minority class samples used to generate synthetic samples and avoids generating additional noise samples. GDDSYN assigns weights that determine the number of synthetic samples, according to the informative level of each sample and the sparsity of the cluster to which it belongs, so as to tackle within-class and between-class imbalance simultaneously. Scoring with the Silhouette Coefficient and Mutual Information helps the k-means algorithm set a reasonable number of clusters for the minority and majority classes respectively, so that clustering quality is ensured. Using the clustering information, the generation path of synthetic samples is then improved to avoid class overlap. GDDSYN has been evaluated extensively on 10 artificial and 10 real-world data sets. The empirical results show that, when artificial data generated by GDDSYN are used, our method outperforms or is comparable to several existing methods in terms of the assessment metrics.
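The cluster-weighted generation idea described in the abstract can be illustrated with a minimal sketch. This is not the GDDSYN algorithm itself: the abstract does not give the weighting formula, so the sketch assumes that minority samples have already been clustered (for example by k-means), uses the mean pairwise distance within a cluster as a stand-in for its sparsity, and omits the per-sample informative level entirely. The function names `allocate_counts` and `oversample` are hypothetical.

```python
import math
import random


def allocate_counts(clusters, n_new):
    """Split n_new synthetic samples across clusters, favoring sparse clusters.

    Assumption: sparsity is approximated by the mean pairwise distance
    inside each cluster. The paper's actual weighting is not specified here.
    """
    sparsity = []
    for pts in clusters:
        if len(pts) < 2:
            sparsity.append(0.0)  # a singleton has no measurable spread
            continue
        dists = [math.dist(a, b) for i, a in enumerate(pts) for b in pts[i + 1:]]
        sparsity.append(sum(dists) / len(dists))

    total = sum(sparsity)
    if total == 0:
        weights = [1.0 / len(clusters)] * len(clusters)  # fall back to uniform
    else:
        weights = [s / total for s in sparsity]

    counts = [int(n_new * w) for w in weights]
    # Hand out any samples lost to rounding, starting with the sparsest cluster.
    order = sorted(range(len(clusters)), key=lambda i: weights[i], reverse=True)
    for j in range(n_new - sum(counts)):
        counts[order[j % len(order)]] += 1
    return counts


def oversample(clusters, n_new, seed=0):
    """Generate n_new points by interpolating between two points of the same cluster.

    Restricting both endpoints to one cluster keeps each synthetic point on a
    segment inside that cluster, which is one simple way to limit overlap.
    """
    rng = random.Random(seed)
    synthetic = []
    for pts, count in zip(clusters, allocate_counts(clusters, n_new)):
        for _ in range(count):
            a, b = rng.choice(pts), rng.choice(pts)
            t = rng.random()
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


if __name__ == "__main__":
    # Two hypothetical minority clusters: one dense, one sparse.
    minority_clusters = [
        [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
        [(10.0, 10.0), (14.0, 10.0), (10.0, 14.0)],
    ]
    print(allocate_counts(minority_clusters, 10))
    print(oversample(minority_clusters, 4))
```

In this sketch the sparser cluster receives more synthetic samples, which addresses within-class imbalance, while the total `n_new` would be chosen to close the gap between the minority and majority class sizes.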

Generative adversarial minority enlargement—A local linear over-sampling synthetic method

A new imbalanced data oversampling method based on Bootstrap method and Wasserstein Generative Adversarial Network

An improved generative adversarial network to oversample imbalanced datasets

A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning

An ensemble oversampling method for imbalanced classification with prior knowledge via generative adversarial network

GenSample: A Genetic Algorithm for Oversampling in Imbalanced Datasets

DGM: a data generative model to improve minority class presence in anomaly detection domain

Enhancing and improving the performance of imbalanced class data using novel GBO and SSG: A comparative analysis

Distribution Enhancement for Imbalanced Data with Generative Adversarial Network

CEGAN: Classification Enhancement Generative Adversarial Networks for unraveling data imbalance problems

Minimum Enclosing Ball Synthetic Minority Oversampling Technique from a Geometric Perspective

Constructing small sample datasets with game mixed sampling and improved genetic algorithm

SORAG: Synthetic Data Over-Sampling Strategy on Multi-Label Graphs

A hybrid sampling method for highly imbalanced and overlapped data classification with complex distribution

Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Synthetic minority class data by generative adversarial network for imbalanced SAR target recognition

Over-sampling algorithm for imbalanced data classification

Synthetic oversampling with Mahalanobis distance and local information for highly imbalanced class-overlapped data

Global Data Distribution Weighted Synthetic Oversampling Technique for Imbalanced Learning

BSGAN: A Novel Oversampling Technique for Imbalanced Pattern Recognitions

IDA-GAN: A Novel Imbalanced Data Augmentation GAN