Abstract: Imbalanced learning is a common problem in data mining. In imbalanced datasets, the data samples are distributed unevenly among the classes, which poses a challenge for standard algorithms designed for balanced class distributions. Although there are various strategies for solving this problem, generating artificial data to achieve a relatively balanced class distribution is more general than directly modifying specific classification algorithms, because the oversampled data can be combined with any user-specified algorithm without restriction. In this paper, we present a novel oversampling method, the Global Data Distribution Weighted Synthetic Oversampling Technique (GDDSYN). By applying clustering, GDDSYN optimizes the criteria for selecting the minority class samples used to generate synthetic samples, which avoids generating additional noise samples. GDDSYN assigns weights to the number of synthetic samples according to the informative level of each sample and the sparsity of the cluster to which it belongs, tackling within-class and between-class imbalance simultaneously. Silhouette Coefficient and Mutual Information scores help the k-means algorithm set a reasonable number of clusters for the minority and majority classes respectively, so that a good clustering result can be obtained. Using the clustering information, the generation path of the synthetic samples is then improved to avoid class overlap. GDDSYN has been evaluated extensively on 10 artificial and 10 real-world data sets. The empirical results show that, when the artificial data generated by GDDSYN are used, our method outperforms or is comparable with some other existing methods in terms of the assessment metrics.
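The idea of weighting the number of synthetic samples by cluster sparsity can be illustrated with a minimal sketch. This is not the GDDSYN implementation: it assumes the minority class has already been clustered, uses mean pairwise distance as a stand-in sparsity measure, ignores the per-sample informative level, and generates each synthetic point by interpolating between two members of the same cluster so that it stays inside that cluster. The function names `sparsity`, `allocate`, and `oversample` are hypothetical.

```python
import math
import random


def sparsity(points):
    """Assumed sparsity measure: mean pairwise Euclidean distance in a cluster."""
    n = len(points)
    if n < 2:
        return 0.0
    total = sum(math.dist(points[i], points[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def allocate(clusters, n_new):
    """Split n_new synthetic samples across clusters in proportion to sparsity."""
    scores = [sparsity(c) for c in clusters]
    total = sum(scores)
    if total == 0:
        # All clusters are degenerate: fall back to equal weights.
        weights = [1.0 / len(clusters)] * len(clusters)
    else:
        weights = [s / total for s in scores]
    counts = [int(n_new * w) for w in weights]
    # Hand the rounding remainder to the clusters with the largest fractions.
    remainder = n_new - sum(counts)
    order = sorted(range(len(clusters)),
                   key=lambda i: n_new * weights[i] - counts[i],
                   reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return counts


def oversample(clusters, n_new, seed=0):
    """Generate n_new points, each interpolated between two same-cluster members."""
    rng = random.Random(seed)
    synthetic = []
    for cluster, count in zip(clusters, allocate(clusters, n_new)):
        for _ in range(count):
            a, b = rng.choice(cluster), rng.choice(cluster)
            t = rng.random()
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Two hypothetical minority clusters: a dense one and a sparse one.
minority_clusters = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (14.0, 10.0)]]
counts = allocate(minority_clusters, 10)
new_points = oversample(minority_clusters, 10)
```

In this sketch the sparser cluster receives more of the synthetic samples, which is one simple way to address within-class imbalance; restricting interpolation to pairs from the same cluster keeps new points away from regions between clusters where other classes may lie.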

Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification

Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning

Statistics Enhancement Generative Adversarial Networks for Diverse Conditional Image Synthesis

A new imbalanced data oversampling method based on Bootstrap method and Wasserstein Generative Adversarial Network

An ensemble oversampling method for imbalanced classification with prior knowledge via generative adversarial network

Distribution Enhancement for Imbalanced Data with Generative Adversarial Network

An improved generative adversarial network to oversample imbalanced datasets

Global Data Distribution Weighted Synthetic Oversampling Technique for Imbalanced Learning

WGAN-Based Synthetic Minority Over-Sampling Technique: Improving Semantic Fine-Grained Classification for Lung Nodules in CT Images

An intra-class distribution-focused generative adversarial network approach for imbalanced tabular data learning

Gene-CWGAN: a data enhancement method for gene expression profile based on improved CWGAN-GP

IB-GAN: A Unified Approach for Multivariate Time Series Classification under Class Imbalance

Oversampling Imbalanced Data Based on Convergent WGAN for Network Threat Detection

Annealing Genetic GAN for Imbalanced Web Data Learning

Over-sampling method for tackling class imbalance in software defect prediction based on generative adversarial networks

CEGAN: Classification Enhancement Generative Adversarial Networks for unraveling data imbalance problems

Modified-generative adversarial networks for imbalance text classification

Enhancing supervised analysis of imbalanced untargeted metabolomics datasets using a CWGAN-GP framework for data augmentation

A hybrid sampling method for highly imbalanced and overlapped data classification with complex distribution

An Improved D2GAN‐based oversampling algorithm for imbalanced data classification

Generative adversarial minority enlargement—A local linear over-sampling synthetic method