Gaussian Mixture Conditional Tabular Generative Adversarial Network for Data Imbalance Problem

Yongwei Ke, Jiali Cheng, Zhiqiang Cai
DOI: https://doi.org/10.1109/srse59585.2023.10336134
2023-01-01
Abstract: Collected data often contain very different numbers of samples per class. This data imbalance causes serious difficulties for machine learning algorithms in prediction tasks. Many effective oversampling algorithms have been proposed to address it, but few pay attention to clustering analysis on data labels. This paper proposes a two-stage oversampling method, the Gaussian Mixture Conditional Tabular Generative Adversarial Network (GMM_CTGAN), which improves the Conditional Tabular Generative Adversarial Network (CTGAN) with the Gaussian Mixture Model (GMM). First, GMM is used as a clustering algorithm to divide the original dataset into multiple subsets. Second, CTGAN generates synthetic data for each class independently. Finally, the synthetic data of all classes and the original data are combined to form the final training dataset. Experimental results show that the proposed method outperforms the other methods and effectively addresses the data imbalance problem.
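The cluster, generate, and merge flow described in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the abstract does not say how clusters and classes interact, so the sketch assumes each under-represented class is clustered on its own and synthetic rows are allotted to clusters in proportion to their size. The names `gmm_cluster`, `gaussian_sampler`, and `oversample` are hypothetical, the mixture model is a small diagonal-covariance EM written with the standard library, and a per-feature Gaussian sampler stands in for CTGAN, which a real pipeline would train on each subset instead.

```python
import math
import random
from collections import Counter


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_cluster(rows, k, iters=30, seed=0):
    """Fit a diagonal GMM with EM and return a hard cluster index per row."""
    rng = random.Random(seed)
    dim = len(rows[0])
    means = [list(r) for r in rng.sample(rows, k)]
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for x in rows:
            logs = [math.log(w) + log_gauss(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logs)
            exps = [math.exp(value - top) for value in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means, and (floored) variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-9
            weights[j] = nj / len(rows)
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, rows)) / nj
                        for d in range(dim)]
            variances[j] = [
                sum(r[j] * (x[d] - means[j][d]) ** 2
                    for r, x in zip(resp, rows)) / nj + 1e-6
                for d in range(dim)]
    return [max(range(k), key=lambda j: r[j]) for r in resp]


def gaussian_sampler(rows, n, rng):
    """ASSUMPTION: simple stand-in for CTGAN, not a GAN."""
    dim = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    stds = [math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / len(rows))
            for d in range(dim)]
    return [[rng.gauss(means[d], stds[d]) for d in range(dim)]
            for _ in range(n)]


def oversample(rows, labels, k=2, seed=0):
    """Balance classes: cluster each minority class, generate, then merge."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        deficit = target - count
        if deficit == 0:
            continue
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        n_comp = min(k, len(cls_rows))
        assign = gmm_cluster(cls_rows, n_comp, seed=seed)
        clusters = [[r for r, a in zip(cls_rows, assign) if a == j]
                    for j in range(n_comp)]
        clusters = [c for c in clusters if c]
        made = 0
        for i, cluster in enumerate(clusters):
            # Proportional share, rounded down; the last cluster
            # absorbs whatever remains of the deficit.
            if i == len(clusters) - 1:
                n_new = deficit - made
            else:
                n_new = deficit * len(cluster) // len(cls_rows)
            made += n_new
            out_rows.extend(gaussian_sampler(cluster, n_new, rng))
            out_labels.extend([cls] * n_new)
    return out_rows, out_labels
```

Calling `oversample(rows, labels)` on a list of numeric feature rows and a parallel list of labels returns the original rows followed by the synthetic ones, with every class brought up to the size of the largest class. Generating per cluster rather than per class is meant to keep a multi-modal minority class from being smeared into a single blob.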