An intra-class distribution-focused generative adversarial network approach for imbalanced tabular data learning

Qiuling Chen, Ayong Ye, Yuexin Zhang, Jianwei Chen, Chuan Huang
DOI: https://doi.org/10.1007/s13042-023-02048-5
2024-01-04
International Journal of Machine Learning and Cybernetics
Abstract: Data imbalance is a critical factor that degrades the performance of machine learning algorithms. It shifts decision boundaries, biasing predictions toward the majority class and causing the minority class to be misclassified. Although oversampling the minority class with deep generative models is a popular strategy, many existing methods focus solely on augmenting minority-class data and overlook the distribution relationships within and between classes. We therefore propose an oversampling method that combines unsupervised clustering with a generative adversarial network (GAN) for learning from imbalanced tabular data. First, we cluster the original data as a preprocessing step, remove clusters that do not require sampling, and generate more samples for sparsely distributed minority-class clusters so that samples are balanced within the minority class. Second, we design a CTGAN-based auxiliary classifier GAN (ACCTGAN) to generate minority-class samples; it enhances the semantic integrity of the synthetic data and avoids generating noisy samples. We validated our approach against 7 typical methods on 12 real tabular datasets. Our method performs well on F1-measure and area under the curve (AUC), obtaining 19 and 20 best results, respectively, across the three classifiers. It significantly improves classification results and demonstrates good robustness and stability.
computer science, artificial intelligence
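
The cluster-level preprocessing described in the abstract (drop clusters that need no sampling, then give sparser minority clusters a larger share of the synthetic samples) can be illustrated with a minimal sketch. The abstract does not state the actual filtering rule or weighting formula, so everything here is an assumption for illustration: the function name `allocate_synthetic_samples`, the `min_minority_ratio` threshold used to decide which clusters are kept, and the inverse-count weighting used as a stand-in for "sparsity". Cluster assignments are taken as given (for example, from any unsupervised clustering step), and the generator itself is out of scope.

```python
from collections import Counter


def allocate_synthetic_samples(cluster_ids, labels, minority_label,
                               n_to_generate, min_minority_ratio=0.5):
    """Split a synthetic-sample budget across minority clusters.

    Hypothetical rule, not the paper's: a cluster is kept only if it
    contains minority samples and their share is at least
    `min_minority_ratio`; kept clusters are weighted by the inverse of
    their minority count, so sparser clusters receive more samples.
    """
    total = Counter(cluster_ids)
    minority = Counter(
        c for c, y in zip(cluster_ids, labels) if y == minority_label
    )

    # Filter out clusters that are assumed not to need oversampling.
    kept = [
        c for c in total
        if minority[c] > 0 and minority[c] / total[c] >= min_minority_ratio
    ]
    if not kept:
        return {}

    # Inverse-count weights: fewer minority points -> larger weight.
    weights = {c: 1.0 / minority[c] for c in kept}
    weight_sum = sum(weights.values())
    exact = {c: n_to_generate * w / weight_sum for c, w in weights.items()}

    # Largest-remainder rounding so the quotas sum to the budget exactly.
    quota = {c: int(exact[c]) for c in kept}
    leftover = n_to_generate - sum(quota.values())
    by_remainder = sorted(kept, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota


if __name__ == "__main__":
    # Toy data: cluster 0 is a dense minority cluster, cluster 1 is sparse,
    # cluster 2 contains only majority samples and is dropped.
    cluster_ids = [0, 0, 0, 0, 1, 1, 2, 2, 2]
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]
    print(allocate_synthetic_samples(cluster_ids, labels,
                                     minority_label=1, n_to_generate=10))
```

In a full pipeline, each per-cluster quota would then be passed to a conditional generator (ACCTGAN in the paper) to produce that many minority samples; how the paper conditions generation on clusters is not specified in the abstract.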