Evaluating the Utility of GAN Generated Synthetic Tabular Data for Class Balancing and Low Resource Settings

Nagarjuna Chereddy, Bharath Kumar Bolla
DOI: https://doi.org/10.1007/978-3-031-36402-0_4
2023-06-24
Abstract: The present study aimed to address the issue of imbalanced data in classification tasks and evaluated the suitability of SMOTE, ADASYN, and GAN techniques in generating synthetic data to address the class imbalance and improve the performance of classification models in low-resource settings. The study employed the Generalised Linear Model (GLM) algorithm for class balancing experiments and the Random Forest (RF) algorithm for low-resource setting experiments to assess model performance under varying training data. Recall was the primary evaluation metric for all classification models. The results of the class balancing experiments showed that the GLM model trained on GAN-balanced data achieved the highest recall value. Similarly, in low-resource experiments, models trained on data enhanced with GAN-synthesized data exhibited better recall values than models trained on the original data. These findings demonstrate the potential of GAN-generated synthetic data for addressing the challenge of imbalanced data in classification tasks and improving model performance in low-resource settings.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses class imbalance in classification tasks. It evaluates how well three techniques for generating synthetic data (SMOTE, ADASYN, and GAN) correct the imbalance and improve classifier performance in low-resource settings. The study uses the Generalized Linear Model (GLM) algorithm for the class balancing experiments and the Random Forest (RF) algorithm for the low-resource experiments, assessing model performance under different training data conditions. Recall is the primary evaluation metric. In the class balancing experiments, the GLM model trained on GAN-balanced data achieved the highest recall. In the low-resource experiments, models trained on data augmented with GAN-synthesized samples likewise achieved higher recall than models trained on the original data. These findings indicate that GAN-generated synthetic data has potential both for correcting class imbalance and for improving model performance when training data is scarce.
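To make the balancing step and the recall metric concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and it is not a GAN: it illustrates only the SMOTE-style idea of interpolating between a minority sample and one of its nearest minority neighbours, together with the standard definition of recall. The function names, the toy data, and the parameters `k` and `seed` are all hypothetical and chosen for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows (SMOTE-style sketch).

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base row, excluding the row itself.
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the positive (here: minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 8 majority rows and 3 minority rows.
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Balance the classes by adding synthetic minority rows.
new_rows = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_rows
```

After this step, `balanced_minority` has as many rows as `majority`, and a classifier trained on the combined data can be scored with `recall` on a held-out test set of real (not synthetic) rows. A GAN-based approach would replace `smote_like_oversample` with samples drawn from a trained generator; the evaluation step would stay the same.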