GANs in the Panorama of Synthetic Data Generation Methods

Bruno Vaz,Álvaro Figueira
DOI: https://doi.org/10.1145/3657294
2024-04-10
Abstract:This paper focuses on the creation and evaluation of synthetic data to address the challenges of imbalanced datasets in machine learning applications (ML), using fake news detection as a case study. We conducted a thorough literature review on generative adversarial networks (GANs) for tabular data, synthetic data generation methods, and synthetic data quality assessment. By augmenting a public news dataset with synthetic data generated by different GAN architectures, we demonstrate the potential of synthetic data to improve ML models’ performance in fake news detection. Our results show a significant improvement in classification performance, especially in the underrepresented class. We also modify and extend a data usage approach to evaluate the quality of synthetic data and investigate the relationship between synthetic data quality and data augmentation performance in classification tasks. We found a positive correlation between synthetic data quality and performance in the underrepresented class, highlighting the importance of high-quality synthetic data for effective data augmentation.
computer science, information systems, theory & methods, software engineering
What problem does this paper attempt to address?