Abstract: According to a report published by Gartner in 2021, a significant portion of Machine Learning (ML) training data will be artificially generated. This development has led to the emergence of various synthetic data generators (SDGs), particularly those based on Generative Adversarial Networks (GANs). Research efforts so far have been exploratory, focused on specific objectives such as validating utility, assessing disclosure control, or examining how generators can decrease or increase inherent bias under differential privacy. Hence, we aim to empirically identify an AI-based data generator that produces datasets closely resembling real datasets, while also determining the hyper-parameters that enable a satisfactory balance between utility, privacy, and fairness in those datasets. To this end, we use the Synthetic Data Vault (SDV), Data Synthesizer, and Smartnoise-synth (SN-synth), three synthetic data generation packages available in Python. The data generation models available within these packages are each presented with 13 tabular datasets, one at a time, as sample inputs for generating synthetic data. We generate synthetic data for every combination of dataset and generator and investigate the goodness of each generator using five hypothetical scenarios. The utility and privacy offered by the generated data are compared with those of the real data, and the fairness of ML models trained on synthetic data serves as a third evaluation metric. Finally, we use the synthetic data to train regression and classification ML algorithms and evaluate their performance. After conducting experiments, analyzing metrics, and comparing ML scores across all 11 generators, we find that CTGAN from SDV and PATECTGAN from the SN-synth package are the most effective in mimicking real data for all 13 datasets used in our research.
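The evaluation loop described in the abstract (generate synthetic data, then compare utility, privacy, and fairness against real data) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the toy per-class Gaussian sampler is a stand-in for a real generator such as CTGAN or PATECTGAN, the helper names are hypothetical, and the three metrics (accuracy when training on synthetic and testing on real data, mean distance to the closest real record, and demographic parity gap) are simplified proxies rather than the metrics used in the paper.

```python
import random
import statistics

def make_real_data(n, rng):
    """Toy 'real' table: rows of (feature, group, label)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        group = rng.randint(0, 1)  # hypothetical protected attribute
        feature = rng.gauss(2.0 if label else -2.0, 1.0)
        rows.append((feature, group, label))
    return rows

def fit_toy_generator(rows):
    """Stand-in generator (NOT a GAN): per-label mean and std of the feature."""
    params = {}
    for label in (0, 1):
        xs = [f for f, _, y in rows if y == label]
        params[label] = (statistics.mean(xs), statistics.stdev(xs))
    return params

def sample_synthetic(params, n, rng):
    """Draw synthetic rows from the fitted per-label Gaussians."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mu, sigma = params[label]
        rows.append((rng.gauss(mu, sigma), rng.randint(0, 1), label))
    return rows

def fit_threshold_classifier(rows):
    """Threshold at the midpoint of the class means (assumes class 1 lies higher)."""
    m0 = statistics.mean(f for f, _, y in rows if y == 0)
    m1 = statistics.mean(f for f, _, y in rows if y == 1)
    return (m0 + m1) / 2.0

def accuracy(threshold, rows):
    """Utility proxy: fraction of rows whose label the threshold rule predicts."""
    return sum((f > threshold) == bool(y) for f, _, y in rows) / len(rows)

def mean_dcr(synth, real):
    """Privacy proxy: mean distance from each synthetic row to its closest real row."""
    return statistics.mean(min(abs(s[0] - r[0]) for r in real) for s in synth)

def demographic_parity_gap(threshold, rows):
    """Fairness proxy: absolute difference in positive-prediction rate between groups."""
    rates = []
    for g in (0, 1):
        preds = [f > threshold for f, grp, _ in rows if grp == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])

rng = random.Random(0)
real_train = make_real_data(500, rng)
real_test = make_real_data(500, rng)
synthetic = sample_synthetic(fit_toy_generator(real_train), 500, rng)

# Train once on real data and once on synthetic data; test both on held-out real data.
clf_real = fit_threshold_classifier(real_train)
clf_synth = fit_threshold_classifier(synthetic)
acc_real = accuracy(clf_real, real_test)
acc_synth = accuracy(clf_synth, real_test)
dcr = mean_dcr(synthetic, real_train)
gap = demographic_parity_gap(clf_synth, real_test)

print(f"accuracy (train real):      {acc_real:.3f}")
print(f"accuracy (train synthetic): {acc_synth:.3f}")
print(f"mean distance to closest real record: {dcr:.4f}")
print(f"demographic parity gap: {gap:.3f}")
```

In a comparison like the one the abstract describes, the same three numbers would be computed for each generator and dataset pair, with the toy sampler replaced by the generator under test. A generator is attractive when its synthetic-trained accuracy stays close to the real-trained accuracy without the distance to the closest real record collapsing toward zero.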

GANs in the Panorama of Synthetic Data Generation Methods

Survey on Synthetic Data Generation, Evaluation Methods and GANs

Data Augmentation Using GANs

Analyzing Effects of Fake Training Data on the Performance of Deep Learning Systems

Comprehensive Exploration of Synthetic Data Generation: A Survey

Machine Learning for Synthetic Data Generation: A Review

Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study

Augmenting data with generative adversarial networks: An overview

Exploring Innovative Approaches to Synthetic Tabular Data Generation

Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

FairGen: Fair Synthetic Data Generation

Tabular Data Synthesis with Generative Adversarial Networks: Design Space and Optimizations

zGAN: An Outlier-focused Generative Adversarial Network For Realistic Synthetic Data Generation

FakeNews: GAN-based generation of realistic 3D volumetric data -- A systematic review and taxonomy

A Methodology and an Empirical Analysis to Determine the Most Suitable Synthetic Data Generator

Survey on Generative Adversarial Behavior in Artificial Neural Tasks

Evaluating the Utility of GAN Generated Synthetic Tabular Data for Class Balancing and Low Resource Settings

Synthetic Data Generation for Fraud Detection using GANs

Synthetic data in biomedicine via generative artificial intelligence

Data Synthesis based on Generative Adversarial Networks