Abstract: According to a report published by Gartner in 2021, a significant portion of Machine Learning (ML) training data will be artificially generated. This expectation has led to the emergence of various synthetic data generators (SDGs), particularly those based on Generative Adversarial Networks (GANs). Research so far has been exploratory and focused on specific objectives, such as validating utility or disclosure control, or assessing how generators can decrease or increase inherent bias under differential privacy. Hence, we aim to empirically identify an AI-based data generator that produces datasets closely resembling real datasets, and to determine the hyper-parameters that give a satisfactory balance between utility, privacy, and fairness in those datasets. To achieve this, we use three synthetic data generation packages accessible via Python: the Synthetic Data Vault (SDV), Data Synthesizer, and Smartnoise-synth (SN-synth). The data generation models available within these packages were presented iteratively with 13 tabular datasets as sample inputs. We generated synthetic data with every combination of dataset and generator and investigated the goodness of each generator using five hypothetical scenarios. The utility and privacy offered by the generated data were compared with those of the real data, and the fairness of the ML model trained with synthetic data was used as a third evaluation metric. Finally, we used the synthetic data to train regression and classification ML algorithms and evaluated their performance. After conducting experiments, analyzing metrics, and comparing ML scores across all 11 generators, we determined that CTGAN from SDV and PATECTGAN from the SN-synth package were the most effective in mimicking real data for all 13 datasets used in our research.
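
The evaluation idea described above, training a model on synthetic data and scoring it against real data, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: a per-class Gaussian sampler stands in for the SDV, Data Synthesizer, and SN-synth generators, a nearest-centroid classifier stands in for the ML algorithms, and every function name and the toy two-class dataset are hypothetical.

```python
import random
import statistics


def make_real_data(n_per_class, rng):
    """Toy stand-in for a real tabular dataset: two well-separated classes."""
    rows, labels = [], []
    for label, center in ((0, (0.0, 0.0)), (1, (4.0, 4.0))):
        for _ in range(n_per_class):
            rows.append([rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)])
            labels.append(label)
    return rows, labels


def fit_gaussian_generator(rows, labels):
    """Hypothetical generator: per-class mean and std of every column."""
    params = {}
    for label in set(labels):
        cols = list(zip(*[r for r, y in zip(rows, labels) if y == label]))
        params[label] = [(statistics.mean(c), statistics.pstdev(c)) for c in cols]
    return params


def sample_synthetic(params, n_per_class, rng):
    """Draw synthetic rows independently per column from the fitted Gaussians."""
    rows, labels = [], []
    for label, col_params in sorted(params.items()):
        for _ in range(n_per_class):
            rows.append([rng.gauss(mu, sigma) for mu, sigma in col_params])
            labels.append(label)
    return rows, labels


def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store the mean vector of each class."""
    centroids = {}
    for label in set(labels):
        cols = zip(*[r for r, y in zip(rows, labels) if y == label])
        centroids[label] = [statistics.mean(c) for c in cols]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
    )


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


rng = random.Random(0)
train_rows, train_labels = make_real_data(200, rng)
test_rows, test_labels = make_real_data(200, rng)

# Fit the stand-in generator on real training data, then sample synthetic data.
generator = fit_gaussian_generator(train_rows, train_labels)
synth_rows, synth_labels = sample_synthetic(generator, 200, rng)

# Score both models on the same held-out real data.
real_score = accuracy(fit_centroids(train_rows, train_labels), test_rows, test_labels)
synth_score = accuracy(fit_centroids(synth_rows, synth_labels), test_rows, test_labels)

print(f"train on real:      {real_score:.3f}")
print(f"train on synthetic: {synth_score:.3f}")
print(f"utility gap:        {real_score - synth_score:+.3f}")
```

A small gap between the two scores suggests the synthetic data preserves the signal the model needs. Privacy and fairness, the other two criteria named in the abstract, would need their own metrics and are not covered by this sketch.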

A Methodology and an Empirical Analysis to Determine the Most Suitable Synthetic Data Generator
