Abstract: Synthetic data provides a privacy-protecting mechanism for the broad use and sharing of healthcare data for secondary purposes. It is considered a safe approach to sharing sensitive data because it generates an artificial dataset that contains no identifiable information. Synthetic data is growing in popularity, with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper evaluates the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and of the models derived from it. Specifically, we investigate (i) the effect of data pre-processing on the utility of the generated synthetic data, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when they are employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for calculating propensity, which carefully considers the choice of model used in the propensity score calculation. Accordingly, this paper takes a new direction by investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to inform the choice of strategies to follow when generating and using synthetic data.
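For readers unfamiliar with the propensity score as a utility measure, the following is a minimal sketch of one common formulation, the propensity mean squared error (pMSE). It is an illustration only and not the calculation mechanism adopted in the paper: the plain logistic regression model, the function names (`fit_logistic`, `pmse`), the learning rate, and the toy Gaussian data are all assumptions made for the example. The idea is to pool real and synthetic records, train a classifier to tell them apart, and measure how far the predicted probabilities sit from the share of synthetic records in the pool. A value near 0 means the two datasets are hard to distinguish.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent (assumed model choice).

    Returns (weights, bias).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def pmse(real, synthetic):
    """Propensity mean squared error between two lists of numeric records.

    Real rows are labelled 0 and synthetic rows 1. The score is the mean
    squared distance between each predicted propensity and c, the share
    of synthetic rows in the pooled data.
    """
    X = real + synthetic
    y = [0] * len(real) + [1] * len(synthetic)
    c = len(synthetic) / len(X)
    w, b = fit_logistic(X, y)
    scores = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
    return sum((p - c) ** 2 for p in scores) / len(X)


if __name__ == "__main__":
    random.seed(0)
    # Toy data (hypothetical): a "real" sample and a poorly matched "synthetic" one.
    real = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
    poor = [[random.gauss(3, 1), random.gauss(3, 1)] for _ in range(200)]
    print("pMSE, identical copy:", pmse(real, [row[:] for row in real]))
    print("pMSE, shifted sample:", pmse(real, poor))
```

With equal-sized real and synthetic sets, c is 0.5 and the pMSE is bounded above by 0.25, so scores can be read on that scale. Results depend on the classifier used to estimate the propensities, which is why the choice of model matters.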
