Abstract:

Introduction: The amount of data generated by original research is growing exponentially. Releasing these data publicly is recommended to comply with Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.

Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.

Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling out and linkability attacks. Distributional and inferential similarity to the original data was high for all datasets.

Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to this field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods.

Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

Bayesian Estimation of Attribute Disclosure Risks in Synthetic Data with the $\texttt{AttributeRiskCalculation}$ R Package

Data Privacy Protection and Utility Preservation through Bayesian Data Synthesis: A Case Study on Airbnb Listings

Using saturated count models for user-friendly synthesis of categorical data

Synthetic Census Microdata Generation: A Comparative Study of Synthesis Methods Examining the Trade-Off Between Disclosure Risk and Utility

Private Tabular Survey Data Products Through Synthetic Microdata Generation

Generation and analysis of synthetic data via Bayesian networks: a robust approach for uncertainty quantification via Bayesian paradigm

Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data

Assessing Statistical Disclosure Risk for Differentially Private, Hierarchical Count Data, with Application to the 2020 U.S. Decennial Census

Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method

Disclosure risk assessment with Bayesian non-parametric hierarchical modelling

Practical privacy metrics for synthetic data

Statistical disclosure control for numeric microdata via sequential joint probability preserving data shuffling

Differentially Private Verification of Survey-Weighted Estimates

Balancing data privacy and usability in the federal statistical system

Quantifying Privacy Risks of Public Statistics to Residents of Subsidized Housing

A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications

A Tutorial in Assessing Disclosure Risk in Microdata

Advancing microdata privacy protection: A review of synthetic data methods

Two-Phase Data Synthesis for Income: An Application to the NHIS