Abstract: The idea to generate synthetic data as a tool for broadening access to sensitive microdata has been proposed for the first time three decades ago. While first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in parts by some recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data.

What problem does this paper attempt to address?

This paper reviews how synthetic data has developed over the past three decades as a tool for expanding access to sensitive microdata: its history, its applications, and its methodological basis. Specifically, it traces the historical development of synthetic data, discusses different synthesis strategies, and surveys methods for measuring the utility and disclosure risk of the generated data. The paper also emphasizes that, in a data-driven world, the growing availability and storage of data have raised concerns about confidentiality and privacy, and that synthetic data is becoming increasingly important as a way to balance broad data access against disclosure protection.

### Main problems of the paper

1. **Historical development of synthetic data**:
   - Review the proposal of the concept of synthetic data and its early applications.
   - Discuss the different development paths of synthetic data in statistics and computer science.
2. **Methodological basis of synthetic data**:
   - Introduce the different types of synthetic data, such as fully synthetic and partially synthetic data.
   - Discuss how to obtain valid inferences from multiple synthetic data sets, including combining rules and multivariate analysis.
3. **Utility and disclosure risk of synthetic data**:
   - Explore strategies for measuring the utility and residual disclosure risk of synthetic data.
   - Analyze how different methods perform in practice, especially the trade-off between protecting privacy and providing useful data access.
4. **Future development directions**:
   - Discuss how technologies such as verification servers can make synthetic data more useful in practice.
   - Consider the potential of synthetic data in future research and applications.

### Specific problems

- **Historical review**: The paper reviews in detail the proposal of the concept of synthetic data and its early applications in statistics and computer science, especially the development since Rubin and Little proposed the concept of synthetic data in 1993.
- **Methodological basis**: Introduces methods for generating synthetic data, both fully synthetic and partially synthetic, and discusses how to obtain valid statistical inferences from multiple synthetic data sets.
- **Utility and risk assessment**: Explores methods for measuring the utility and disclosure risk of synthetic data, including techniques based on differential privacy.
- **Practical applications**: Lists practical applications of synthetic data in fields such as healthcare and self-driving.
- **Future prospects**: Discusses potential future directions for synthetic data, including technological improvements and new application scenarios.

By exploring these problems, the paper aims to review the development of synthetic data comprehensively, evaluate how well it currently works in practice, and look ahead to its future prospects.
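To make the point about valid inferences from multiple synthetic data sets more concrete, here is a minimal sketch of the combining rules as they are commonly stated in the multiply-imputed synthetic data literature. It is not taken from the paper itself. The function name `combine_synthetic` and its arguments are hypothetical, and the sketch assumes a scalar estimand with one point estimate and one variance estimate per synthetic data set.

```python
from statistics import mean, variance


def combine_synthetic(estimates, variances, fully_synthetic=False):
    """Combine point estimates from m synthetic data sets (hypothetical helper).

    estimates: point estimate q_i from each synthetic data set
    variances: estimated variance u_i of q_i from each synthetic data set
    Returns (q_bar, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two data sets and matching lengths")

    q_bar = mean(estimates)   # combined point estimate
    b = variance(estimates)   # between-data-set variance (divisor m - 1)
    u_bar = mean(variances)   # average within-data-set variance

    if fully_synthetic:
        # Assumed rule for fully synthetic data: T_f = (1 + 1/m) * b - u_bar.
        # This estimate can come out negative in small samples.
        total = (1 + 1 / m) * b - u_bar
    else:
        # Assumed rule for partially synthetic data: T_p = u_bar + b / m.
        total = u_bar + b / m
    return q_bar, total


if __name__ == "__main__":
    q, t = combine_synthetic([1.0, 2.0, 3.0], [0.3, 0.3, 0.3])
    print(q, t)
```

The two variance formulas differ from the one used for multiple imputation of missing data, which is one reason analysts need to know which type of synthetic data they have been given.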

30 Years of Synthetic Data

Artificial Data, Real Insights: Evaluating Opportunities and Risks of Expanding the Data Ecosystem with Synthetic Data

Advancing microdata privacy protection: A review of synthetic data methods

Synthetic Data: Methods, Use Cases, and Risks

To democratize research with sensitive data, we should make synthetic data more accessible

Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

Synthetic Census Microdata Generation: A Comparative Study of Synthesis Methods Examining the Trade-Off Between Disclosure Risk and Utility

Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method

Synthetic data in health care: A narrative review

Getting real about synthetic data ethics

Multiply-Imputed Synthetic Data: Advice to the Imputer

Boosting Data Analytics With Synthetic Volume Expansion

A primer on synthetic health data

GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources

Enabling Synthetic Data adoption in regulated domains

Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results

Privacy risk from synthetic data: practical proposals

Synthetic data & the future of Women's Health: A synergistic relationship

Synthetic data in biomedicine via generative artificial intelligence