30 Years of Synthetic Data

Joerg Drechsler, Anna-Carolina Haensch
2023-04-05
Abstract: The idea of generating synthetic data as a tool for broadening access to sensitive microdata was first proposed three decades ago. While first applications of the idea emerged around the turn of the century, the approach really gained momentum over the last ten years, stimulated at least in part by some recent developments in computer science. We consider the upcoming 30th jubilee of Rubin's seminal paper on synthetic data (Rubin, 1993) as an opportunity to look back at the historical developments, but also to offer a review of the diverse approaches and methodological underpinnings proposed over the years. We will also discuss the various strategies that have been suggested to measure the utility and remaining risk of disclosure of the generated data.
Cryptography and Security
What problem does this paper attempt to address?
This paper examines how synthetic data has developed and been applied over the past three decades as a tool for broadening access to sensitive microdata, and what methodological basis it rests on. Specifically, it reviews the historical development of synthetic data, discusses different synthesis strategies, and surveys methods for measuring the utility and disclosure risk of the generated data. The paper also emphasizes that, in a data-driven world, the growing availability and storage of data have raised concerns about confidentiality and privacy, and that synthetic data is becoming increasingly important as a way to balance broad data access against disclosure protection.

### Main problems of the paper

1. **Historical development of synthetic data**:
   - Review how the concept of synthetic data was proposed and how it was first applied.
   - Discuss the different paths that synthetic data has taken in statistics and in computer science.
2. **Methodological basis of synthetic data**:
   - Introduce the main types of synthetic data, such as fully synthetic and partially synthetic data.
   - Discuss how to obtain valid inferences from multiple synthetic data sets, including combining rules and multivariate analysis.
3. **Utility and disclosure risk of synthetic data**:
   - Explore strategies for measuring the utility and the remaining disclosure risk of synthetic data.
   - Analyze how different methods perform in practice, especially the trade-off between protecting privacy and providing useful data access.
4. **Future development directions**:
   - Discuss how technologies such as verification servers can make synthetic data more useful in practice.
   - Consider the potential of synthetic data in future research and applications.
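
The combining rules mentioned under item 2 can be illustrated with a small sketch. It assumes the standard rules from the multiple-synthesis literature: for `m` synthetic data sets with point estimates `q_i` and estimated variances `u_i`, the combined estimate is the mean of the `q_i`, and the variance estimate is `u_bar + b/m` for partially synthetic data and `(1 + 1/m) * b - u_bar` for fully synthetic data, where `b` is the between-data-set variance. The function name `combine_synthetic` and its arguments are hypothetical and chosen only for illustration; the summary itself does not spell out the formulas.

```python
from statistics import mean, variance


def combine_synthetic(estimates, variances, fully_synthetic=False):
    """Combine results from m synthetic data sets (illustrative sketch).

    estimates: point estimates q_i, one per synthetic data set
    variances: estimated variances u_i of those point estimates
    Returns (combined point estimate, combined variance estimate).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two data sets and one variance per estimate")

    q_bar = mean(estimates)   # combined point estimate
    b = variance(estimates)   # between-data-set variance (divisor m - 1)
    u_bar = mean(variances)   # average within-data-set variance

    if fully_synthetic:
        # Assumed rule for fully synthetic data; this estimate can be
        # negative for small m, in which case adjustments are needed.
        total = (1 + 1 / m) * b - u_bar
    else:
        # Assumed rule for partially synthetic data.
        total = u_bar + b / m
    return q_bar, total


if __name__ == "__main__":
    q, t = combine_synthetic([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
    print(q, t)
```

The two branches differ only in how the between-data-set variance enters, which is why an analyst has to know which kind of synthesis produced the data before combining results.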
### Specific problems

- **Historical review**: The paper reviews in detail how the concept of synthetic data was proposed and first applied in statistics and computer science, with particular attention to developments since Rubin and Little proposed the concept in 1993.
- **Methodological basis**: Introduces methods for generating synthetic data, covering both fully synthetic and partially synthetic data, and discusses how to obtain valid statistical inferences from multiple synthetic data sets.
- **Utility and risk assessment**: Explores methods for measuring the utility and disclosure risk of synthetic data, including techniques based on differential privacy.
- **Practical applications**: Lists applications of synthetic data in various fields, such as healthcare and self-driving vehicles.
- **Future prospects**: Discusses potential future directions for synthetic data, including technological improvements and new application scenarios.

By working through these problems, the paper aims to give a comprehensive review of how synthetic data has developed, to assess how well it works in current applications, and to outline its future prospects.
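
To make the idea of measuring utility concrete, here is a minimal sketch of one analysis-specific measure, the confidence interval overlap between an estimate from the original data and the same estimate from the synthetic data. This measure is an assumption chosen for illustration; the summary above does not state which utility measures the paper emphasizes. The function name `ci_overlap` and the example intervals are hypothetical.

```python
def ci_overlap(original_ci, synthetic_ci):
    """Average relative overlap of two confidence intervals.

    Each argument is a (lower, upper) pair. A value of 1.0 means the
    intervals coincide; values at or below 0 mean they do not overlap.
    """
    lo_o, hi_o = original_ci
    lo_s, hi_s = synthetic_ci
    if hi_o <= lo_o or hi_s <= lo_s:
        raise ValueError("each interval must satisfy lower < upper")

    # Length of the intersection (negative if the intervals are disjoint).
    intersection = min(hi_o, hi_s) - max(lo_o, lo_s)

    # Average the overlap relative to each interval's own length.
    return 0.5 * (intersection / (hi_o - lo_o) + intersection / (hi_s - lo_s))


if __name__ == "__main__":
    print(ci_overlap((0.0, 2.0), (1.0, 3.0)))
```

A measure like this judges the synthetic data by how closely one specific analysis is reproduced, in contrast to global measures that compare the whole distributions of the original and synthetic data.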