Abstract: Synthetic data generators, when trained using privacy-preserving techniques like differential privacy, promise to produce synthetic data with formal privacy guarantees, facilitating the sharing of sensitive data. However, it is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies. This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models. Then, this paper explores the practical challenges for privacy evaluations of generative models for use cases with millions of training records, such as data from statistical agencies and healthcare providers. Our findings indicate that methods designed to verify the correct operation of the training algorithm are effective for large datasets, but they often assume an adversary that is unrealistic in many scenarios. Based on the findings, we highlight a crucial trade-off between the computational feasibility of the evaluation and the level of realism of the assumed threat model. Finally, we conclude with ideas and suggestions for future research.

What problem does this paper attempt to address?

This paper addresses how to empirically evaluate the privacy protection offered by generative and predictive machine-learning models, especially synthetic data generators trained with differential privacy (DP) techniques. Specifically, it focuses on the following aspects:

1. **Trust issue**: Before synthetic data generators can be deployed in practice, data owners such as statistical agencies and healthcare providers must trust them. These stakeholders may find the DP-SGD algorithm hard to follow, so intuitive, easy-to-understand explanations and demonstrations of its effectiveness are required.
2. **Countering real-world attacks**: The adversary assumed by DP-SGD does not have access to auxiliary datasets, so the privacy risk in practice may be higher than the analysis suggests. Additional evaluations are therefore needed to probe vulnerabilities in real-world scenarios, for example an adversary who uses external datasets to reconstruct sensitive information.
3. **Gap between theory and practice**: The theoretical bounds of DP-SGD are derived under the assumption of a highly capable adversary, and these assumptions are often too conservative in practical applications. Empirical evaluation can bridge this gap by assessing privacy leakage under realistic conditions, helping stakeholders balance privacy and practicality.
4. **Lack of a standardized evaluation framework**: There are currently no unified standards or methods for evaluating privacy-preserving models (both generative and predictive), which makes results from different studies difficult to compare. A standardized evaluation framework is therefore needed.
5. **Requirement for computational resources**: Large-scale privacy tests are computationally expensive, which poses a challenge for practical use. More efficient evaluation methods need to be explored to reduce computational cost and improve feasibility.

Overall, this paper aims to quantify the privacy protection of generative and predictive machine-learning models through empirical evaluation, especially for large-scale datasets containing millions of records. It explores how to balance computational feasibility against the realism of the assumed threat model, and provides guidance and suggestions for future privacy-evaluation research.
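To make the idea of an empirical privacy evaluation concrete, the following is a minimal sketch of one common building block: turning the success rates of a membership inference attack into a lower bound on the privacy parameter ε. It is not taken from the paper. It relies on the standard hypothesis-testing view of (ε, δ)-DP, under which any attack must satisfy TPR ≤ e^ε · FPR + δ and TNR ≤ e^ε · FNR + δ. The function name and the example rates are hypothetical, and the sketch uses point estimates only; a real audit would replace the observed rates with confidence bounds (for example Clopper-Pearson intervals).

```python
import math


def empirical_epsilon_lower_bound(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Lower-bound epsilon from observed attack rates (hypothetical helper).

    Uses the two inequalities implied by (epsilon, delta)-DP:
        TPR <= exp(eps) * FPR + delta
        TNR <= exp(eps) * FNR + delta
    Point estimates only: no confidence intervals are applied here.
    """
    if not (0.0 <= tpr <= 1.0 and 0.0 <= fpr <= 1.0):
        raise ValueError("tpr and fpr must lie in [0, 1]")
    if not (0.0 <= delta < 1.0):
        raise ValueError("delta must lie in [0, 1)")

    fnr = 1.0 - tpr  # false negative rate
    tnr = 1.0 - fpr  # true negative rate

    bounds = [0.0]  # epsilon is never negative
    # Each (numerator, denominator) pair yields eps >= ln((num - delta) / den).
    for num, den in ((tpr, fpr), (tnr, fnr)):
        if num - delta <= 0.0:
            continue  # this inequality gives no information
        if den == 0.0:
            return math.inf  # no finite epsilon is consistent with the rates
        bounds.append(math.log((num - delta) / den))
    return max(bounds)


if __name__ == "__main__":
    # Hypothetical attack: 50% true positives at 10% false positives.
    print(empirical_epsilon_lower_bound(tpr=0.5, fpr=0.1))
```

An attack that performs no better than random guessing (TPR equal to FPR) yields a bound of 0, which says nothing about the true ε. A bound that exceeds the ε claimed by the training algorithm would instead indicate an implementation error or a violated assumption.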

Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice

Generating Artificial Data for Private Deep Learning

Tunable Privacy Risk Evaluation of Generative Adversarial Networks

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

An Overview of Privacy in Machine Learning

Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data

On Utility and Privacy in Synthetic Genomic Data

Machine Learning for Synthetic Data Generation: A Review

Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains

Boosting Data Analytics With Synthetic Volume Expansion

Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders

On the Challenges of Deploying Privacy-Preserving Synthetic Data in the Enterprise

Privacy-Preserving Synthetic Educational Data Generation

Evaluations of Machine Learning Privacy Defenses are Misleading

Comprehensive Exploration of Synthetic Data Generation: A Survey

A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Systematic Evaluation of Privacy Risks of Machine Learning Models

Predictive privacy: towards an applied ethics of data analytics

Generating tabular datasets under differential privacy