Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice

Flavio Hafner, Chang Sun
2024-11-19
Abstract: Synthetic data generators, when trained using privacy-preserving techniques like differential privacy, promise to produce synthetic data with formal privacy guarantees, facilitating the sharing of sensitive data. However, it is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies. This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models. Then, this paper explores the practical challenges for privacy evaluations of generative models for use cases with millions of training records, such as data from statistical agencies and healthcare providers. Our findings indicate that methods designed to verify the correct operation of the training algorithm are effective for large datasets, but they often assume an adversary that is unrealistic in many scenarios. Based on the findings, we highlight a crucial trade-off between the computational feasibility of the evaluation and the level of realism of the assumed threat model. Finally, we conclude with ideas and suggestions for future research.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to empirically evaluate the privacy protection offered by generative and predictive machine-learning models, especially synthetic data generators trained with differential privacy (DP) techniques. Specifically, it focuses on the following aspects:

1. **Trust issue**: Before synthetic data generators can be deployed in practice, data owners such as statistical agencies and healthcare providers must trust them. These stakeholders may find the DP-SGD (differentially private stochastic gradient descent) algorithm hard to follow, so intuitive explanations and demonstrations are needed to show that it works.
2. **Countering real-world attacks**: The adversary assumed by DP-SGD does not have access to auxiliary datasets, which may lead to higher privacy risk in reality. Additional evaluations are therefore needed to test for vulnerabilities in real-world scenarios; for example, an adversary may use external datasets to reconstruct sensitive information.
3. **Gap between theory and practice**: The theoretical bounds of DP-SGD are derived under the assumption of a highly capable adversary, an assumption that is often too conservative in practical applications. Empirical evaluation can bridge this gap by assessing privacy leakage under realistic conditions, helping stakeholders balance privacy and practicality.
4. **Lack of a standardized evaluation framework**: There are currently no unified standards or methods for evaluating the privacy of generative and predictive models, which makes results hard to compare across studies. A standardized evaluation framework is therefore needed.
5. **Computational resource requirements**: Large-scale privacy tests require substantial computational resources, which is an obstacle in practice. More efficient evaluation methods are therefore needed to reduce computational cost and improve feasibility.

In general, the paper aims to quantify the privacy protection of generative and predictive machine-learning models through empirical evaluation, especially for large datasets with millions of records. It explores how to balance the computational feasibility of an evaluation against the realism of the assumed threat model, and it offers guidance and suggestions for future research on privacy evaluation.
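
To make the idea of comparing empirical leakage against a theoretical bound more concrete, here is a minimal sketch in Python. It is not the paper's method. It assumes a membership inference attack has already been run and has produced a false positive rate and a false negative rate, and it uses the standard hypothesis-testing characterization of (ε, δ)-differential privacy: any such attack must satisfy FPR + e^ε · FNR ≥ 1 − δ and FNR + e^ε · FPR ≥ 1 − δ. The function name `empirical_epsilon_lower_bound` and its parameters are hypothetical, chosen for illustration only.

```python
import math


def _one_sided_bound(a: float, b: float, delta: float) -> float:
    """Solve a + e^eps * b >= 1 - delta for the smallest admissible eps."""
    numerator = 1.0 - delta - a
    if numerator <= 0.0:
        # The constraint holds for every eps >= 0, so it implies nothing.
        return 0.0
    if b == 0.0:
        # No finite eps can satisfy the constraint.
        return math.inf
    return max(0.0, math.log(numerator / b))


def empirical_epsilon_lower_bound(fpr: float, fnr: float, delta: float = 0.0) -> float:
    """Point estimate of the smallest eps consistent with observed attack error rates.

    Assumption: fpr and fnr are treated as exact. A real audit would replace
    them with confidence bounds estimated from a finite number of trials.
    """
    return max(
        _one_sided_bound(fpr, fnr, delta),
        _one_sided_bound(fnr, fpr, delta),
    )


if __name__ == "__main__":
    # An attack no better than random guessing implies no leakage.
    print(empirical_epsilon_lower_bound(0.5, 0.5))
    # A stronger attack implies a larger lower bound on eps.
    print(empirical_epsilon_lower_bound(0.05, 0.05))
```

If the value returned here exceeds the ε claimed by the training procedure, that suggests the implementation or its privacy accounting is wrong. If it is far below the claimed ε, the attack may simply be weaker than the adversary assumed in the theoretical analysis, which is the gap between theory and practice that the summary describes.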