Abstract: Differentially private synthetic data generation (DP-SDG) algorithms are used to release datasets that are structurally and statistically similar to sensitive data while providing formal bounds on the information they leak. However, bugs in algorithms and implementations may cause the actual information leakage to be higher. This prompts the need to verify whether the theoretical guarantees of state-of-the-art DP-SDG implementations also hold in practice. We do so via a rigorous auditing process: we compute the information leakage via an adversary playing a distinguishing game and running membership inference attacks (MIAs). If the leakage observed empirically is higher than the theoretical bounds, we identify a DP violation; if it is non-negligibly lower, the audit is loose.
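The comparison between empirical and theoretical leakage described in the abstract can be illustrated with a minimal sketch. It assumes the standard hypothesis-testing view of (ε, δ)-DP, under which any membership inference attack with false positive rate FPR and false negative rate FNR must satisfy FPR + e^ε · FNR ≥ 1 − δ and FNR + e^ε · FPR ≥ 1 − δ. The function names, the `tolerance` parameter, and the use of point estimates are illustrative assumptions, not the paper's implementation; a rigorous audit would replace the point estimates with confidence bounds on FPR and FNR.

```python
import math


def empirical_epsilon(fpr: float, fnr: float, delta: float = 0.0) -> float:
    """Lower-bound epsilon from an attack's error rates (point estimates only).

    Rearranges the (eps, delta)-DP hypothesis-testing constraints:
        eps >= log((1 - delta - FPR) / FNR)
        eps >= log((1 - delta - FNR) / FPR)
    """
    candidates = []
    for a, b in ((fpr, fnr), (fnr, fpr)):
        numerator = 1.0 - delta - a
        if numerator <= 0.0:
            continue  # this constraint gives no information
        if b == 0.0:
            return math.inf  # the attack never makes this error: unbounded leakage
        candidates.append(math.log(numerator / b))
    # An attack no better than random guessing yields a bound of zero.
    return max(0.0, max(candidates, default=0.0))


def audit_verdict(eps_emp: float, eps_theory: float, tolerance: float = 0.1) -> str:
    """Classify an audit; `tolerance` is a hypothetical cut-off for 'non-negligibly lower'."""
    if eps_emp > eps_theory:
        return "violation"
    if eps_theory - eps_emp <= tolerance:
        return "tight"
    return "loose"


if __name__ == "__main__":
    eps_emp = empirical_epsilon(fpr=0.1, fnr=0.1, delta=1e-5)
    print(eps_emp, audit_verdict(eps_emp, eps_theory=1.0))
```

In this sketch an attack that errs 10% of the time in each direction implies an empirical ε of roughly log 9, so an implementation claiming ε = 1 would be flagged as a violation.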

What problem does this paper attempt to address?

This paper aims to verify whether state-of-the-art differentially private synthetic data generation (DP-SDG) implementations deliver, in practice, the privacy protection that their theoretical guarantees promise. Specifically, the authors focus on:

1. **Implementation issues of differential privacy**: Differential privacy provides strict privacy guarantees in theory, but bugs in algorithms and implementations can cause the actual information leakage to exceed the theoretical upper bound. A method is therefore needed to verify whether these implementations actually provide the promised protection.
2. **Effectiveness of auditing methods**: Existing auditing methods usually rely on black-box attacks, in which the adversary can access only the synthetic data and not the internal parameters of the trained generative model. Such methods may not estimate the actual privacy leakage accurately, so stronger threat models (such as white-box attacks) need to be explored to obtain a more accurate estimate.
3. **Impact of worst-case datasets**: For the audit to be rigorous, it must consider worst-case datasets as well as average-case ones, because certain datasets maximize privacy leakage and thereby reveal potential privacy vulnerabilities.

### Main research questions of the paper

1. **How can the privacy leakage of DP-SDG be estimated tightly?**
2. **How do different threat models and datasets affect the tightness of the privacy leakage estimate?**

### Experimental design and main findings

To answer these questions, the authors designed an experimental framework that audits multiple DP-SDG implementations using different membership inference attacks (MIAs) across different datasets and threat models.

The main findings include:

- **Common black-box attacks, such as the Distance to Closest Record (DCR) heuristic, are ineffective against DP-SDG**.
- **White-box and active white-box attacks provide a more accurate privacy leakage estimate, especially when combined with carefully designed worst-case datasets**.
- **The optimal audit setting can differ between implementations**. For example, passive white-box auditing is effective for PrivBayes and MST, while DPWGAN requires active white-box attacks.
- **Known DP violations were found in four implementations, and new violations were found in a further implementation (DPWGAN)**.

### Contributions

1. **The first large-scale audit of DP-SDG algorithms and their implementations**.
2. **Implementation-specific worst-case datasets for DP-SDG, which make the audit more rigorous**.
3. **The first white-box MIAs for PrivBayes and MST**.

Through these contributions, the paper improves the understanding of how much privacy protection DP-SDG algorithms actually provide and offers a reference point for future research.
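For intuition about the black-box baseline named among the findings, the DCR heuristic can be sketched as follows. This is a minimal illustration that assumes numeric records compared with Euclidean distance; the function names and the `threshold` parameter are hypothetical, and the paper's own attack may differ in its distance metric and calibration.

```python
import math
from typing import Sequence

Record = Sequence[float]


def distance_to_closest_record(target: Record, synthetic: Sequence[Record]) -> float:
    """Return the Euclidean distance from `target` to its nearest synthetic record."""
    return min(math.dist(target, row) for row in synthetic)


def dcr_membership_guess(target: Record, synthetic: Sequence[Record], threshold: float) -> bool:
    """Guess 'member' when some synthetic record lies within `threshold` of the target.

    The attack only inspects the released synthetic data (black-box access);
    `threshold` is an assumed, attacker-chosen calibration parameter.
    """
    return distance_to_closest_record(target, synthetic) <= threshold


if __name__ == "__main__":
    synthetic_data = [(0.0, 0.0), (3.0, 4.0)]
    target_record = (0.0, 1.0)
    print(distance_to_closest_record(target_record, synthetic_data))
    print(dcr_membership_guess(target_record, synthetic_data, threshold=1.0))
```

Because the guess depends only on proximity in the released data, it uses none of the generative model's internal state, which is consistent with the summary's point that white-box attacks are needed for tighter leakage estimates.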

"What do you want from theory alone?" Experimenting with Tight Auditing of Differentially Private Synthetic Data Generation

Nearly Tight Black-Box Auditing of Differentially Private Machine Learning

Privacy Vulnerabilities in Marginals-based Synthetic Data

Tighter Privacy Auditing of DP-SGD in the Hidden State Threat Model

Tight Auditing of Differentially Private Machine Learning

Auditing Private Prediction

Auditing $f$-Differential Privacy in One Run

Auditing Differential Privacy Guarantees Using Density Estimation

A General Framework for Auditing Differentially Private Machine Learning

Revealing the True Cost of Locally Differentially Private Protocols: An Auditing Perspective

To Shuffle or not to Shuffle: Auditing DP-SGD with Shuffling

Closed-Form Bounds for DP-SGD against Record-level Inference

Auditing Differentially Private Machine Learning: How Private is Private SGD?

Debugging Differential Privacy: A Case Study for Privacy Auditing

Synthesizing Tight Privacy and Accuracy Bounds via Weighted Model Counting

Adversarial Sample-Based Approach for Tighter Privacy Auditing in Final Model-Only Scenarios

DPMLBench: Holistic Evaluation of Differentially Private Machine Learning

Securely Sampling Discrete Gaussian Noise for Multi-Party Differential Privacy

The Last Iterate Advantage: Empirical Auditing and Principled Heuristic Analysis of Differentially Private SGD

PrivSyn: Differentially Private Data Synthesis

Unleashing the Power of Randomization in Auditing Differentially Private ML