Abstract: In recent years, the growth of data across various sectors, including healthcare, security, finance, and education, has created significant opportunities for analysis and informed decision-making. However, these datasets often contain sensitive and personal information, which raises serious privacy concerns. Protecting individual privacy is crucial, yet many existing machine learning and data publishing algorithms struggle with high-dimensional data, facing challenges related to computational efficiency and privacy preservation. To address these challenges, we introduce an effective data publishing algorithm \emph{DP-CDA}. Our proposed algorithm generates synthetic datasets by randomly mixing data in a class-specific manner and inducing carefully tuned randomness to ensure formal privacy guarantees. Our comprehensive privacy accounting shows that DP-CDA provides a stronger privacy guarantee than existing methods, allowing for better utility while maintaining a strict level of privacy. To evaluate the effectiveness of DP-CDA, we examine the accuracy of predictive models trained on the synthetic data, which serves as a measure of dataset utility. Importantly, we identify an optimal order of mixing that balances the privacy guarantee with predictive accuracy. Our results indicate that synthetic datasets produced using DP-CDA can achieve superior utility compared to those generated by traditional data publishing algorithms, even when subject to the same privacy requirements.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is how to efficiently generate synthetic datasets for training machine-learning models while protecting privacy. As data volumes grow in fields such as healthcare, security, finance, and education, datasets increasingly contain sensitive and personal information, which raises serious privacy concerns. Many existing machine-learning and data-publishing algorithms face challenges in computational efficiency and privacy protection when dealing with high-dimensional data.

To address these problems, the authors propose a new data-publishing algorithm, DP-CDA (Differentially Private Class-Centric Data Aggregation). The algorithm enhances privacy protection and improves data utility in the following ways:

1. **Randomly mix data within a class**: Randomly select multiple data samples from the same class and mix them.
2. **Introduce carefully tuned randomness**: Add Gaussian noise to the mixed data to ensure formal privacy guarantees.

The key contributions of the paper include:

- Proposing a more stringent privacy-guarantee analysis.
- Studying the influence of the mixing order \( l \) on model performance and finding the optimal mixing order \( l^* \), at which the model achieves its best performance for a given dataset and privacy level.
- Showing, through theoretical analysis and experimental verification, that DP-CDA can maintain high data utility while providing stronger privacy protection, outperforming existing methods.

The following are some important formulas from the paper:

### Definition of Differential Privacy

An algorithm \( f \) satisfies \((\epsilon, \delta)\)-differential privacy if, for all neighboring datasets \( D \) and \( D' \) (differing in a single record) and every set of outputs \( S \):

\[ \text{Pr}(f(D) \in S) \leq \delta + e^\epsilon \cdot \text{Pr}(f(D') \in S) \]

where \( \epsilon > 0 \) and \( 0 < \delta < 1 \) are privacy parameters that determine the trade-off between the privacy and the utility provided by the algorithm.

### Gaussian Mechanism

The Gaussian mechanism adds isotropic Gaussian noise to the output of \( f \):

\[ G_\sigma f(D) = f(D) + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_d) \]

It satisfies \((\epsilon, \delta)\)-DP when the noise scale satisfies the following condition, where \( \Delta \) is the \( \ell_2 \)-sensitivity of \( f \):

\[ \sigma \geq \frac{\Delta}{\epsilon} \sqrt{2 \log \left(\frac{1.25}{\delta}\right)} \]

### Synthetic Data Generation Process

The synthetic sample \( \tilde{x}^{(k)}_t \) is generated by averaging \( l \) selected samples and adding noise:

\[ \tilde{x}^{(k)}_t = \frac{1}{l} \sum_{j = 1}^l x_{ij} + n_x, \quad n_x \sim \mathcal{N}(0, \sigma_x^2 I_d) \]

The corresponding synthetic label is generated in one-hot form:

\[ \tilde{y}^{\text{one-hot}(k)}_t = \frac{1}{l} \sum_{j = 1}^l y^{\text{one-hot}}_{ij} + n_y, \quad n_y \sim \mathcal{N}(0, \sigma_y^2 I_K) \]

The one-hot encoding is then converted to an integer label:

\[ \tilde{y}^{(k)}_t = \arg \max_{i \in \{0, 1, \ldots, K - 1\}} \tilde{y}^{\text{one-hot}(k)}_t[i] \]

In summary, this paper aims to generate high-quality synthetic datasets with the DP-CDA algorithm while protecting privacy, thereby effectively supporting machine-learning tasks.
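
The formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `gaussian_sigma` and `synthesize_class`, the choice to sample the \( l \) records without replacement, and the toy data are all assumptions made for illustration. The noise scales `sigma_x` and `sigma_y` are taken as inputs, because the privacy accounting that maps them to an overall \((\epsilon, \delta)\) guarantee for DP-CDA is not reproduced in this summary.

```python
import math
import numpy as np


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Smallest sigma permitted by the Gaussian-mechanism condition quoted above."""
    return (sensitivity / epsilon) * math.sqrt(2.0 * math.log(1.25 / delta))


def synthesize_class(X_k, k, num_classes, l, T, sigma_x, sigma_y, rng):
    """Hypothetical helper: build T synthetic (feature, label) pairs for class k.

    X_k holds the real samples of class k, one row per sample.
    """
    n_k, d = X_k.shape
    one_hot = np.zeros(num_classes)
    one_hot[k] = 1.0
    X_syn = np.empty((T, d))
    y_syn = np.empty(T, dtype=int)
    for t in range(T):
        # Assumption: pick l distinct samples of class k uniformly at random.
        idx = rng.choice(n_k, size=l, replace=False)
        # Average the selected samples and add Gaussian noise n_x.
        X_syn[t] = X_k[idx].mean(axis=0) + rng.normal(0.0, sigma_x, size=d)
        # All mixed samples share class k, so their mean one-hot label equals
        # one_hot; add Gaussian noise n_y and take the arg max.
        y_noisy = one_hot + rng.normal(0.0, sigma_y, size=num_classes)
        y_syn[t] = int(np.argmax(y_noisy))
    return X_syn, y_syn


# Toy usage: 50 samples of class 2 with 4 features, mixing order l = 8.
rng = np.random.default_rng(0)
X_k = rng.uniform(0.0, 1.0, size=(50, 4))
X_syn, y_syn = synthesize_class(
    X_k, k=2, num_classes=3, l=8, T=5, sigma_x=0.1, sigma_y=0.1, rng=rng
)
sigma = gaussian_sigma(sensitivity=1.0, epsilon=1.0, delta=1e-5)
```

In this sketch, a larger `l` averages more records into each synthetic sample, which is the quantity the paper tunes when it searches for the optimal mixing order \( l^* \).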

DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

PrivSyn: Differentially Private Data Synthesis

Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown

Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data

DPMLBench: Holistic Evaluation of Differentially Private Machine Learning

Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder

Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets

Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets

A Novel Privacy Preserving Method for Data Publication

SoK: Privacy-Preserving Data Synthesis

Differentially Private Synthetic Data with Private Density Estimation

pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity

Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

Plausible deniability for privacy-preserving data synthesis

Noise-Aware Statistical Inference with Differentially Private Synthetic Data

A Data Synthesis Approach Based on Local Differential Privacy

CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources

Improving Privacy and Utility in Aggregate Data: A Hybrid Approach