DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

Utsab Saha, Tanvir Muntakim Tonoy, Hafiz Imtiaz
2024-11-25
Abstract: In recent years, the growth of data across various sectors, including healthcare, security, finance, and education, has created significant opportunities for analysis and informed decision-making. However, these datasets often contain sensitive and personal information, which raises serious privacy concerns. Protecting individual privacy is crucial, yet many existing machine learning and data publishing algorithms struggle with high-dimensional data, facing challenges related to computational efficiency and privacy preservation. To address these challenges, we introduce an effective data publishing algorithm, DP-CDA. Our proposed algorithm generates synthetic datasets by randomly mixing data in a class-specific manner and inducing carefully tuned randomness to ensure formal privacy guarantees. Our comprehensive privacy accounting shows that DP-CDA provides a stronger privacy guarantee than existing methods, allowing for better utility while maintaining a strict level of privacy. To evaluate the effectiveness of DP-CDA, we examine the accuracy of predictive models trained on the synthetic data, which serves as a measure of dataset utility. Importantly, we identify an optimal order of mixing that balances the privacy guarantee with predictive accuracy. Our results indicate that synthetic datasets produced using DP-CDA can achieve superior utility compared to those generated by traditional data publishing algorithms, even when subject to the same privacy requirements.
Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to efficiently generate synthetic datasets for machine-learning model training while protecting privacy. As data volumes grow in fields such as healthcare, security, finance, and education, datasets increasingly contain sensitive and personal information, which raises serious privacy concerns. Many existing machine-learning and data-publishing algorithms struggle with computational efficiency and privacy protection when dealing with high-dimensional data. To address these problems, the authors propose a new data-publishing algorithm, DP-CDA (Differentially Private Class-Centric Data Aggregation). The algorithm enhances privacy protection and improves data utility in two ways:

1. **Randomly mix data within a class**: Randomly select multiple data samples from the same class and mix them.
2. **Introduce carefully tuned randomness**: Add Gaussian noise to the mixed data to ensure formal privacy guarantees.

The key contributions of the paper include:

- A privacy accounting that yields a stronger privacy guarantee than existing methods.
- A study of how the mixing order \( l \) affects model performance, identifying the optimal mixing order \( l^* \) at which the model performs best for a given dataset and privacy level.
- Theoretical analysis and experiments indicating that DP-CDA maintains high data utility while providing stronger privacy protection, outperforming existing methods.
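The two steps listed above (mix samples within a class, then add Gaussian noise) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `mix_class_samples`, the choice to sample without replacement, and the toy data are all assumptions, and `sigma` is taken as a given input rather than derived from a privacy budget.

```python
import random


def mix_class_samples(samples, order, sigma, rng):
    """Average `order` randomly chosen samples and add Gaussian noise.

    `samples` is a list of equal-length feature vectors from ONE class.
    Sampling without replacement is an assumption of this sketch.
    """
    if not 1 <= order <= len(samples):
        raise ValueError("order must be between 1 and the class size")
    chosen = rng.sample(samples, order)  # pick `order` distinct samples
    dim = len(chosen[0])
    mixed = []
    for d in range(dim):
        mean_d = sum(x[d] for x in chosen) / order  # per-feature average
        mixed.append(mean_d + rng.gauss(0.0, sigma))  # add N(0, sigma^2) noise
    return mixed


# Hypothetical toy data: four 2-dimensional samples from a single class.
rng = random.Random(0)
class_a = [[0.0, 1.0], [0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
synthetic = mix_class_samples(class_a, order=2, sigma=0.1, rng=rng)
```

Calling the function repeatedly, once per desired synthetic sample and per class, would build up a synthetic dataset. A larger `order` averages over more samples, which is the quantity the paper tunes when it looks for the optimal mixing order.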
The following are some important formulas involved in the paper:

### Definition of Differential Privacy

An algorithm \( f \) satisfies \((\epsilon, \delta)\)-DP if, for all neighboring datasets \( D \) and \( D' \) and all measurable sets \( S \) of outputs:

\[ \text{Pr}(f(D) \in S) \leq \delta + e^\epsilon \cdot \text{Pr}(f(D') \in S) \]

where \( \epsilon > 0 \) and \( 0 < \delta < 1 \) are privacy parameters that determine the trade-off between the privacy and the utility provided by the algorithm.

### Gaussian Mechanism

The Gaussian mechanism is defined as:

\[ G_\sigma f(D) = f(D) + e, \quad e \sim \mathcal{N}(0, \sigma^2 I_d) \]

A sufficient condition for it to satisfy \((\epsilon, \delta)\)-DP (in the classical analysis, for \( \epsilon < 1 \)) is:

\[ \sigma \geq \frac{\Delta}{\epsilon} \sqrt{2 \log \left(\frac{1.25}{\delta}\right)} \]

where \( \Delta \) is the \( \ell_2 \) sensitivity of \( f \).

### Synthetic Data Generation Process

The synthetic sample \( \tilde{x}^{(k)}_t \) is generated as:

\[ \tilde{x}^{(k)}_t = \frac{1}{l} \sum_{j = 1}^l x_{i_j} + n_x, \quad n_x \sim \mathcal{N}(0, \sigma_x^2 I_d) \]

where \( i_1, \ldots, i_l \) index the \( l \) samples selected for mixing. The corresponding synthetic one-hot label is generated as:

\[ \tilde{y}^{\text{one-hot}(k)}_t = \frac{1}{l} \sum_{j = 1}^l y^{\text{one-hot}}_{i_j} + n_y, \quad n_y \sim \mathcal{N}(0, \sigma_y^2 I_K) \]

The one-hot encoding is then converted to an integer label:

\[ \tilde{y}^{(k)}_t = \arg \max_{i \in \{0, 1, \ldots, K - 1\}} \tilde{y}^{\text{one-hot}(k)}_t[i] \]

In summary, this paper aims to generate high-quality synthetic datasets with the DP-CDA algorithm while protecting privacy, thereby effectively supporting machine-learning tasks.
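To make the formulas above concrete, the following Python sketch evaluates the classical Gaussian-mechanism bound on \( \sigma \) and applies the noisy label averaging followed by the argmax step. The function names `gaussian_sigma` and `noisy_label` and the toy inputs are assumptions made for illustration. Note that this is the textbook bound quoted above, not the paper's own privacy accounting, which may calibrate the noise differently.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Smallest sigma allowed by the classical Gaussian-mechanism bound.

    The classical analysis behind this bound assumes epsilon < 1.
    """
    if not (epsilon > 0 and 0 < delta < 1):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return (sensitivity / epsilon) * math.sqrt(2.0 * math.log(1.25 / delta))


def noisy_label(one_hot_labels, sigma_y, rng):
    """Average one-hot labels, add Gaussian noise, return the argmax index."""
    order = len(one_hot_labels)  # number of mixed samples (l in the text)
    num_classes = len(one_hot_labels[0])  # K in the text
    scores = [
        sum(y[i] for y in one_hot_labels) / order + rng.gauss(0.0, sigma_y)
        for i in range(num_classes)
    ]
    # Integer label in {0, ..., K - 1} with the largest noisy score.
    return max(range(num_classes), key=lambda i: scores[i])


# Hypothetical example: unit sensitivity, epsilon = 0.5, delta = 1e-5.
sigma = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)

# Two mixed samples that both belong to class 1 out of K = 3 classes.
label = noisy_label([[0, 1, 0], [0, 1, 0]], sigma_y=0.05, rng=random.Random(0))
```

Because mixing is class-specific, the averaged one-hot vector already points at a single class before noise is added, so the argmax recovers that class unless the label noise is large relative to the gap between entries.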