Abstract: A large amount of personal health data that is highly valuable to the scientific community is still inaccessible, or requires a lengthy request process, because of privacy concerns and legal restrictions. Synthetic data has been studied and proposed as a promising alternative. However, generating realistic and privacy-preserving synthetic personal health data remains challenging: it requires simulating the characteristics of patients in the minority classes, capturing the relations among variables in imbalanced data and transferring them to the synthetic data, and preserving the privacy of individual patients. In this paper, we propose a differentially private conditional Generative Adversarial Network model (DP-CGANS) consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving personal data. Our model distinguishes categorical and continuous variables and transforms them into latent space separately for better training performance. We tackle the unique challenges of generating synthetic patient data that arise from the special characteristics of personal health data. For example, patients with a certain disease are typically the minority in the dataset, and it is crucial to capture the relations among variables. Our model takes a conditional vector as an additional input to represent the minority class in the imbalanced data and to capture as much of the dependency between variables as possible. Moreover, we inject statistical noise into the gradients during the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model against state-of-the-art generative models on personal socio-economic datasets and real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing the dependence between variables. Finally, we present the balance between data utility and privacy in synthetic data generation, considering the different data structures and characteristics of real-world personal health data such as imbalanced classes, abnormal distributions, and data sparsity.
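
The noise injection mentioned in the abstract can be illustrated with a minimal sketch of a generic clip-and-noise gradient step in the style of DP-SGD. This is not the DP-CGANS implementation: the function names, the clipping bound `max_norm`, and the `noise_multiplier` parameter are assumptions introduced here for illustration, and the sketch does not track the resulting privacy budget.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum them, add Gaussian noise, then average.

    Hypothetical helper: the noise standard deviation is tied to the clipping
    bound, as is usual for DP-SGD style training.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Toy per-example gradients (made-up values, two parameters each).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single record can change the update, which is what allows the added Gaussian noise to be calibrated to a fixed sensitivity. A larger `noise_multiplier` gives stronger privacy at the cost of a noisier update, which is the utility and privacy balance the abstract refers to.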

SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy

Differentially Private Language Models for Secure Data Sharing

Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private Language Models

Harnessing large-language models to generate private synthetic text

Differentially Private Tabular Data Synthesis using Large Language Models

PrivSyn: Differentially Private Data Synthesis

Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training

Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

Private prediction for large-scale synthetic text generation

Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders

Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains

Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy

Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data

DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Quantifying and Mitigating Privacy Risks for Tabular Generative Models

Differentially Private Knowledge Distillation via Synthetic Text Generation

Differentially Private Synthetic Data via Foundation Model APIs 2: Text

Bounding the Excess Risk for Linear Models Trained on Marginal-Preserving, Differentially-Private, Synthetic Data

Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data