PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining

Kecen Li, Chen Gong, Zhixiang Li, Yuzhong Zhao, Xinwen Hou, Tianhao Wang
2024-10-08
Abstract: Differential Privacy (DP) image data synthesis leverages the DP technique to generate synthetic data that replaces the sensitive data, allowing organizations to share and utilize synthetic images without privacy concerns. Previous methods incorporate advanced generative-model techniques and pre-training on a public dataset to produce exceptional DP image data, but suffer from unstable training and massive computational resource demands. This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data, promoting the efficient creation of DP datasets with high fidelity and utility. PRIVIMAGE first establishes a semantic query function using a public dataset. Then, this function assists in querying the semantic distribution of the sensitive dataset, facilitating the selection of data from the public dataset with analogous semantics for pre-training. Finally, we pre-train an image generative model using the selected data and then fine-tune this model on the sensitive dataset using Differentially Private Stochastic Gradient Descent (DP-SGD). PRIVIMAGE allows us to train a lightly parameterized generative model, reducing the noise in the gradient during DP-SGD training and enhancing training stability. Extensive experiments demonstrate that PRIVIMAGE uses only 1% of the public dataset for pre-training and 7.6% of the parameters in the generative model compared to the state-of-the-art method, while achieving superior synthetic performance and conserving more computational resources. On average, PRIVIMAGE achieves 30.1% lower FID and 12.6% higher Classification Accuracy than the state-of-the-art method. The replication package and datasets can be accessed online.
Computer Vision and Pattern Recognition, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of generating high-quality synthetic images while protecting privacy. Specifically, it proposes a method named PRIVIMAGE, which uses Differential Privacy (DP) to generate synthetic images that replace sensitive data, so that organizations can share and use these images without privacy concerns.

### Problem Background

Deep learning models increasingly rely on sensitive personal data during training, especially in fields such as medicine, finance, and social networks, which makes protecting user privacy an important issue. Existing methods can generate images under differential privacy, but they suffer from unstable training and excessive computational resource requirements.

### Shortcomings of Existing Methods

1. **Unstable Training**: Existing methods are prone to unstable training when generating high-quality synthetic images.
2. **High Computational Resource Requirements**: Existing methods need a large amount of computational resources to generate high-quality synthetic images, which limits their practical use.

### PRIVIMAGE's Solutions

PRIVIMAGE proposes a new DP image synthesis method that addresses these problems in three steps:

1. **Establishment of Semantic Query Function**: PRIVIMAGE first uses a public dataset to build a semantic query function that extracts the semantic information of images.
2. **Query and Selection of Semantic Distribution**: It then uses this function to query the semantic distribution of the sensitive dataset and, based on the result, selects semantically similar data from the public dataset for pre-training.
3. **Pre-training and Fine-tuning**: Finally, PRIVIMAGE pre-trains the image generative model on the selected public data and fine-tunes it on the sensitive dataset using Differentially Private Stochastic Gradient Descent (DP-SGD).

### Main Contributions

1. **Analysis of the Importance of Semantic Distribution**: The paper points out that the semantic distribution of the public dataset used for pre-training should be similar to that of the sensitive dataset, which is key to improving the quality and efficiency of image synthesis.
2. **Proposal of PRIVIMAGE**: PRIVIMAGE carefully selects pre-training data using the semantic distribution of the sensitive dataset, so that the generated synthetic images have higher fidelity and utility.
3. **Significant Savings in Computational Resources**: Experiments show that PRIVIMAGE uses only 1% of the public dataset for pre-training and a generative model with only 7.6% of the parameters of the state-of-the-art method, yet achieves better synthesis performance.

### Experimental Results

- **FID Metric**: On average, PRIVIMAGE achieves 30.1% lower FID than the state-of-the-art method.
- **Classification Accuracy**: On average, the classification accuracy of PRIVIMAGE is 12.6% higher than that of the state-of-the-art method.
- **Computational Resources**: The GPU memory and running time required by PRIVIMAGE are 50% and 48% lower, respectively, than those of existing methods.

In conclusion, by carefully selecting pre-training data and training a lightly parameterized generative model, PRIVIMAGE improves the quality and efficiency of DP image synthesis while greatly reducing computational resource requirements.
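
The semantic query and selection steps summarized above can be illustrated with a minimal sketch. It rests on several assumptions that the summary does not state: that the semantic query function is a classifier producing one score per public-dataset class, that the semantic distribution is a histogram of top-k predicted classes, and that the histogram is privatized with Gaussian noise before selection. The function names (`noisy_semantic_histogram`, `select_public_indices`) and all parameters are hypothetical, and privacy accounting is omitted.

```python
import numpy as np


def noisy_semantic_histogram(sensitive_scores, top_k, sigma, rng):
    """Histogram of top_k predicted public classes over sensitive images,
    perturbed with Gaussian noise (assumed privatization step)."""
    n_images, n_classes = sensitive_scores.shape
    # Indices of the top_k highest-scoring classes for each sensitive image.
    top = np.argpartition(-sensitive_scores, top_k - 1, axis=1)[:, :top_k]
    counts = np.bincount(top.ravel(), minlength=n_classes).astype(float)
    # Assumption: under add/remove-one-image adjacency, one image changes
    # at most top_k counts by 1, so the L2 sensitivity is sqrt(top_k).
    noise = rng.normal(0.0, sigma * np.sqrt(top_k), size=n_classes)
    return counts + noise


def select_public_indices(public_labels, noisy_hist, n_selected_classes):
    """Keep public images whose label is among the classes with the
    largest noisy counts."""
    chosen = np.argsort(-noisy_hist)[:n_selected_classes]
    mask = np.isin(public_labels, chosen)
    return np.flatnonzero(mask), chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins: 200 sensitive images scored over 20 public classes,
    # and 1000 public images with labels in the same 20 classes.
    scores = rng.random((200, 20))
    scores[:, [3, 11]] += 1.0  # sensitive data resembles classes 3 and 11
    public_labels = rng.integers(0, 20, size=1000)

    hist = noisy_semantic_histogram(scores, top_k=2, sigma=5.0, rng=rng)
    idx, chosen = select_public_indices(public_labels, hist, 2)
    print("selected classes:", sorted(chosen.tolist()))
    print("fraction of public data kept:", len(idx) / len(public_labels))
```

Only the selected subset would then be used for pre-training, before DP-SGD fine-tuning on the sensitive dataset. The noise multiplier `sigma` and the number of selected classes trade off privacy cost against how closely the selected data matches the sensitive semantics.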