Privacy-Preserving Student Learning with Differentially Private Data-Free Distillation

Bochao Liu, Jianghu Lu, Pengju Wang, Junjie Zhang, Dan Zeng, Zhenxing Qian, Shiming Ge
2024-09-19
Abstract: Deep learning models can achieve high inference accuracy by extracting rich knowledge from massive well-annotated data, but may pose the risk of data privacy leakage in practical deployment. In this paper, we present an effective teacher-student learning approach to train privacy-preserving deep learning models via differentially private data-free distillation. The main idea is generating synthetic data to learn a student that can mimic the ability of a teacher well-trained on private data. In the approach, a generator is first pretrained in a data-free manner by incorporating the teacher as a fixed discriminator. With the generator, massive synthetic data can be generated for model training without exposing data privacy. Then, the synthetic data is fed into the teacher to generate private labels. Towards this end, we propose a label differential privacy algorithm termed selective randomized response to protect the label information. Finally, a student is trained on the synthetic data with the supervision of private labels. In this way, both data privacy and label privacy are well protected in a unified framework, leading to privacy-preserving models. Extensive experiments and analysis clearly demonstrate the effectiveness of our approach.
Machine Learning, Artificial Intelligence, Cryptography and Security, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to train high-performance deep learning models without disclosing private data. Specifically, it proposes a differentially private data-free distillation method (DP-DFD). A student model is trained on synthetic data so that it imitates the teacher model without directly accessing the private data. The method protects both data privacy and label privacy while maintaining high model accuracy.

### Main problems and solutions:

1. **Data privacy leakage**: Traditional deep learning models require a large amount of well-labeled training data, which may contain sensitive information. Once the model is released, that information may leak.
   - **Solution**: The paper proposes a data-free distillation method. The student model is trained on generated synthetic data, so the private data is never used directly.
2. **Label privacy leakage**: Even if the data itself is not leaked, the labels the teacher model produces may still leak private information.
   - **Solution**: The paper introduces a selective randomized response algorithm that applies differential privacy to the labels generated by the teacher model, further protecting label privacy.

### Method overview:

- **Generator training**: First, a teacher model is trained on the private data. The teacher is then used as a fixed discriminator to train a generator in a data-free manner. The synthetic data produced by the generator is used for the subsequent student training.
- **Student model training**: The synthetic data is fed into the teacher model to generate labels protected by differential privacy. The student model is trained on the synthetic data and these private labels, thereby learning the teacher's knowledge.
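
The label-protection step can be illustrated with a minimal sketch. This summary does not spell out the paper's selective randomized response, so the code below shows only the standard k-ary randomized response mechanism that such label differential privacy methods build on; it is not the authors' algorithm. The function name `randomized_response` and its parameters `num_classes`, `epsilon`, and `rng` are hypothetical names chosen for illustration.

```python
import math
import random


def randomized_response(labels, num_classes, epsilon, rng=None):
    """Standard k-ary randomized response (NOT the paper's selective variant).

    Each label is kept with probability e^eps / (e^eps + K - 1); otherwise it
    is replaced by a uniformly chosen different class.
    """
    rng = rng or random.Random()
    keep_prob = math.exp(epsilon) / (math.exp(epsilon) + num_classes - 1)
    noisy = []
    for y in labels:
        if rng.random() < keep_prob:
            noisy.append(y)
        else:
            # An offset in [1, K-1] guarantees the new label differs from y.
            offset = rng.randrange(1, num_classes)
            noisy.append((y + offset) % num_classes)
    return noisy


# Hypothetical usage: perturb teacher predictions before training the student.
teacher_labels = [3, 1, 4, 1, 5, 9, 2, 6]
private_labels = randomized_response(teacher_labels, num_classes=10, epsilon=1.0)
```

A smaller `epsilon` flips more labels (stronger privacy, noisier supervision), while a large `epsilon` leaves almost all labels unchanged.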
### Main contributions:

1. A differentially private data-free distillation method that trains the student model on synthetic data while protecting both data and label privacy.
2. A selective randomized response algorithm that protects label privacy and improves how well the student model learns.
3. Extensive experiments that verify the method's effectiveness on several datasets.

### Experimental results:

- Experiments on multiple datasets (such as MNIST, FashionMNIST, CIFAR10, CIFAR100, and CelebA) show that the method performs well under different privacy budgets, with the most significant advantages on high-dimensional datasets.
- Compared with existing data-sensitive and label-sensitive methods, the method provides stronger privacy protection while maintaining high model accuracy.

In conclusion, the proposed method trains high-performance student models while protecting data and label privacy, providing a new approach to privacy-preserving deep learning model training.