A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness

Zongxiong Chen, Jiahui Geng, Derui Zhu, Herbert Woisetschlaeger, Qing Li, Sonja Schimmler, Ruben Mayer, Chunming Rong
DOI: https://doi.org/10.48550/arXiv.2305.03355
2023-05-27
Abstract: The aim of dataset distillation is to encode the rich features of an original dataset into a tiny dataset. It is a promising approach to accelerate neural network training and related studies. Different approaches have been proposed to improve the informativeness and generalization performance of distilled images. However, no work has comprehensively analyzed this technique from a security perspective and there is a lack of systematic understanding of potential risks. In this work, we conduct extensive experiments to evaluate current state-of-the-art dataset distillation methods. We successfully use membership inference attacks to show that privacy risks still remain. Our work also demonstrates that dataset distillation can cause varying degrees of impact on model robustness and amplify model unfairness across classes when making predictions. This work offers a large-scale benchmarking framework for dataset distillation evaluation.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates and analyzes how dataset distillation affects performance, privacy, robustness, and fairness. Specifically, it asks three questions:

1. **Can synthetic datasets be used instead of real datasets to protect data privacy?**
   - The authors test this with membership inference attacks (MIAs) and find that dataset distillation has no inherent privacy-protecting ability. Its privacy risks depend on factors such as the distillation rate, the initialization method, and the number of classes.
2. **When a model is trained on the distilled dataset, how does visual noise affect the model's robustness?**
   - The authors study how different distillation methods affect robustness and find that the distillation rate has some impact but is not the main factor.
3. **In classification tasks, is dataset distillation fair to the prediction results of each category?**
   - The authors find that dataset distillation magnifies the unfairness between categories, and that this unfairness intensifies as the distillation rate increases.

### Specific Problem Analysis

#### 1. Can Dataset Distillation Protect Privacy?

- **Experimental Design**: The authors perform dataset distillation using training sets of different sizes and evaluate privacy risks through membership inference attacks.
- **Experimental Results**: The higher the distillation rate, the more vulnerable the model is to MIAs. For example, at a distillation rate of 25%, the attack AUC for the IDC method reaches 98.07%, indicating a high privacy risk.
- **Conclusion**: Dataset distillation has no inherent privacy-protecting ability and may instead increase the risk of privacy leakage.

#### 2. The Impact of Dataset Distillation on Model Robustness

- **Experimental Design**: The authors use adversarial example attacks (such as DeepFoolAttack) to evaluate the model's robustness.
- **Experimental Results**: Different distillation methods affect the model's robustness to different degrees, but the distillation rate is not the decisive factor.
- **Conclusion**: Dataset distillation may reduce the model's robustness, but the specific impact depends on the distillation method used.

#### 3. The Impact of Dataset Distillation on Model Fairness

- **Experimental Design**: The authors evaluate the model's fairness by comparing the accuracy and loss distributions across categories.
- **Experimental Results**: Dataset distillation magnifies the unfairness between categories. The information loss is most severe for categories on which the model already performed poorly.
- **Conclusion**: Dataset distillation may widen the gap in model performance across categories, which harms the model's fairness.

### Summary

This paper systematically evaluates dataset distillation techniques in terms of performance, privacy, robustness, and fairness, reveals the potential risks of existing distillation methods, and provides an important reference for future research.
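
To make the AUC figure quoted in the privacy analysis more concrete, the following is a minimal sketch of how an attack AUC can be computed for a simple loss-threshold membership inference attack. This is an illustrative assumption, not necessarily the attack used in the paper: the function name `attack_auc` and the toy loss values are hypothetical, and the sketch assumes the attacker guesses "member" when a sample's loss under the target model is low.

```python
from typing import Sequence


def attack_auc(member_losses: Sequence[float],
               nonmember_losses: Sequence[float]) -> float:
    """AUC of a loss-threshold membership inference attack (illustrative).

    The attacker treats a lower loss as stronger evidence of membership.
    The AUC then equals the probability that a randomly chosen member has
    a lower loss than a randomly chosen non-member, with ties counted as
    one half. 0.5 means the attack is no better than chance; 1.0 means
    members and non-members are perfectly separable.
    """
    if len(member_losses) == 0 or len(nonmember_losses) == 0:
        raise ValueError("both groups must be non-empty")

    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0   # member correctly ranked as more suspicious
            elif m == n:
                wins += 0.5   # tie: attacker has to guess
    return wins / (len(member_losses) * len(nonmember_losses))


# Toy per-sample losses (hypothetical values, not taken from the paper).
members = [0.05, 0.10, 0.20, 0.90]       # samples used for distillation
nonmembers = [0.15, 0.80, 1.20, 1.50]    # held-out samples
print(attack_auc(members, nonmembers))   # 13 of 16 pairs ranked correctly
```

The pairwise form is quadratic in the number of samples, which is fine for illustration. For larger evaluations, a rank-based routine such as scikit-learn's `roc_auc_score` gives the same quantity more efficiently.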