KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

Wenhao Wang, Xiaoyu Liang, Rui Ye, Jingyi Chai, Siheng Chen, Yanfeng Wang
2024-10-10
Abstract: The success of large language models (LLMs) has enabled many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns because LLMs memorize their training data. Existing solutions, such as substituting synthetic data for the private data, struggle to improve performance and preserve privacy at the same time. They either rely on a local model for generation, which degrades performance, or use APIs, which exposes the data directly to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in the medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is publicly available at https://github.com/wwh0411/KnowledgeSG.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how to improve the quality of synthetic data while strictly preserving privacy when fine-tuning large language models (LLMs). It addresses two main challenges:

1. **Privacy risks**: LLMs fine-tuned on private data may memorize sensitive information in the training data, which creates a risk of privacy leakage. Existing methods either generate synthetic data with a local model, which degrades performance, or generate it through API servers, which exposes the private data directly to third parties.
2. **Performance-privacy trade-off**: Existing solutions struggle to improve model performance and protect privacy at the same time. API-based methods increase privacy risk, while methods that rely solely on local models lose performance because the synthetic data is of low quality.

To solve these problems, the authors propose a new framework, **KnowledgeSG**, which combines client-side learning with server-side knowledge distillation to improve the quality of synthetic data while ensuring privacy, and in turn improve the model's performance.

### Core contributions of KnowledgeSG

1. **A new privacy-preserving client-server framework**: Knowledge is distilled from a professional model on the server to strengthen the client's ability to generate synthetic data, improving data quality and model performance while protecting privacy.
2. **A new server-side synthetic data generation method**: A professional model judges and corrects the raw synthetic data so that the generated instructions and responses are of high quality and meet domain requirements.
3. **Extensive experimental validation**: The experiments show that KnowledgeSG performs well on privacy-protection and performance benchmarks in the medical and financial domains, even surpassing non-private methods and some professional baseline models.

### Summary

The paper addresses the tension between privacy and performance in LLM fine-tuning with a client-server framework that protects privacy while improving model performance.
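
### Illustrative sketch

The server-side step described above, in which a professional model judges and corrects raw synthetic data, can be pictured roughly as a filter-then-rewrite pass. The sketch below is an illustration under stated assumptions, not the authors' implementation: `Sample`, `refine_synthetic_data`, `toy_judge`, and `toy_correct` are hypothetical names, and the two toy functions stand in for calls to a professional model, whose actual prompts and acceptance criteria are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One synthetic instruction-response pair."""
    instruction: str
    response: str


def refine_synthetic_data(
    raw_samples: Iterable[Sample],
    judge: Callable[[str], bool],
    correct: Callable[[str], str],
) -> List[Sample]:
    """Keep instructions the judge accepts and regenerate their responses.

    `judge` and `correct` are placeholders for a professional model
    (assumption: the real system would query an LLM here).
    """
    refined = []
    for sample in raw_samples:
        if not judge(sample.instruction):
            continue  # drop instructions judged off-domain or low quality
        refined.append(Sample(sample.instruction, correct(sample.instruction)))
    return refined


# Toy stand-ins for the professional model (hypothetical, keyword-based).
def toy_judge(instruction: str) -> bool:
    return "dose" in instruction.lower()


def toy_correct(instruction: str) -> str:
    return f"[professional answer to: {instruction}]"


raw = [
    Sample("What dose is typical for adults?", "not sure"),
    Sample("Tell me a joke", "why did the chicken cross the road"),
]
refined = refine_synthetic_data(raw, toy_judge, toy_correct)
for s in refined:
    print(s.instruction, "->", s.response)
```

In this sketch only synthetic samples pass through the server-side functions, which mirrors the summary's point that models rather than private data are exchanged between client and server.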