Abstract: The success of large language models (LLMs) has led many parties to fine-tune LLMs on their own private data. However, this practice raises privacy concerns because LLMs can memorize their training data. Existing solutions, such as substituting synthetic data for the private data, struggle to improve performance and preserve privacy at the same time. They either rely on a local model for generation, which degrades performance, or call external APIs, which exposes the data directly to API servers. To address this issue, we propose KnowledgeSG, a novel client-server framework that enhances synthetic data quality and improves model performance while ensuring privacy. We achieve this by learning local knowledge from the private data with differential privacy (DP) and distilling professional knowledge from the server. Additionally, inspired by federated learning, we transmit models rather than data between the client and server to prevent privacy leakage. Extensive experiments in the medical and financial domains demonstrate the effectiveness of KnowledgeSG. Our code is publicly available at https://github.com/wwh0411/KnowledgeSG.

What problem does this paper attempt to address?

This paper addresses how to preserve privacy strictly while improving the quality of the synthetic data used to fine-tune large language models (LLMs). Specifically, it targets two main challenges:

1. **Privacy risks**: When LLMs are fine-tuned on private data, they may memorize sensitive information from the training set, creating a risk of privacy leakage. Existing methods either generate synthetic data with a local model, which degrades performance, or generate it through API servers, which exposes private data directly to third parties.
2. **Performance-privacy trade-off**: Existing solutions struggle to improve model performance and protect privacy at the same time. API-based methods increase privacy risk, while methods that rely solely on local models lose performance because their synthetic data is of low quality.

To solve these problems, the authors propose a new framework, **KnowledgeSG**. The client learns local knowledge from its private data under differential privacy (DP), and professional knowledge is distilled from a model on the server. Models, rather than data, are transmitted between the client and the server, so synthetic data quality and model performance improve while the private data stays on the client.

### Core contributions of KnowledgeSG

1. **Propose a new privacy-preserving client-server framework**: Knowledge distilled from a professional model on the server strengthens the client's ability to generate synthetic data, improving data quality and model performance while protecting privacy.
2. **Introduce a new server-side synthetic data generation method**: A professional model judges and corrects the raw synthetic data so that the generated instructions and responses are of high quality and meet domain requirements.
3. **Verify the effectiveness of the framework through extensive experiments**: The reported results show that KnowledgeSG performs strongly on both privacy-protection and performance benchmarks in the medical and financial domains, even surpassing non-private methods and some professional baseline models.

### Summary

This paper tackles the tension between privacy and performance in fine-tuning large language models by designing a client-server framework that both protects privacy and improves model performance.
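
The text above says the client learns from private data under differential privacy but does not say which mechanism is used. One common choice is DP-SGD, which clips each example's gradient and adds Gaussian noise before the update; whether KnowledgeSG uses exactly this procedure is an assumption here. The following is a minimal pure-Python sketch of a single DP-SGD step. The function names `clip_gradient` and `dp_sgd_step`, and every parameter name, are illustrative and are not taken from the paper or its repository.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update (illustrative sketch, not the paper's code).

    1. Clip each per-example gradient to L2 norm <= max_norm.
    2. Sum the clipped gradients coordinate by coordinate.
    3. Add Gaussian noise with std = noise_multiplier * max_norm to each sum.
    4. Divide by the batch size and take a plain gradient step.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    sigma = noise_multiplier * max_norm

    noisy_mean = []
    for i in range(len(weights)):
        total = sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)
        noisy_mean.append(total / batch_size)

    return [w - lr * g for w, g in zip(weights, noisy_mean)]


# Hypothetical usage: two examples, two parameters, seeded noise source.
rng = random.Random(0)
updated = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    lr=0.1,
    max_norm=1.0,
    noise_multiplier=1.0,
    rng=rng,
)
```

Clipping bounds how much any single private record can move the model, and the noise scale is tied to that bound. In practice a library such as Opacus would handle per-example gradients and privacy accounting; this sketch only shows the shape of the update and does not track the privacy budget.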

KnowledgeSG: Privacy-Preserving Synthetic Text Generation with Knowledge Distillation from Server

Private Knowledge Transfer via Model Distillation with Generative Adversarial Networks

PKDGAN: Private Knowledge Distillation with Generative Adversarial Networks

MCKD: Mutually Collaborative Knowledge Distillation for Federated Domain Adaptation and Generalization

Differentially Private Knowledge Distillation via Synthetic Text Generation

Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data

Safe Distillation Box

Learning Privacy-Preserving Student Networks via Discriminative-Generative Distillation

Selective Knowledge Sharing for Privacy-Preserving Federated Distillation without A Good Teacher

LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification

Privacy-Preserving Knowledge Distillation in Latency-Critical Federated Task Offloading for Consumer IoT Networks

Locally Differentially Private Distributed Deep Learning via Knowledge Distillation

Small Scale Data-Free Knowledge Distillation

Model-Based Differentially Private Knowledge Transfer for Large Language Models

Differentially Private Synthetic Data via Foundation Model APIs 2: Text

PDSS: A Privacy-Preserving Framework for Step-by-Step Distillation of Large Language Models

FedMKGC: Privacy-Preserving Federated Multilingual Knowledge Graph Completion

Fine-grained Private Knowledge Distillation

Personalized and privacy-enhanced federated learning framework via knowledge distillation