Model-Based Differentially Private Knowledge Transfer for Large Language Models

Zhaomin Wu, Jizhou Guo, Junyi Hou, Bingsheng He, Lixin Fan, Qiang Yang
2024-10-14
Abstract: As large language models (LLMs) become increasingly prevalent in web services, effectively leveraging domain-specific knowledge while ensuring privacy has become critical. Existing methods, such as retrieval-augmented generation (RAG) and differentially private data synthesis, often compromise either the utility of domain knowledge or the privacy of sensitive data, limiting their applicability in specialized domains. To address these challenges, we propose Llamdex, a novel framework that integrates privacy-preserving, domain-specific models into LLMs. Our approach significantly enhances the accuracy of domain-specific tasks, achieving up to a 26% improvement compared to existing methods under the same differential privacy constraints. Experimental results show that Llamdex not only improves the accuracy of LLM responses but also maintains comparable inference efficiency to the original LLM, highlighting its potential for real-world applications.
Machine Learning, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the problem of using domain-specific knowledge effectively in large language models (LLMs) while preserving data privacy. Existing methods, such as retrieval-augmented generation (RAG) and differentially private data synthesis, trade off utility against privacy protection, which limits their use in specialized fields. The problem is most acute in fields that handle sensitive data, such as healthcare and finance, where accuracy is crucial.

### Main challenges

1. **Balance between utility and privacy**:
   - Existing knowledge transfer methods (such as RAG, transfer learning, and parameter-efficient fine-tuning) provide high utility but require direct sharing of domain-specific data, which raises serious privacy issues.
   - Differentially private data synthesis methods protect privacy by sharing synthetic data, but the noise added to maintain differential privacy can cause a significant decline in model performance.
2. **Data privacy protection**:
   - Large companies (servers) are usually unwilling to share their closed-source LLMs, and clients are unwilling to disclose sensitive data for privacy reasons. Together, these constraints are a major obstacle to developing LLMs that can use domain-specific knowledge effectively.

### Solutions

To solve these problems, the authors propose a new framework named Llamdex. Llamdex works as follows:

1. **Integrate differentially private models**:
   - Llamdex integrates domain-specific models protected by differential privacy into an intermediate layer of the LLM, where they act as domain experts. These models can be regarded as summaries of the data distribution and usually require less noise to reach the same level of differential privacy.
2. **Training and deployment processes**:
   - A client (such as a bank) trains an expert model on its private customer data under differential privacy. The model is then shared with the server, integrated into the general LLM, and further fine-tuned using public patterns. The resulting domain-specific LLM (Llamdex) can answer the client's financial queries according to public patterns without directly accessing private data.
3. **Design challenges and solutions**:
   - **Input-output space alignment**: A trainable mapping module maps the original tokens to feature vectors and converts the output of the expert model into multiple token embeddings, bridging the embedding space of the LLM and the operating space of the domain expert.
   - **Data unavailability problem**: The mapping module is trained on randomly generated synthetic data that follows the same pattern as the domain data. Once the domain encoder and decoder are trained, the synthetic expert is replaced by the domain expert for deployment.

### Experimental results

Llamdex is evaluated on four real-world datasets. Compared with existing methods under the same differential privacy constraints, accuracy improves by up to 26%, while inference efficiency remains comparable to that of the original LLM.

### Conclusion

Llamdex provides an effective way to improve the performance of LLMs in specific domains while preserving data privacy. The framework has strong potential for practical applications, especially in fields that handle sensitive data.
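
The train-then-swap workflow described under Solutions (train the encoder and decoder around a synthetic expert, then replace it with the domain expert at deployment) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the paper's implementation: `toy_encoder`, `toy_decoder`, `ExpertPlugin`, `synthetic_expert`, and `domain_expert` are names invented for illustration, the mappings are fixed functions instead of trained modules, and `domain_expert` is a placeholder for a model the client would have trained under differential privacy.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], float]  # feature vector -> prediction


def toy_encoder(tokens: Sequence[str]) -> Vector:
    """Hypothetical stand-in for the token-to-feature mapping:
    keeps the numeric tokens of a query as features."""
    features = []
    for tok in tokens:
        try:
            features.append(float(tok))
        except ValueError:
            continue  # non-numeric tokens carry no feature in this toy
    return features


def toy_decoder(prediction: float, num_tokens: int = 2, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for the output mapping: expands one expert
    output into `num_tokens` embeddings of width `dim`."""
    return [[prediction * (i + 1)] * dim for i in range(num_tokens)]


@dataclass(frozen=True)
class ExpertPlugin:
    """Encoder -> expert -> decoder pipeline sitting between LLM layers."""
    encoder: Callable[[Sequence[str]], Vector]
    expert: Expert
    decoder: Callable[[float], List[Vector]]

    def forward(self, tokens: Sequence[str]) -> List[Vector]:
        features = self.encoder(tokens)
        prediction = self.expert(features)
        return self.decoder(prediction)

    def with_expert(self, expert: Expert) -> "ExpertPlugin":
        # Swap only the expert; encoder and decoder are reused unchanged.
        return replace(self, expert=expert)


def synthetic_expert(features: Vector) -> float:
    """Assumed server-side expert, usable without any private data."""
    return sum(features)


def domain_expert(features: Vector) -> float:
    """Placeholder for the client's differentially private model."""
    return 1.0 if sum(features) > 100.0 else 0.0


training_plugin = ExpertPlugin(toy_encoder, synthetic_expert, toy_decoder)
deployed_plugin = training_plugin.with_expert(domain_expert)

query = "income 80 savings 30".split()
train_out = training_plugin.forward(query)
deploy_out = deployed_plugin.forward(query)
```

The design point the sketch tries to capture is that the server only ever needs the expert's input and output shapes: the mapping functions stay the same across the swap, so the private data never has to leave the client.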