A Split-and-Privatize Framework for Large Language Model Fine-Tuning

Xicong Shen, Yang Liu, Huiqi Liu, Jue Hong, Bing Duan, Zirui Huang, Yunlong Mao, Ye Wu, Di Wu
DOI: https://doi.org/10.48550/arXiv.2312.15603
2023-12-25
Abstract: Fine-tuning is a prominent technique to adapt a pre-trained language model to downstream scenarios. In parameter-efficient fine-tuning, only a small subset of modules is trained on the downstream datasets, while the rest of the pre-trained model is left frozen to save computation resources. In recent years, Model-as-a-Service (MaaS) has emerged as a popular form of productization, in which vendors provide abundant pre-trained language models, server resources and core functions, and customers can fine-tune, deploy and invoke their customized model by accessing the one-stop MaaS with their own private dataset. In this paper, we identify the model and data privacy leakage risks in MaaS fine-tuning, and propose a Split-and-Privatize (SAP) framework, which mitigates the privacy issues by adapting the existing split learning architecture. The proposed SAP framework is thoroughly evaluated through experiments, and the results indicate that it can enhance the empirical privacy by 62% at the cost of 1% model performance degradation on the Stanford Sentiment Treebank dataset.
Computation and Language
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to solve the problems of data and model privacy leakage when fine-tuning large-scale language models in the Model-as-a-Service (MaaS) scenario. Specifically:

1. **Model privacy**: Pretrained language models (PLMs) usually contain millions or even billions of parameters, which are regarded as the proprietary assets of vendors and cannot be made public. Therefore, customers cannot directly access the complete model weights.
2. **Data privacy**: Customers' text data usually contains sensitive information, such as identity and asset information. If the raw data or representations are directly transmitted to the vendor, it may lead to serious privacy leakage.
3. **Balance between privacy protection and performance**: Although existing privacy-protection methods (such as differential privacy) can protect data privacy, they often reduce the performance of the model on downstream tasks. Therefore, a method that can protect privacy and maintain model performance is required.

To solve the above problems, the authors propose a Split-and-Privatize (SAP) framework based on the existing split-learning architecture. The SAP framework alleviates the privacy leakage problem by splitting the model and applying privacy-protection mechanisms, and optimizes the trade-off between privacy and utility through the Contribution Token Identification (CTI) method. Experimental results show that the SAP framework can maintain high model performance while protecting model and data privacy.
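
The split-then-privatize idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the tiny embedding table, the linear "top model", the Laplace noise, and the names `client_forward` and `server_forward` are all assumptions made for illustration. It only shows the general shape of the data flow, in which the customer keeps a bottom part of the model, perturbs its output representations, and sends only the perturbed representations to the vendor.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 100, 16, 2

# Hypothetical "bottom model" held by the customer: an embedding table.
bottom_embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

# Hypothetical "top model" held by the vendor: a linear classifier head.
top_weights = rng.normal(size=(EMBED_DIM, NUM_CLASSES))


def client_forward(token_ids, noise_scale, noise_rng):
    """Customer side: embed tokens, then perturb them before transmission.

    Laplace noise is an assumed stand-in for a privatization mechanism;
    the summary above does not specify which mechanism SAP uses.
    """
    reps = bottom_embeddings[token_ids]  # shape: (seq_len, EMBED_DIM)
    if noise_scale > 0:
        reps = reps + noise_rng.laplace(scale=noise_scale, size=reps.shape)
    return reps


def server_forward(private_reps):
    """Vendor side: sees only the perturbed representations, never raw text."""
    pooled = private_reps.mean(axis=0)  # mean-pool over the token axis
    return pooled @ top_weights         # shape: (NUM_CLASSES,) logits


token_ids = np.array([3, 14, 15])
private_reps = client_forward(token_ids, noise_scale=0.5,
                              noise_rng=np.random.default_rng(1))
logits = server_forward(private_reps)
```

In this sketch, a larger `noise_scale` makes the transmitted representations harder to map back to the original tokens but also distorts the logits more, which is the privacy-utility trade-off the summary refers to.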