Large Language Models Can Be Strong Differentially Private Learners

Xuechen Li, Florian Tramèr, Percy Liang, Tatsunori Hashimoto
DOI: https://doi.org/10.48550/arXiv.2110.05679
2022-11-11
Abstract: Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives which are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension) empirical results reveal that private learning with pretrained language models doesn't tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses two obstacles to training large language models with Differential Privacy (DP):

1. **Performance degradation**: Directly applying Differentially Private Stochastic Gradient Descent (DP-SGD) to Natural Language Processing (NLP) tasks causes a large drop in model performance.
2. **Computational cost**: DP-SGD clips the gradient of each example individually, so training large Transformer models with it consumes a large amount of memory.

To address these problems, the paper proposes the following:

1. **Large pretrained language models**: Fine-tuning models that have already been pretrained (such as BERT, RoBERTa, and GPT-2) improves performance while preserving privacy.
2. **Non-standard hyperparameter settings**: Hyperparameters such as learning rate and batch size are chosen to suit DP optimization rather than carried over from non-private training.
3. **Fine-tuning objectives aligned with pretraining**: Fine-tuning tasks are designed to match the pretraining procedure, for example by recasting a classification task as filling in a masked word, which narrows the gap between pretraining and fine-tuning.
4. **Ghost clipping**: A memory-saving technique runs the clipping step of DP-SGD without instantiating per-example gradients for any linear layer, so Transformers can be trained privately with almost the same memory cost as non-private training, at a modest run-time overhead.

With these methods, under the same privacy budget, the resulting models outperform existing DP-trained models and, in some cases, strong non-private baselines.
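The ghost clipping idea in item 4 can be illustrated for a single linear layer. The sketch below is not the authors' implementation; it assumes a layer that receives inputs `a` of shape `(B, T, d)` and output gradients `g` of shape `(B, T, p)`, and the function names, `clip_norm`, and `noise_multiplier` are illustrative. It relies on the identity that the squared Frobenius norm of the per-example weight gradient equals the sum of the elementwise product of the two `T x T` Gram matrices `a aᵀ` and `g gᵀ`, so the `(B, p, d)` per-example gradient tensor never has to be built. As a further simplification, it clips by this one layer's norm, whereas DP-SGD on a full model clips by the norm over all parameters.

```python
import numpy as np

def ghost_grad_norms(a, g):
    """Per-example weight-gradient norms for one linear layer.

    a: (B, T, d) layer inputs, g: (B, T, p) gradients w.r.t. layer outputs.
    Uses ||g_i^T a_i||_F^2 = sum((a_i a_i^T) * (g_i g_i^T)), which only
    needs (B, T, T) Gram matrices instead of a (B, p, d) gradient tensor.
    """
    aa = np.einsum("btd,bsd->bts", a, a)   # input Gram matrices
    gg = np.einsum("btp,bsp->bts", g, g)   # output-gradient Gram matrices
    sq = np.einsum("bts,bts->b", aa, gg)
    return np.sqrt(np.maximum(sq, 0.0))    # guard against tiny negatives

def naive_grad_norms(a, g):
    """Reference: materialise per-example gradients, then take norms."""
    per_example = np.einsum("btp,btd->bpd", g, a)  # (B, p, d), memory heavy
    return np.sqrt((per_example ** 2).sum(axis=(1, 2)))

def dp_sgd_linear_grad(a, g, clip_norm, noise_multiplier, rng):
    """Clipped, noised average gradient for the layer (single-layer sketch)."""
    norms = ghost_grad_norms(a, g)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))  # per-example factor
    # Weighted sum of per-example gradients, again without a (B, p, d) tensor.
    clipped_sum = np.einsum("b,btp,btd->pd", scale, g, a)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped_sum.shape)
    return (clipped_sum + noise) / a.shape[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(4, 5, 3))
    g = rng.normal(size=(4, 5, 2))
    print(np.allclose(ghost_grad_norms(a, g), naive_grad_norms(a, g)))
    print(dp_sgd_linear_grad(a, g, clip_norm=1.0, noise_multiplier=0.5, rng=rng).shape)
```

The Gram-matrix route costs memory proportional to `B * T * T` rather than `B * p * d`, which is why it pays off when the sequence length is small relative to the layer width.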
In addition, the paper reports a notable finding: although DP optimization is conventionally expected to perform poorly on high-dimensional models, the experiments show that larger pretrained models yield better private fine-tuning results.
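The conventional expectation comes from how DP-SGD noise grows with model size. As a brief sketch, using notation that is assumed here rather than taken from the page (batch size $B$, clipping norm $C$, noise multiplier $\sigma$, number of parameters $d$, per-example gradient $g_i$), the privatized gradient and the expected squared norm of its noise term are:

$$
\tilde{g} = \frac{1}{B}\left(\sum_{i=1}^{B} g_i \cdot \min\!\left(1, \frac{C}{\lVert g_i \rVert_2}\right) + z\right),
\qquad z \sim \mathcal{N}\!\left(0, \sigma^2 C^2 I_d\right),
\qquad \mathbb{E}\,\lVert z \rVert_2^2 = \sigma^2 C^2 d .
$$

Each clipped gradient has norm at most $C$ regardless of $d$, while the noise norm grows roughly like $\sigma C \sqrt{d}$. That gap is the reason larger models were expected to fare worse, and it is the expectation that the paper's empirical results run against.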