Abstract: Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives which are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained language models doesn't tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.

What problem does this paper attempt to address?

This paper asks how to overcome the performance degradation and high computational cost of training large-scale language models with Differential Privacy (DP). It focuses on two problems:

1. **Performance Degradation Problem**: Directly applying Differentially Private Stochastic Gradient Descent (DP-SGD) to Natural Language Processing (NLP) tasks leads to a significant drop in model performance.
2. **Computational Cost Problem**: DP-SGD clips the gradient of each example individually, so training large Transformer models with it consumes a large amount of memory.

To solve these problems, the paper proposes the following methods:

1. **Using Large Pretrained Language Models**: Fine-tuning language models that have already been pretrained (such as BERT, RoBERTa, and GPT-2) improves model performance while preserving privacy.
2. **Non-standard Hyperparameter Settings**: Hyperparameters such as learning rate and batch size are chosen to suit DP optimization rather than carried over from non-private training.
3. **Fine-tuning Objectives Aligned with Pretraining**: Fine-tuning tasks are designed to match the pretraining procedure, for example by recasting a classification task as filling in a masked word, which narrows the gap between pretraining and fine-tuning.
4. **Ghost Clipping Technique**: A new memory-saving technique lets clipping in DP-SGD run without instantiating per-example gradients for any linear layer in the model. This makes it possible to train large Transformers privately with almost the same memory cost as non-private training, at a modest run-time overhead.

With these methods, the paper shows that, under the same privacy budget, the resulting models outperform existing DP-trained models and in some cases also outperform strong non-private baselines.
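The per-example clipping that drives the memory cost can be illustrated with a minimal sketch of one DP-SGD update. This is a generic textbook form of the step written in plain Python, not the authors' implementation; the function names (`clip_to_norm`, `dp_sgd_update`) and parameter names (`max_norm`, `noise_multiplier`) are hypothetical and chosen only for illustration.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Rescale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [x * scale for x in grad]


def dp_sgd_update(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD step: clip each example's gradient, sum, add noise, average.

    per_example_grads holds one gradient vector per example in the batch.
    Materializing all of these vectors is what makes naive DP-SGD
    memory-hungry for large models.
    """
    batch_size = len(per_example_grads)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        for j, x in enumerate(clip_to_norm(grad, max_norm)):
            summed[j] += x
    # Gaussian noise with standard deviation proportional to the clipping norm.
    noise_std = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # the first exceeds max_norm, the second does not
    print(dp_sgd_update([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                        noise_multiplier=0.5, rng=rng))
```

Because every example's gradient is rescaled separately before the sum, a naive implementation has to keep one full gradient per example in memory, which is the cost the paper's memory-saving technique targets.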
The paper also reports a notable finding: although high-dimensional models are generally expected to perform poorly under DP optimization, the experiments show that larger pretrained models can in fact yield better private fine-tuning results.
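As a rough illustration of how clipping can avoid instantiating per-example gradients for a linear layer, the sketch below checks a standard linear-algebra identity: the squared Frobenius norm of one example's weight gradient can be computed from inner products of the layer's inputs and output gradients alone. Whether this matches the exact computation in the authors' code is an assumption; the function names and the toy shapes are hypothetical.

```python
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def naive_grad_norm_sq(acts, out_grads):
    """Build the d_out x d_in weight gradient for one example, then take its norm.

    acts:      T input vectors to the linear layer (one per token), each of size d_in
    out_grads: T gradient vectors w.r.t. the layer output, each of size d_out
    The weight gradient is the sum over tokens of outer(out_grads[t], acts[t]).
    """
    d_in, d_out = len(acts[0]), len(out_grads[0])
    weight_grad = [[0.0] * d_in for _ in range(d_out)]
    for a, g in zip(acts, out_grads):
        for i in range(d_out):
            for j in range(d_in):
                weight_grad[i][j] += g[i] * a[j]
    return sum(x * x for row in weight_grad for x in row)


def ghost_grad_norm_sq(acts, out_grads):
    """Same squared norm, computed without forming the weight gradient.

    Uses ||sum_t g_t a_t^T||_F^2 = sum_{t,s} (a_t . a_s) * (g_t . g_s),
    which needs only T x T inner products instead of a d_out x d_in matrix.
    """
    steps = range(len(acts))
    return sum(dot(acts[t], acts[s]) * dot(out_grads[t], out_grads[s])
               for t in steps for s in steps)


if __name__ == "__main__":
    rng = random.Random(0)
    T, d_in, d_out = 4, 5, 3  # toy sizes
    acts = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]
    out_grads = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(T)]
    print(naive_grad_norm_sq(acts, out_grads), ghost_grad_norm_sq(acts, out_grads))
```

Once the per-example norms are known, each example's clipping factor can be computed and applied as a weight on its loss, so the per-example gradients never need to be stored. That last step is described here only in outline and is not implemented in the sketch.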

Large Language Models Can Be Strong Differentially Private Learners

An Efficient DP-SGD Mechanism for Large Scale NLP Models

When Does Differentially Private Learning Not Suffer in High Dimensions?

Differentially Private Language Models Benefit from Public Pre-training

DP-FP: Differentially Private Forward Propagation for Large Models

Private Knowledge Transfer via Model Distillation with Generative Adversarial Networks

Harnessing large-language models to generate private synthetic text

Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models

LMO-DP: Optimizing the Randomization Mechanism for Differentially Private Fine-Tuning (Large) Language Models

Pre-training Differentially Private Models with Limited Public Data

Fine-Tuning Large Language Models with User-Level Differential Privacy

Differentially Private Optimization on Large Model at Small Cost

Differentially Private Fine-tuning of Language Models

DPFormer: Learning Differentially Private Transformer on Long-Tailed Data

Private Fine-tuning of Large Language Models with Zeroth-order Optimization

Selective Pre-training for Private Fine-tuning

DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Learning Differentially Private Recurrent Language Models

Sparsity-Preserving Differentially Private Training of Large Embedding Models

Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

Differentially Private Next-Token Prediction of Large Language Models