Abstract: Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives that are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform both state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines, by directly fine-tuning pretrained models with DP optimization on moderately sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory-saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to the conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained language models does not tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.
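The abstract does not spell out how clipping can avoid per-example gradients, so the following is only a minimal NumPy sketch of one way this can work for a single linear layer, not the authors' implementation. It assumes the layer receives inputs `a` of shape `(B, T, d)` and that backpropagation provides output gradients `g` of shape `(B, T, p)`; all function names are hypothetical. The per-example weight gradient is `g_i^T a_i`, and its squared Frobenius norm equals the sum of the elementwise product of the two `(T, T)` Gram matrices `a_i a_i^T` and `g_i g_i^T`, so the norm can be computed without ever forming a `(p, d)` gradient per example.

```python
import numpy as np


def per_example_grad_norms_naive(a, g):
    """Reference version: materialize every per-example weight gradient.

    a: layer inputs, shape (B, T, d)
    g: gradients w.r.t. layer outputs, shape (B, T, p)
    """
    grads = np.einsum("btp,btd->bpd", g, a)  # (B, p, d): the memory cost to avoid
    return np.sqrt((grads ** 2).sum(axis=(1, 2)))


def per_example_grad_norms_gram(a, g):
    """Same norms from two (T, T) Gram matrices per example."""
    aa = np.einsum("btd,bsd->bts", a, a)  # a_i a_i^T
    gg = np.einsum("btp,bsp->bts", g, g)  # g_i g_i^T
    return np.sqrt((aa * gg).sum(axis=(1, 2)))


def clipped_noisy_weight_grad(a, g, clip_norm, noise_multiplier, rng):
    """DP-SGD style update for one linear layer (illustrative only).

    Each example's contribution is rescaled so its gradient norm is at most
    clip_norm, the rescaled contributions are summed in a single contraction,
    and Gaussian noise with std noise_multiplier * clip_norm is added.
    """
    norms = per_example_grad_norms_gram(a, g)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))  # shape (B,)
    summed = np.einsum("b,btp,btd->pd", scale, g, a)  # never builds (B, p, d)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / a.shape[0]


rng = np.random.default_rng(0)
a = rng.normal(size=(4, 5, 3))  # B=4 examples, T=5 tokens, d=3 input features
g = rng.normal(size=(4, 5, 2))  # p=2 output features
private_grad = clipped_noisy_weight_grad(a, g, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
```

The Gram-matrix route costs O(B T^2) memory instead of O(B p d), which is a saving when the sequence length is small relative to the layer's width; for a full model, the clipping factor would be computed from the norm over all layers rather than one layer in isolation.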

Sparsity-Preserving Differentially Private Training of Large Embedding Models

Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models

Towards Efficient and Scalable Training of Differentially Private Deep Learning

An Efficient DP-SGD Mechanism for Large Scale NLP Models

A(DP)$^2$SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy

EANA: Reducing Privacy Risk on Large-scale Recommendation Models

Enhancing DP-SGD through Non-monotonous Adaptive Scaling Gradient Weight

Differentially Private Optimization with Sparse Gradients

DPDR: Gradient Decomposition and Reconstruction for Differentially Private Deep Learning

DP-FP: Differentially Private Forward Propagation for Large Models

Dynamic Differential-Privacy Preserving SGD

LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models

Large Language Models Can Be Strong Differentially Private Learners

Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent

A Stochastic Gradient Descent Algorithm Based on Adaptive Differential Privacy

Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning

DP-LSSGD: A Stochastic Optimization Method to Lift the Utility in Privacy-Preserving ERM

Improving Differentially Private SGD via Randomly Sparsified Gradients