Abstract: Language modeling is a keystone task in natural language processing. When training a language model on sensitive information, differential privacy (DP) allows us to quantify the degree to which our private data is protected. However, training algorithms which enforce differential privacy often lead to degradation in model quality. We study the feasibility of learning a language model which is simultaneously high-quality and privacy preserving by tuning a public base model on a private corpus. We find that DP fine-tuning boosts the performance of language models in the private domain, making the training of such models possible.

What problem does this paper attempt to address?

This paper addresses the problem of protecting personal privacy when training language models, especially when the training data contains sensitive information. Specifically, it explores how to obtain a language model that is both high-quality and privacy-preserving by fine-tuning a publicly pre-trained language model. The research finds that fine-tuning on a private dataset with differential privacy (DP) improves the performance of the language model in the private domain while also protecting personal data during training.

The main challenges mentioned in the paper include:

- **Model quality degradation**: Traditional differentially private training methods often lead to a significant decline in model performance.
- **Privacy protection requirements**: When a dataset contains sensitive personal information, the privacy of individuals must be strictly protected, especially in sensitive fields such as medical care.
- **Data distribution differences**: Differences between the distributions of public and private datasets may limit how well the model generalizes.

To address these problems, the authors propose a two-stage method: first pre-train a non-private language model on a large-scale public dataset, then fine-tune that model on the private dataset using differential privacy. The experimental results show that this method improves the performance of the model while still providing a degree of privacy protection for the data.
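The private fine-tuning stage described above can be illustrated with a small sketch. The summary does not say which DP training algorithm is used, so this example assumes DP-SGD (clip each example's gradient, sum, add Gaussian noise, then average) applied to a toy linear model instead of a language model. The function names, the squared-error loss, and all parameter values are hypothetical and chosen only for illustration; the sketch also does no privacy accounting, so it does not compute an (ε, δ) guarantee.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, batch, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update for a linear model with squared-error loss.

    Assumption: DP-SGD is the training algorithm; the source only says
    that differential privacy is used during fine-tuning.
    """
    summed = [0.0] * len(weights)
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        # Per-example gradient of (pred - y)^2 with respect to the weights.
        grad = [2.0 * (pred - y) * xi for xi in x]
        clipped = clip_by_l2_norm(grad, max_norm)
        summed = [s + c for s, c in zip(summed, clipped)]
    # Gaussian noise scaled to the clipping bound, added to the summed gradient.
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / len(batch) for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stage 1 stand-in: weights assumed to come from non-private public pre-training.
    weights = [0.5, -0.2]
    # Stage 2: a hypothetical private batch of (features, target) pairs.
    private_batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
    for _ in range(5):
        weights = dp_sgd_step(weights, private_batch, lr=0.1,
                              max_norm=1.0, noise_multiplier=1.1, rng=rng)
    print(weights)
```

Starting from publicly pre-trained weights rather than a random initialization is the design choice the paper studies: the noisy updates only need to adapt the model to the private domain, not learn it from scratch.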

Differentially Private Language Models Benefit from Public Pre-training

Differentially Private Fine-tuning of Language Models

Differentially Private Distributed Learning for Language Modeling Tasks

Large Language Models Can Be Strong Differentially Private Learners

Differentially Private Language Models for Secure Data Sharing

Differentially Private Natural Language Models: Recent Advances and Future Directions

Does Differential Privacy Impact Bias in Pretrained NLP Models?

Improving Differentially Private Models with Active Learning

Differentially Private Continual Learning using Pre-Trained Models

Fine-Tuning Large Language Models with User-Level Differential Privacy

Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning

DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Selective Pre-training for Private Fine-tuning

Pre-training Differentially Private Models with Limited Public Data

Training Private Models That Know What They Don't Know

LMO-DP: Optimizing the Randomization Mechanism for Differentially Private Fine-Tuning (Large) Language Models

Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining

Differentially Private Optimizers Can Learn Adversarially Robust Models

Efficient and Private: Memorisation under differentially private parameter-efficient fine-tuning in language models

Differentially Private Learning Needs Better Model Initialization and Self-Distillation

Optimal Differentially Private Model Training with Public Data