Abstract: In this paper we develop state-of-the-art privacy attacks against Large Language Models (LLMs), where an adversary with some access to the model tries to learn something about the underlying training data. Our headline results are new membership inference attacks (MIAs) against pretrained LLMs that perform hundreds of times better than baseline attacks, and a pipeline showing that over 50% (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM in natural settings. We consider varying degrees of access to the underlying model, both pretraining and fine-tuning data, and both MIAs and training data extraction. For pretraining data, we propose two new MIAs: a supervised neural network classifier that predicts training data membership on the basis of (dimensionality-reduced) model gradients, and a variant of this attack that requires only logit access to the model by leveraging recent model-stealing work on LLMs. To our knowledge this is the first MIA that explicitly incorporates model-stealing information. Both attacks outperform existing black-box baselines, and our supervised attack closes the gap between MIA success against LLMs and the strongest known attacks for other machine learning models. In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models achieves near-perfect MIA performance; we then leverage our MIA to extract a large fraction of the fine-tuning dataset from fine-tuned Pythia and Llama models. Our code is available at http://github.com/safr-ai-lab/pandora-llm.
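The loss-ratio attack on fine-tuned models mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the direction of the ratio (fine-tuned loss divided by base loss), and the threshold value are all assumptions made for illustration. The sketch takes per-token log-probabilities that have already been obtained from each model and scores a candidate sequence; a low ratio means the fine-tuned model fits the sequence much better than the base model does, which is taken here as evidence of membership in the fine-tuning set.

```python
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood of a sequence from per-token log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def loss_ratio_score(
    base_logprobs: Sequence[float],
    finetuned_logprobs: Sequence[float],
) -> float:
    """Hypothetical membership score: fine-tuned loss divided by base loss.

    Assumption: lower values suggest the sequence was in the fine-tuning data.
    """
    base_loss = mean_nll(base_logprobs)
    finetuned_loss = mean_nll(finetuned_logprobs)
    if base_loss <= 0.0:
        raise ValueError("base loss must be positive to form a ratio")
    return finetuned_loss / base_loss


def predict_member(score: float, threshold: float = 0.5) -> bool:
    """Flag a sequence as a member when its score falls below the threshold.

    The default threshold is illustrative, not a value from the paper.
    """
    return score < threshold


# Toy log-probabilities standing in for real model outputs.
base = [-2.0, -2.0]
seen_during_finetuning = [-0.5, -0.5]   # fine-tuned model is far more confident
unseen = [-1.9, -2.1]                   # fine-tuned model behaves like the base

member_score = loss_ratio_score(base, seen_during_finetuning)
non_member_score = loss_ratio_score(base, unseen)
```

In practice the threshold would be calibrated on held-out data with known membership, and the log-probabilities would come from evaluating the base and fine-tuned checkpoints on the same tokenized text.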

Teach LLMs to Phish: Stealing Private Information from Language Models

Data Stealing Attacks against Large Language Models via Backdooring

Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models

Training Data Leakage Analysis in Language Models

Information Leakage from Embedding in Large Language Models

Extracting Training Data from Large Language Models

Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage

Are Large Pre-Trained Language Models Leaking Your Personal Information?

Can LLMs be Fooled? Investigating Vulnerabilities in LLMs

Vocabulary Attack to Hijack Large Language Model Applications

What can we learn from Data Leakage and Unlearning for Law?

Combing for Credentials: Active Pattern Extraction from Smart Reply

Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks

Identifying and Mitigating Privacy Risks Stemming from Language Models: A Survey

Can Language Models be Instructed to Protect Personal Information?

Learnable Privacy Neurons Localization in Language Models

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities

An Inversion Attack Against Obfuscated Embedding Matrix in Language Model Inference

Stealing Machine Learning Models via Prediction APIs

Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models

Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer