Abstract: Recently, a new paradigm has arisen in Natural Language Processing (NLP): building general-purpose language models (e.g., Google's Bert and OpenAI's GPT-2) for text feature extraction, a standard procedure in NLP systems that converts texts to vectors (i.e., embeddings) for downstream modeling. This paradigm is starting to find applications in various downstream NLP tasks and real-world systems (e.g., Google's search engine [6]). To obtain general-purpose text embeddings, these language models have highly complicated architectures with millions of learnable parameters and are usually pretrained on billions of sentences before being utilized. As is widely recognized, such a practice indeed improves the state-of-the-art performance of many downstream NLP tasks. However, the improved utility does not come for free. We find that the text embeddings from general-purpose language models capture much sensitive information from the plain text. Once accessed by an adversary, the embeddings can be reverse-engineered to disclose sensitive information about the victims for further harassment. Although such a privacy risk can pose a real threat to the future use of these promising NLP tools, there have so far been neither published attacks nor systematic evaluations for the mainstream industry-level language models. To bridge this gap, we present the first systematic study on the privacy risks of 8 state-of-the-art language models with 4 diverse case studies. By constructing 2 novel attack classes, our study demonstrates that the aforementioned privacy risks do exist and can pose practical threats to the application of general-purpose language models on sensitive data covering identity, genome, healthcare and location. For example, we show that an adversary with nearly no prior knowledge can achieve about 75% accuracy when inferring the precise disease site from Bert embeddings of patients' medical descriptions.
As possible countermeasures, we propose 4 different defenses (via rounding, differential privacy, adversarial training and subspace projection) to obfuscate the unprotected embeddings for mitigation. With extensive evaluations, we also provide a preliminary analysis of the utility-privacy trade-off brought by each defense, which we hope may foster future mitigation research.

Contrast-Then-Approximate: Analyzing Keyword Leakage of Generative Language Models

The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks

The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks

User Inference Attacks on Large Language Models

Turning Generative Models Degenerate: The Power of Data Poisoning Attacks

Unveiling the Vulnerability of Private Fine-Tuning in Split-Based Frameworks for Large Language Models: A Bidirectionally Enhanced Attack

Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data

Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models

Data Stealing Attacks against Large Language Models via Backdooring

Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks

Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage

A Study of Backdoors in Instruction Fine-tuned Language Models

SoK: Reducing the Vulnerability of Fine-tuned Language Models to Membership Inference Attacks

Goal-guided Generative Prompt Injection Attack on Large Language Models

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning

PreCurious: How Innocent Pre-Trained Language Models Turn into Privacy Traps

Privacy Risks of General-Purpose Language Models.

Teach LLMs to Phish: Stealing Private Information from Language Models

TMI! Finetuned Models Leak Private Information from their Pretraining Data

Hidden Backdoors in Human-Centric Language Models

Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models