Abstract: Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Controlling what an LLM should not know is nevertheless important for ensuring alignment and thus safe use. Accurately and efficiently unlearning knowledge from an LLM remains challenging for two reasons: the fuzzy boundary between retention and forgetting can cause collateral damage, and optimization across state-of-the-art models with hundreds of billions of parameters demands large computational resources. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for large language models that addresses both knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. Offline, we learn corruptions added to prompt embeddings via zeroth-order optimization toward the unlearning objective; during inference, we corrupt the prompts flagged by the classifier. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate that our method achieves promising unlearning with nearly zero side effects, both in general domains and in domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases. We have made our code publicly available at https://github.com/chrisliu298/llm-unlearn-eco.
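The two stages described in the abstract (gating on a prompt classifier at inference time, and learning the corruption offline without gradients) can be illustrated with a minimal sketch. This is not the released implementation. The Gaussian-noise corruption, the single scalar strength `sigma`, the `threshold` value, the two-point finite-difference update, and all function names are assumptions chosen for illustration; the classifier is represented only by its output probability `p_forget`, and `loss_fn` stands in for whatever unlearning objective is evaluated on model outputs.

```python
import numpy as np


def corrupt_embeddings(embeddings, sigma, rng):
    """Add Gaussian noise of strength sigma to each token embedding.

    Assumption: the corruption is additive Gaussian noise with one scalar strength.
    """
    return embeddings + sigma * rng.standard_normal(embeddings.shape)


def eco_forward(embeddings, p_forget, sigma, rng, threshold=0.5):
    """Return the embeddings to feed the LLM.

    Prompts the classifier flags (p_forget >= threshold) are corrupted;
    all other prompts pass through unchanged, so the model weights are never touched.
    """
    if p_forget >= threshold:
        return corrupt_embeddings(embeddings, sigma, rng)
    return embeddings


def zeroth_order_step(loss_fn, sigma, mu=1e-2, lr=0.1):
    """One gradient-free update of sigma.

    Uses a two-point finite-difference estimate of d loss / d sigma,
    so only evaluations of loss_fn are needed, never backpropagation.
    """
    grad_est = (loss_fn(sigma + mu) - loss_fn(sigma - mu)) / (2 * mu)
    return sigma - lr * grad_est


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt_embeddings = rng.standard_normal((5, 16))  # 5 tokens, 16-dim (toy sizes)

    # Hypothetical stand-in objective: a real one would score the LLM's outputs
    # on forget-set prompts under corruption strength s.
    toy_loss = lambda s: (s - 2.0) ** 2

    sigma = 0.0
    for _ in range(50):
        sigma = zeroth_order_step(toy_loss, sigma)

    flagged = eco_forward(prompt_embeddings, p_forget=0.9, sigma=sigma, rng=rng)
    passed = eco_forward(prompt_embeddings, p_forget=0.1, sigma=sigma, rng=rng)
    print(sigma, np.abs(flagged - prompt_embeddings).mean(), np.array_equal(passed, prompt_embeddings))
```

Because the update touches only the corruption parameter and needs only loss evaluations, the cost of this sketch does not depend on the number of model parameters, which is consistent with the scalability claim in the abstract.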

Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference

LLM Unlearning via Loss Adjustment with Only Forget Data

Unlearn What You Want to Forget: Efficient Unlearning for LLMs

MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts

Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

ULMR: Unlearning Large Language Models Via Negative Response and Model Parameter Average

Practical Unlearning for Large Language Models

A Closer Look at Machine Unlearning for Large Language Models

To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models

Offset Unlearning for Large Language Models

UNLEARN Efficient Removal of Knowledge in Large Language Models

Does your LLM truly unlearn? An embarrassingly simple approach to recover unlearned knowledge

To Each (Textual Sequence) Its Own: Improving Memorized-Data Unlearning in Large Language Models

Unified Parameter-Efficient Unlearning for LLMs

Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models

Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Towards Robust Evaluation of Unlearning in LLMs via Data Transformations

Soft Prompting for Unlearning in Large Language Models

LoRA Unlearns More and Retains More (Student Abstract)

Large Language Model Unlearning via Embedding-Corrupted Prompts