Abstract: When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether their predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering), forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even on examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
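
As a rough illustration of the random-label setup described above, the following sketch builds a "forget set" by replacing each gold label with one drawn uniformly at random. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `make_forget_set` and `accuracy`, the toy examples, and the label names are all hypothetical, and the fine-tuning step itself is omitted.

```python
import random


def make_forget_set(examples, labels, seed=0):
    """Replace each gold label with one drawn uniformly at random from `labels`.

    `examples` is a list of (text, gold_label) pairs. Hypothetical helper,
    not taken from the paper.
    """
    rng = random.Random(seed)  # fixed seed so the forget set is reproducible
    return [(text, rng.choice(labels)) for text, _gold in examples]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy entailment-style examples, invented for illustration only.
examples = [
    ("A man is sleeping. / A person is asleep.", "entailment"),
    ("A dog runs. / The dog is still.", "contradiction"),
    ("A woman reads. / She is reading a novel.", "neutral"),
]
labels = ["entailment", "contradiction", "neutral"]

# Inputs are unchanged; only the labels are randomized.
forget_set = make_forget_set(examples, labels)
```

Under this setup, a model fine-tuned on `forget_set` would be evaluated with something like `accuracy` on both the forget-set inputs and held-out examples; the abstract's finding is that accuracy drops toward chance on the former but not necessarily on the latter.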

Measuring Forgetting of Memorized Training Examples

FedME2: Memory Evaluation & Erase Promoting Federated Unlearning in DTMN

Learn to Forget: Memorization Elimination for Neural Networks.

Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI Models

Machine Learning Models that Remember Too Much

Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

"Forgetting" in Machine Learning and Beyond: A Survey

Measuring Catastrophic Forgetting in Neural Networks

Example forgetting and rehearsal in continual learning

Learn to Forget: Machine Unlearning Via Neuron Masking

Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics

Unforgettable Generalization in Language Models

Learning with Recoverable Forgetting

Digital Forgetting in Large Language Models: A Survey of Unlearning Methods

On the Privacy Effect of Data Enhancement Via the Lens of Memorization

The Privacy Onion Effect: Memorization is Relative

Mixed-Privacy Forgetting in Deep Networks

Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models

Counterfactual Memorization in Neural Language Models

How much can we forget about Data Contamination?