Abstract: Machine unlearning, a novel area within artificial intelligence, addresses the challenge of selectively forgetting or reducing undesirable knowledge or behaviors in machine learning models, particularly large language models (LLMs). This paper introduces a methodology to align LLMs, such as Open Pre-trained Transformer (OPT) language models, with ethical, privacy, and safety standards by using gradient ascent for knowledge unlearning. Our approach is dual-pronged: it selectively erases or modifies learned information in LLMs, targeting harmful responses and copyrighted content. To mitigate harmful responses, we applied gradient ascent on the PKU dataset, achieving a 75% reduction in harmful responses for OPT1.3b and OPT2.7b (Zhang et al., 2022) while retaining previous knowledge using the TruthfulQA dataset (Lin et al., 2021). To handle copyrighted content, we constructed a custom dataset based on the Lord of the Rings corpus and fine-tuned the same models on it with LoRA, Low-Rank Adaptation of Large Language Models (Hu et al., 2021). We then employed gradient ascent to unlearn the Lord of the Rings content, which substantially reduced the presence of copyrighted material. To maintain a diverse knowledge base, we used the Book Corpus dataset. Additionally, we propose a new evaluation technique for assessing the effectiveness of harmful unlearning.

What problem does this paper attempt to address?

This paper discusses the application of machine unlearning to large language models (LLMs): selectively forgetting or reducing unwanted knowledge or behavior, with a focus on harmful responses and copyrighted content. It proposes a knowledge-unlearning method based on gradient ascent so that LLMs comply with ethical, privacy, and safety standards.

To reduce harmful responses, the paper applies gradient ascent on the PKU dataset to Open Pre-trained Transformer language models (OPT1.3b and OPT2.7b), lowering the rate of harmful responses while using the TruthfulQA dataset to preserve existing knowledge.

To handle copyrighted content, the study constructs a custom dataset based on "The Lord of the Rings" corpus, fine-tunes the LLMs on it with LoRA, and then uses gradient ascent to unlearn "The Lord of the Rings" content, significantly reducing the presence of copyrighted material. The Book Corpus dataset is used to maintain knowledge diversity.

The paper also presents a new evaluation technique for harmful-content unlearning: a classifier is trained to determine whether text is harmful and is then applied to the outputs of the aligned LLM, giving a quantitative measurement of how well the model has forgotten harmful content.

Overall, the main contributions of the paper are:
1. Reducing harmful responses by applying gradient ascent on the PKU dataset, significantly decreasing harmful output while preserving valuable knowledge with the TruthfulQA dataset.
2. Handling copyrighted content by building a custom dataset based on "The Lord of the Rings", fine-tuning with LoRA, and then unlearning the content through gradient ascent to reduce its presence in the model's output.
3. Introducing a new evaluation technique that trains a classifier to quantitatively assess the model's ability to forget harmful content.
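The gradient ascent idea described above can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: the two-prompt lookup "model", the prompt names, the learning rate, and the way the forget and retain updates are combined in one step are all assumptions made for the example. The only point it shows is the sign flip: the update moves along the gradient of the loss on data to forget and against the gradient of the loss on data to retain.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_grad(logits, target):
    """Gradient of -log softmax(logits)[target] with respect to the logits."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def unlearn_step(model, forget, retain, lr=0.5):
    """One update: ascend the forget loss, descend the retain loss."""
    f_prompt, f_target = forget
    r_prompt, r_target = retain
    f_grad = nll_grad(model[f_prompt], f_target)
    r_grad = nll_grad(model[r_prompt], r_target)
    # Gradient ascent: step ALONG the gradient so the forget loss goes up.
    model[f_prompt] = [w + lr * g for w, g in zip(model[f_prompt], f_grad)]
    # Gradient descent: step AGAINST the gradient so the retain loss stays low.
    model[r_prompt] = [w - lr * g for w, g in zip(model[r_prompt], r_grad)]

# Hypothetical toy "model": one row of next-token logits per prompt.
# Token 0 stands in for a harmful answer, token 1 for a benign one.
model = {
    "harmful_prompt": [2.0, 0.0, 0.0],
    "benign_prompt": [0.0, 2.0, 0.0],
}
forget = ("harmful_prompt", 0)  # (prompt, token whose likelihood should drop)
retain = ("benign_prompt", 1)   # (prompt, token whose likelihood should be kept)

p_forget_before = softmax(model["harmful_prompt"])[0]
p_retain_before = softmax(model["benign_prompt"])[1]

for _ in range(20):
    unlearn_step(model, forget, retain)

p_forget_after = softmax(model["harmful_prompt"])[0]
p_retain_after = softmax(model["benign_prompt"])[1]

print(f"forget target: {p_forget_before:.3f} -> {p_forget_after:.3f}")
print(f"retain target: {p_retain_before:.3f} -> {p_retain_after:.3f}")
```

In this sketch the forget loss is unbounded above, so the number of ascent steps acts as the stopping rule; in practice that is one reason a retain set is used alongside the forget set.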
The paper also discusses other unlearning methods, such as RLHF, and the effects of different optimizers (such as Adam and Adagrad), and it introduces techniques for evaluating whether an LLM has forgotten a concept, particularly when prompts are rephrased.
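The classifier-based evaluation mentioned in this summary can be sketched as a harmful-response rate measured before and after unlearning. The details here are assumptions: `keyword_classifier` is a hypothetical stand-in for the trained harmfulness classifier, the sample responses are invented for illustration, and the relative-reduction formula is one plausible way to read a figure such as "a 75% reduction", not a definition taken from the paper.

```python
def harmful_rate(responses, is_harmful):
    """Fraction of responses that the classifier flags as harmful."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if is_harmful(r)) / len(responses)

def relative_reduction(rate_before, rate_after):
    """Relative drop in harmful rate, e.g. 0.75 means a 75% reduction."""
    if rate_before == 0:
        raise ValueError("harmful rate before unlearning is zero")
    return (rate_before - rate_after) / rate_before

def keyword_classifier(text):
    """Hypothetical placeholder; a real setup would use a trained classifier."""
    return any(word in text.lower() for word in ("weapon", "poison"))

# Invented model outputs for the same prompts, before and after unlearning.
before = [
    "Here is how to build a weapon.",
    "You could use poison.",
    "Buy a weapon online.",
    "Mix the poison carefully.",
]
after = [
    "I can't help with that.",
    "Please seek professional advice.",
    "Here is how to build a weapon.",
    "I won't assist with this request.",
]

rate_before = harmful_rate(before, keyword_classifier)
rate_after = harmful_rate(after, keyword_classifier)
reduction = relative_reduction(rate_before, rate_after)
print(f"harmful rate: {rate_before:.2f} -> {rate_after:.2f}, reduction {reduction:.0%}")
```

To probe robustness to rephrasing, the same function could be run on responses to paraphrased versions of each prompt and the two rates compared.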

Machine Unlearning in Large Language Models

A Closer Look at Machine Unlearning for Large Language Models

Machine Unlearning of Pre-trained Large Language Models

Rethinking Machine Unlearning for Large Language Models

Fine-grained Pluggable Gradient Ascent for Knowledge Unlearning in Language Models

Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models

Towards Safer Large Language Models through Machine Unlearning

The Frontier of Data Erasure: Machine Unlearning for Large Language Models

UNLEARN Efficient Removal of Knowledge in Large Language Models

LLM Surgery: Efficient Knowledge Unlearning and Editing in Large Language Models

Offset Unlearning for Large Language Models

Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models

Alternate Preference Optimization for Unlearning Factual Knowledge in Large Language Models

Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models

Towards Robust Evaluation of Unlearning in LLMs via Data Transformations

Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning

Who's Harry Potter? Approximate Unlearning in LLMs

Cross-Lingual Unlearning of Selective Knowledge in Multilingual Language Models

RKLD: Reverse KL-Divergence-based Knowledge Distillation for Unlearning Personal Information in Large Language Models