Machine Unlearning in Large Language Models

Saaketh Koundinya Gundavarapu, Shreya Agarwal, Arushi Arora, Chandana Thimmalapura Jagadeeshaiah
2024-05-24
Abstract: Machine unlearning, a novel area within artificial intelligence, addresses the challenge of selectively forgetting or reducing undesirable knowledge or behaviors in machine learning models, particularly large language models (LLMs). This paper introduces a methodology to align LLMs, such as Open Pre-trained Transformer Language Models (OPT), with ethical, privacy, and safety standards by using gradient ascent for knowledge unlearning. Our approach is dual-pronged: it selectively erases or modifies learned information in LLMs, targeting harmful responses and copyrighted content. To mitigate harmful responses, we applied gradient ascent on the PKU dataset, achieving a 75% reduction in harmful responses for OPT1.3b and OPT2.7b (Zhang et al., 2022) while retaining previous knowledge using the TruthfulQA dataset (Lin et al., 2021). For handling copyrighted content, we constructed a custom dataset based on the Lord of the Rings corpus and aligned the same models through LoRA (Low-Rank Adaptation of Large Language Models; Hu et al., 2021) fine-tuning. Subsequently, we employed gradient ascent to unlearn the Lord of the Rings content, resulting in a marked reduction in the presence of copyrighted material. To maintain a diverse knowledge base, we utilized the Book Corpus dataset. Additionally, we propose a new evaluation technique for assessing the effectiveness of harmful unlearning.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper applies machine unlearning to large language models (LLMs): selectively forgetting or reducing unwanted knowledge or behavior, with a focus on harmful responses and copyrighted content. It proposes knowledge unlearning via gradient ascent so that LLMs comply with ethical, privacy, and safety standards.
For harmful responses, the study applies gradient ascent on the PKU dataset to Open Pre-trained Transformer Language Models (OPT1.3b and OPT2.7b), lowering the rate of harmful responses while using the TruthfulQA dataset to preserve previously learned knowledge. For copyrighted content, the study constructs a custom dataset based on the "The Lord of the Rings" corpus, fine-tunes the LLMs on it using LoRA, and then uses gradient ascent to unlearn that content, significantly reducing the presence of copyrighted material. The Book Corpus dataset is used to maintain knowledge diversity.
The paper also presents a new evaluation technique for harmful content unlearning: a classifier is trained to determine whether a text is harmful and is then applied to the outputs of the aligned LLM, giving a quantitative measure of how well the model has forgotten harmful content.
Overall, the main contributions of the paper are:
1. Reducing harmful responses by applying gradient ascent on the PKU dataset, significantly decreasing harmful output while preserving valuable knowledge with the TruthfulQA dataset.
2. Handling copyrighted content with a custom dataset based on "The Lord of the Rings" and LoRA fine-tuning, then removing this content through gradient ascent to reduce its presence in the model's output.
3. Introducing a classifier-based evaluation technique that quantitatively assesses the model's ability to forget harmful content.
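The gradient ascent step summarized above can be illustrated on a toy model. The sketch below is not the paper's implementation: it replaces OPT with a four-token softmax distribution, and it assumes the retained data enters as an ordinary descent term alongside the ascent term, which the summary does not specify. All names (`unlearn_step`, `run_unlearning`, `retain_weight`) are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_grad(logits, target):
    """Gradient of -log p(target) with respect to the logits."""
    return [p - (1.0 if i == target else 0.0)
            for i, p in enumerate(softmax(logits))]

def mean_grad(logits, targets):
    """Average the per-example gradients over a batch of target tokens."""
    grads = [nll_grad(logits, t) for t in targets]
    return [sum(col) / len(grads) for col in zip(*grads)]

def unlearn_step(logits, forget, retain, lr=0.1, retain_weight=1.0):
    """One update: ascend the loss on `forget`, descend it on `retain`."""
    g_forget = mean_grad(logits, forget)
    g_retain = mean_grad(logits, retain)
    # "+ lr * gf" is gradient ASCENT (raise the loss on data to forget);
    # "- lr * gr" is ordinary gradient descent on data to keep (assumed).
    return [w + lr * gf - lr * retain_weight * gr
            for w, gf, gr in zip(logits, g_forget, g_retain)]

def run_unlearning(steps=50):
    """Toy run: token 0 stands in for memorized content to be forgotten."""
    logits = [2.0, 0.0, 0.0, 0.0]   # the model strongly prefers token 0
    forget, retain = [0], [1, 2]    # hypothetical forget / retain batches
    before = softmax(logits)
    for _ in range(steps):
        logits = unlearn_step(logits, forget, retain)
    return before, softmax(logits)

before, after = run_unlearning()
print("p(forget token) before:", round(before[0], 3))
print("p(forget token) after: ", round(after[0], 3))
```

Because the negative log-likelihood has no upper bound, the ascent term can grow without limit, so a sketch like this caps the number of steps rather than running to convergence.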
The paper also discusses other unlearning approaches such as RLHF, compares the effects of different optimizers (such as Adam and Adagrad), and introduces techniques for evaluating whether an LLM has forgotten a concept, particularly when prompts are rephrased.
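The classifier-based evaluation summarized above amounts to comparing the fraction of responses flagged as harmful before and after unlearning. The following is a minimal sketch under stated assumptions: `toy_is_harmful` is a hypothetical keyword stand-in for the trained classifier, the response strings are made-up toy data, and reading a "reduction" as a relative drop in the harmful rate is an assumption, since the summary does not say whether the reported figure is relative or absolute.

```python
def harmful_rate(responses, is_harmful):
    """Fraction of responses that the classifier flags as harmful."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if is_harmful(r)) / len(responses)

def relative_reduction(rate_before, rate_after):
    """Relative drop in harmful rate, e.g. 0.75 for a 75% reduction."""
    if rate_before == 0:
        raise ValueError("baseline harmful rate is zero")
    return (rate_before - rate_after) / rate_before

def toy_is_harmful(text):
    """Hypothetical stand-in for a trained harmfulness classifier."""
    return "HARM" in text

# Made-up toy responses to the same four prompts, for illustration only.
responses_before = ["HARM a", "HARM b", "HARM c", "HARM d"]
responses_after = ["I cannot help with that.", "I cannot help with that.",
                   "HARM c", "I cannot help with that."]

rate_before = harmful_rate(responses_before, toy_is_harmful)  # 4/4 = 1.0
rate_after = harmful_rate(responses_after, toy_is_harmful)    # 1/4 = 0.25
print(relative_reduction(rate_before, rate_after))            # 0.75
```

The same two functions can be run separately on responses to original and rephrased prompts, so that forgetting which holds only for the exact training phrasing shows up as a higher harmful rate on the rephrased set.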