Detoxifying Large Language Models via Knowledge Editing

Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen
2024-05-28
Abstract: This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and provides comprehensive metrics for systematic evaluation. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxifying approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper explores how to detoxify large language models (LLMs) through knowledge editing techniques, so that they respond safely to harmful queries. Specifically, the paper constructs a benchmark dataset called SafeEdit, which covers 9 unsafe categories and is equipped with comprehensive evaluation metrics. Experiments indicate that knowledge editing techniques can detoxify LLMs with limited impact on general performance.

#### Main Contributions:

1. **Construction of the SafeEdit Benchmark Dataset**: Covers 9 unsafe categories with a variety of attack prompts, and provides comprehensive metrics for systematic evaluation.
2. **Proposing the DINM Method**: A simple and effective baseline that locates and edits toxic regions through neural monitoring, requiring only a single instance and a few tuning steps.
3. **In-depth Analysis of Internal Mechanisms**: Demonstrates that existing methods like SFT and DPO may only suppress the activation of toxic parameters, while DINM can mitigate the toxicity of these parameters to some extent, making a permanent adjustment.

#### Research Background:

With the development of large language models such as ChatGPT, LLaMA, and Mistral, their ability to handle harmful queries has garnered widespread attention. Although methods like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) have significantly improved safety, the resulting safeguards can still be bypassed by carefully designed attack prompts. Therefore, researchers have posed a new question: Is it possible to achieve detoxification by precisely modifying the toxic regions within LLMs?
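
The summary above says that DINM locates toxic regions but does not spell out how. As a minimal sketch, assume the locating step compares the per-layer hidden states a model produces for a safe response and an unsafe response to the same adversarial prompt, and picks the layer where the two diverge most. The function names (`l2_distance`, `locate_toxic_layer`) and the toy numbers are hypothetical illustrations, not the authors' code or the EasyEdit API; in practice the states would come from a forward pass of the LLM.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def locate_toxic_layer(safe_states: List[Vector],
                       unsafe_states: List[Vector]) -> int:
    """Return the index of the layer whose hidden states differ most.

    Assumption: one hidden-state vector per layer for each response,
    e.g. the state at the last token position.
    """
    if not safe_states or len(safe_states) != len(unsafe_states):
        raise ValueError("need the same non-zero number of layers for both")
    gaps = [l2_distance(s, u) for s, u in zip(safe_states, unsafe_states)]
    return max(range(len(gaps)), key=gaps.__getitem__)


# Toy hidden states (made-up numbers): 3 layers, hidden size 2.
safe = [[0.1, 0.2], [0.5, 0.1], [0.3, 0.3]]
unsafe = [[0.1, 0.3], [1.5, -0.9], [0.2, 0.3]]

toxic_layer = locate_toxic_layer(safe, unsafe)
print(f"layer with the largest safe/unsafe gap: {toxic_layer}")
```

Under this reading, the subsequent editing step would tune only the parameters of the selected layer on the single safe/unsafe instance, leaving the rest of the model untouched; that step is omitted here because the summary gives no detail on it.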