Detoxifying Large Language Models via Knowledge Editing

Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen

2024-05-28

Abstract: This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and equips comprehensive metrics for systematic evaluation. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to detoxify LLMs with a limited impact on general performance efficiently. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), to diminish the toxicity of LLMs within a few tuning steps via only one instance. We further provide an in-depth analysis of the internal mechanism for various detoxifying approaches, demonstrating that previous methods like SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope that these insights could shed light on future work of developing detoxifying approaches and the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction, Machine Learning

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper explores how to detoxify large language models (LLMs) through knowledge editing, so that they respond safely to harmful queries. Specifically, the paper constructs a benchmark dataset called SafeEdit, which covers 9 unsafe categories and comes with comprehensive evaluation metrics. Experiments indicate that knowledge editing can detoxify LLMs efficiently, with limited impact on general performance.

#### Main Contributions:

1. **Construction of the SafeEdit Benchmark Dataset**: Covers 9 unsafe categories with a variety of powerful attack prompts, and provides comprehensive evaluation metrics.
2. **Proposing the DINM Method**: A simple and effective baseline, Detoxifying with Intraoperative Neural Monitoring (DINM), that locates and edits toxic regions, using only a single instance and a few tuning steps.
3. **In-depth Analysis of Internal Mechanisms**: Demonstrates that existing methods like SFT and DPO may only suppress the activations of toxic parameters, while DINM mitigates the toxicity of these parameters to some extent, making the adjustment permanent.

#### Research Background:

With the development of large language models such as ChatGPT, LLaMA, and Mistral, their ability to handle harmful queries has garnered widespread attention. Although methods like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) have significantly improved safety, the safeguards of these models can still be bypassed by carefully designed attack prompts. Researchers have therefore posed a new question: is it possible to detoxify an LLM by precisely modifying the toxic regions inside it?
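The "locate" half of a locate-then-edit approach such as DINM can be illustrated with a minimal sketch. It assumes that the toxic region is taken to be the layer whose hidden states differ most between a safe and an unsafe response to the same prompt. That selection rule, the function names, and the toy vectors are illustrative assumptions, not details given in the summary above, and the editing step (tuning that layer on a single instance) is omitted.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def locate_toxic_layer(
    safe_hidden: List[Sequence[float]],
    unsafe_hidden: List[Sequence[float]],
) -> int:
    """Return the index of the layer with the largest safe/unsafe gap.

    Each argument holds one hidden-state vector per layer (hypothetical
    inputs; in practice they would come from a forward pass of the model).
    """
    if not safe_hidden or len(safe_hidden) != len(unsafe_hidden):
        raise ValueError("need the same, non-zero number of layers")
    distances = [l2_distance(s, u) for s, u in zip(safe_hidden, unsafe_hidden)]
    # Assumption: the layer with the maximal distance is the edit target.
    return max(range(len(distances)), key=distances.__getitem__)


# Toy hidden states for a three-layer model (made-up numbers).
safe = [[0.1, 0.2], [0.5, 0.4], [1.0, -1.0]]
unsafe = [[0.1, 0.25], [0.9, 1.3], [1.1, -0.8]]
toxic_layer = locate_toxic_layer(safe, unsafe)
print(f"layer selected for editing: {toxic_layer}")
```

In this toy setup the middle layer shows the largest gap, so it would be the only layer whose parameters are updated, while the rest of the model stays frozen.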

Precision Knowledge Editing: Enhancing Safety in Large Language Models

A Comprehensive Study of Knowledge Editing for Large Language Models

Unveiling the Pitfalls of Knowledge Editing for Large Language Models

Knowledge Editing on Black-box Large Language Models

EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models

Identifying Knowledge Editing Types in Large Language Models

Uncovering Overfitting in Large Language Model Editing

Editing the Mind of Giants: An In-Depth Exploration of Pitfalls of Knowledge Editing in Large Language Models

Editing Large Language Models: Problems, Methods, and Opportunities

Editing Factual Knowledge and Explanatory Ability of Medical Large Language Models

Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

Editing Conceptual Knowledge for Large Language Models

AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models

Benchmarking Chinese Knowledge Rectification in Large Language Models

Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue

Large Language Models can be Strong Self-Detoxifiers