Fine-grained detoxification framework via instance-level prefixes for large language models
Xin Yi,Linlin Wang,Xiaoling Wang,Liang He
DOI: https://doi.org/10.1016/j.neucom.2024.128684
IF: 6
2024-10-05
Neurocomputing
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their practical usability is often compromised by a propensity to generate toxic content, such as insults, threats, and profanity, particularly in response to adversarial prompts. Several fine-tuning and decoding approaches have been employed to address this challenge to mitigate toxicity. Nevertheless, these methods typically necessitate additional resources, such as high-quality training data or auxiliary models, thereby incurring extra costs. In this paper, we propose a novel detoxification framework, Fine-Grained Detoxification via Instance-Level Prefixes (FGDILP), which effectively mitigates the generation of toxic text without incurring additional training costs. Specifically, FGDILP leverages contextualized representations in attention space by contrasting a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. This methodology facilitates the construction of fine-grained subtoxicity vectors, which are subsequently fused to adjust the original generation pathway when responding to raw prompts. Our results demonstrate that FGDILP enables controlled text generation concerning detoxification at both the utterance and context levels. While our method slightly impacts generation fluency and diversity, it significantly outperforms prompt-based baselines regarding detoxification effectiveness. Our code is available at https://github.com/xinykou/FGDILP .
computer science, artificial intelligence