Abstract: Prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. Notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. Additionally, these works do not address non-detoxifiability, a phenomenon whereby the toxic text cannot be detoxified without altering the meaning. We propose DetoxLLM, the first comprehensive end-to-end detoxification framework, which attempts to alleviate the aforementioned limitations. We first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies leveraging ChatGPT. We then train a suite of detoxification models with our cross-platform corpus. We show that our detoxification models outperform the SoTA model trained with human-annotated parallel corpus. We further introduce explanation to promote transparency and trustworthiness. DetoxLLM additionally offers a unique paraphrase detector especially dedicated for the detoxification task to tackle the non-detoxifiable cases. Through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of DetoxLLM against adversarial toxicity.

What problem does this paper attempt to address?

The paper addresses four main problems:

1. **Cross-platform Detoxification**: Previous detoxification work was usually limited to data from specific platforms and ignored the language differences between platforms. As a result, how existing detoxification models perform on unseen platforms is still unclear. The paper proposes a cross-platform detoxification framework that aims to improve generalization across platforms.
2. **Non-detoxifiability**: Some toxic texts cannot be detoxified without changing their original meaning. This problem has received little attention in previous detoxification research. The paper introduces a specialized paraphrase detector that identifies these non-detoxifiable cases and warns users when necessary.
3. **Explainability**: To improve the transparency and trustworthiness of the detoxification model, the paper adds explanation generation: the model outputs the detoxified text and also explains why the input text is considered toxic.
4. **Safety Limitations of Large Language Models**: Some large language models (LLMs) may refuse to respond to toxic inputs because of their safety mechanisms. The paper explores how to detoxify effectively in such cases.

### Main Contributions

1. **Proposing the DetoxLLM Framework**: The first end-to-end framework that handles cross-platform detoxification, deals with non-detoxifiability, and provides detoxification explanations.
2. **Constructing a Cross-platform Pseudo-parallel Corpus**: A cross-platform pseudo-parallel detoxification corpus is built through multi-step data processing and prompt engineering.
3. **Experimental Evaluation and Comparison**: Experiments show that DetoxLLM performs strongly on cross-platform detoxification, especially in terms of accuracy and fluency.
4. **Training a Specialized Paraphrase Detector**: To deal with non-detoxifiability, a dedicated paraphrase detector is trained; it is reported to perform significantly better than existing paraphrase detectors.
5. **Countering Implicit and Word-level Adversarial Toxicity**: Extensive experimental analysis demonstrates the effectiveness of the cross-platform data and the robustness of DetoxLLM against implicit and word-level adversarial toxicity.

### Method Overview

1. **Data Collection**: Toxic and non-toxic data are collected from multiple platforms to construct a cross-platform detoxification corpus.
2. **Data Generation**: Parallel toxic and non-toxic data are generated with ChatGPT using a "jailbreaking" technique.
3. **Data Filtering**: Platform-specific toxicity classifiers filter the generated data to ensure quality.
4. **Explanation and Paraphrase Label Acquisition**: ChatGPT is used to generate detoxification explanations and paraphrase labels.
5. **Model Training**: Multiple models (such as BART, T5, and LLaMA-2) are fine-tuned for the detoxification task, and chain-of-thought (CoT) fine-tuning is introduced.

### Experimental Results

- **Cross-platform Detoxification Performance**: DetoxLLM performs strongly on cross-platform detoxification tasks, especially in terms of accuracy and fluency.
- **Non-detoxifiability Handling**: The paraphrase detector identifies non-detoxifiable cases and triggers warnings to users.
- **Explainability**: The generated explanations help improve the transparency and trustworthiness of the model.

In conclusion, the DetoxLLM framework makes significant progress on cross-platform detoxification, on handling non-detoxifiable inputs, and on model explainability.
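The data filtering step and the non-detoxifiability warning from the method overview can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the word list `TOXIC_WORDS`, the lexicon-based `toxicity_score`, and the token-overlap `paraphrase_score` are toy stand-ins for the trained toxicity classifiers and the paraphrase detector, and the function names and thresholds are hypothetical.

```python
from typing import List, Tuple

# Hypothetical toy lexicon; the paper uses trained, platform-specific classifiers.
TOXIC_WORDS = {"idiot", "stupid", "moron"}
PUNCT = ".,!?;:\"'"


def tokens(text: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    stripped = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in stripped if t]


def toxicity_score(text: str) -> float:
    """Toy stand-in for a toxicity classifier: share of tokens in the lexicon."""
    toks = tokens(text)
    if not toks:
        return 0.0
    return sum(t in TOXIC_WORDS for t in toks) / len(toks)


def paraphrase_score(source: str, target: str) -> float:
    """Toy stand-in for a paraphrase detector: Jaccard overlap of non-toxic tokens."""
    a = set(tokens(source)) - TOXIC_WORDS
    b = set(tokens(target)) - TOXIC_WORDS
    union = a | b
    if not union:
        return 0.0  # no non-toxic content to preserve
    return len(a & b) / len(union)


def filter_pairs(
    pairs: List[Tuple[str, str]], tox_max: float = 0.0
) -> List[Tuple[str, str]]:
    """Keep pairs whose source scores as toxic and whose target scores as non-toxic."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if toxicity_score(src) > tox_max and toxicity_score(tgt) <= tox_max
    ]


def detoxify_with_warning(
    source: str, candidate: str, para_min: float = 0.5
) -> Tuple[str, bool]:
    """Return the candidate and a flag that is True when meaning was likely altered."""
    return candidate, paraphrase_score(source, candidate) < para_min


if __name__ == "__main__":
    generated = [
        ("You are a stupid idiot for saying that", "You are wrong for saying that"),
        ("This is stupid", "This is stupid"),  # target still toxic
    ]
    print(filter_pairs(generated))
    print(detoxify_with_warning("Idiot!", "Hello there"))
```

The sketch keeps the two checks separate on purpose: filtering acts on the generated corpus before training, while the paraphrase check acts on a single model output at inference time, where a low score is surfaced to the user as a warning instead of silently dropping the output.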

DetoxLLM: A Framework for Detoxification with Explanations
