Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Xiaochen Li, Zheng-Xin Yong, Stephen H. Bach

2024-06-24

Abstract: Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
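For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO objective, not the authors' training code. It assumes that sequence log-probabilities of the preferred (non-toxic) and rejected (toxic) continuations have already been computed under both the trained policy and a frozen reference model. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy prefers the chosen
    continuation over the rejected one, relative to the reference model.
    All inputs are sequence log-probabilities (hypothetical values here).
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) equals softplus(-x); computed in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen log-probability and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -5.0)
```

In the setting described in the abstract, the preference pairs would be English only, and the effect on other languages is measured after training.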

Computation and Language, Artificial Intelligence, Cryptography and Security, Machine Learning

What problem does this paper attempt to address?

This paper addresses cross-lingual toxicity mitigation in multilingual large language models (LLMs). Specifically, the authors investigate whether preference tuning on English data alone can reduce toxic generations in other languages in a zero-shot setting. Existing methods require collecting toxic and non-toxic samples for each language, which is resource-intensive; the paper aims to provide a more efficient and more general way to reduce the toxicity of multilingual LLMs in open-ended generation.

The main contributions of the paper are:

1. It is the first to show that preference tuning for toxicity mitigation generalizes across languages in a zero-shot setting.
2. It identifies the dual multilinguality property of MLP (multilayer perceptron) layers, which explains the mechanism behind this cross-lingual generalization.
3. It finds a strong correlation between bilingual sentence retrieval accuracy and the cross-lingual transferability of English preference tuning to a given language.

Together, these findings provide a mechanistic explanation and offer practical guidance on evaluating and selecting languages for which English-only preference tuning is likely to transfer.

PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models

Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models

DeTox: Toxic Subspace Projection for Model Editing

From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

On the Generalization of Preference Learning with DPO

Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias

Self-Detoxifying Language Models via Toxification Reversal

Exploring Cross-lingual Textual Style Transfer with Large Multilingual Language Models

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models

Challenges in Detoxifying Language Models

Fine-grained detoxification framework via instance-level prefixes for large language models

You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content

Unveiling the Implicit Toxicity in Large Language Models

Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

Large Language Models can be Strong Self-Detoxifiers

Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models

Detoxifying Language Models Risks Marginalizing Minority Voices

Fine Tuning Large Language Models for Medicine: The Role and Importance of Direct Preference Optimization