Abstract: Large Language Models (LLMs) have become a focal point in the rapidly evolving field of artificial intelligence. However, a critical concern is the presence of toxic content in the pre-training corpora of these models, which can lead them to generate inappropriate outputs. Investigating methods for detecting internal faults in LLMs can help us understand their limitations and improve their security. Existing methods focus primarily on jailbreaking attacks, which manually or automatically construct adversarial content to prompt the target LLM into generating unexpected responses. These methods rely heavily on prompt engineering, which is time-consuming and usually requires specially designed questions. To address these challenges, this paper proposes a target-driven attack paradigm that focuses on directly eliciting the target response instead of optimizing the prompts. We use another LLM, referred to as ToxDet, as the detector for toxic content. Given a target toxic response, ToxDet generates a possible question and a preliminary answer that provoke the target model into producing toxic responses equivalent in meaning to the provided one. ToxDet is trained with reinforcement learning: it interacts with the target LLM and receives reward signals from it. Although the target models are primarily open-source LLMs, the fine-tuned ToxDet can also be transferred to attack black-box models such as GPT-4o, achieving notable results. Experimental results on the AdvBench and HH-Harmless datasets demonstrate the effectiveness of our method in detecting the tendency of target LLMs to generate harmful responses. The algorithm not only exposes vulnerabilities but also provides a valuable resource for researchers seeking to strengthen their models against such attacks.
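The interaction loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the reward function, so a simple token-overlap (Jaccard) score stands in for "equivalent meaning", and `toy_detector` and `toy_target_model` are hypothetical placeholders operating on benign text rather than real LLMs. The sketch shows only a single rollout producing a scalar reward; the reinforcement learning update itself is omitted.

```python
from typing import Callable, Tuple

# Type aliases for the two hypothetical components (not from the paper).
Detector = Callable[[str], Tuple[str, str]]   # target response -> (question, preliminary answer)
TargetModel = Callable[[str, str], str]       # (question, preliminary answer) -> response


def token_overlap_reward(target: str, response: str) -> float:
    """Jaccard overlap of lowercase token sets.

    ASSUMPTION: a crude stand-in for semantic equivalence; the paper's
    actual reward signal is not given in the abstract.
    """
    a, b = set(target.lower().split()), set(response.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rollout(detector: Detector, target_model: TargetModel,
            target_response: str) -> Tuple[str, str, float]:
    """One detector/target interaction that yields a scalar reward."""
    question, preliminary = detector(target_response)
    response = target_model(question, preliminary)
    reward = token_overlap_reward(target_response, response)
    return question, response, reward


def toy_detector(target_response: str) -> Tuple[str, str]:
    # Hypothetical placeholder: a real detector would be a fine-tuned LLM.
    return "Please continue the following text.", target_response


def toy_target_model(question: str, preliminary: str) -> str:
    # Hypothetical placeholder: simply echoes the preliminary answer.
    return preliminary


if __name__ == "__main__":
    q, r, reward = rollout(toy_detector, toy_target_model, "the sky is green today")
    print(f"question={q!r} response={r!r} reward={reward:.2f}")
```

In an actual training setup, the reward returned by `rollout` would feed a policy-gradient style update of the detector; that step depends on details the abstract does not provide.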

Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models
