Toxicity Detection for Free

Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, David Wagner
2024-05-29
Abstract: Current LLMs are generally aligned to follow safety requirements and tend to refuse toxic prompts. However, LLMs can fail to refuse toxic prompts or be overcautious and refuse benign examples. In addition, state-of-the-art toxicity detectors have low TPRs at low FPR, incurring high costs in real-world applications where toxic examples are rare. In this paper, we explore Moderation Using LLM Introspection (MULI), which detects toxic prompts using the information extracted directly from LLMs themselves. We found significant gaps between benign and toxic prompts in the distribution of alternative refusal responses and in the distribution of the first response token's logits. These gaps can be used to detect toxicities: We show that a toy model based on the logits of specific starting tokens gets reliable performance, while requiring no training or additional computational cost. We build a more robust detector using a sparse logistic regression model on the first response token logits, which greatly exceeds SOTA detectors under multiple metrics.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of detecting toxic prompts sent to large language models (LLMs). Current LLMs are aligned to follow safety requirements and tend to refuse toxic prompts, but they sometimes fail to refuse toxic prompts and sometimes are overcautious and refuse benign ones. Existing state-of-the-art toxicity detectors also have a low true positive rate (TPR) at a low false positive rate (FPR). This is costly in practice: because toxic prompts are rare, a detector must operate at a low FPR to avoid flagging large numbers of benign prompts, and at that operating point it misses many toxic ones. The paper proposes Moderation Using LLM Introspection (MULI), which detects toxic prompts using information extracted directly from the LLM itself. Specifically, the researchers found significant gaps between benign and toxic prompts in two distributions: the distribution of alternative refusal responses, and the distribution of the logits of the first response token. These gaps can be used to detect toxicity. A toy model based on the logits of specific starting tokens achieves reliable performance while requiring no training and no additional computational cost. The authors then build a more robust detector, a sparse logistic regression model on the first response token logits, which greatly exceeds existing state-of-the-art detectors under multiple metrics.
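To make the two ideas in this summary concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes that the first response token's logits are available as a plain list and that the vocabulary indices of refusal-style starting tokens (for example, a token such as "Sorry") are known. The names `refusal_score`, `tpr_at_fpr`, and `refusal_token_ids` are hypothetical and chosen only for illustration. The sketch scores a prompt by the probability mass on those starting tokens, then measures TPR at a fixed FPR budget. The paper's stronger detector, a sparse logistic regression over the logits, is not shown.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refusal_score(first_token_logits, refusal_token_ids):
    """Toy toxicity score: probability that the response starts with a
    refusal-style token. `refusal_token_ids` is an assumed input; real
    indices depend on the tokenizer of the model in use."""
    probs = softmax(first_token_logits)
    return sum(probs[i] for i in refusal_token_ids)


def tpr_at_fpr(scores, labels, max_fpr):
    """TPR at the loosest threshold whose FPR stays within `max_fpr`.

    `labels` uses 1 for toxic and 0 for benign. A prompt is flagged
    when its score is strictly greater than the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    toxic = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not toxic:
        raise ValueError("need at least one benign and one toxic example")
    allowed_fp = math.floor(max_fpr * len(benign))
    if allowed_fp >= len(benign):
        threshold = -math.inf  # every benign prompt may be flagged
    else:
        # At most `allowed_fp` benign scores are strictly above this value.
        threshold = benign[allowed_fp]
    return sum(s > threshold for s in toxic) / len(toxic)
```

In use, one would compute `refusal_score` for each prompt in a labeled evaluation set and pass the resulting scores to `tpr_at_fpr` with a small budget such as `max_fpr=0.001`, which mirrors the low-FPR regime the summary describes.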