NoisyHate: Benchmarking Content Moderation Machine Learning Models with Human-Written Perturbations Online

Yiran Ye, Thai Le, Dongwon Lee

2023-03-18

Abstract: Online texts with toxic content are a threat on social media and can lead to cyber harassment. Although many platforms have applied measures, such as machine learning-based hate-speech detection systems, to diminish their effect, publishers of toxic content can still evade these systems by modifying the spelling of toxic words. These modified words are known as human-written text perturbations. Many research works have developed techniques to generate adversarial samples that help machine learning models learn to recognize such perturbations. However, there is still a gap between machine-generated perturbations and human-written perturbations. In this paper, we introduce a benchmark test set containing human-written perturbations found online for toxic speech detection models. We also recruited a group of workers to evaluate the quality of this test set and dropped low-quality samples. To check whether our perturbations can be normalized to their clean versions, we applied spell-corrector algorithms to this dataset. Finally, we test this data on state-of-the-art language models, such as BERT and RoBERTa, and black-box APIs, such as Perspective API, to demonstrate that adversarial attacks with real human-written perturbations are still effective.

Machine Learning, Artificial Intelligence, Computation and Language, Computers and Society

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the issue of detecting toxic content (such as hate speech) on social media platforms. Although many platforms have already implemented various measures (e.g., machine learning-based hate speech detection systems) to reduce the impact of such toxic content, malicious users circumvent these detection systems by cleverly modifying toxic words (i.e., human-written text perturbations). To help build AI detection systems capable of recognizing these perturbations, existing methods have developed sophisticated techniques to generate diverse adversarial examples. However, the perturbations generated by these algorithms do not necessarily capture all the characteristics of human-written perturbations.

To address this issue, the paper introduces a benchmark dataset called **NoisyHate**, which contains perturbations created from real human-written perturbations on various social platforms, to help develop better toxic speech detection models. Additionally, the paper tests state-of-the-art language models (such as BERT and RoBERTa) and commercial toxic content detection APIs (such as Perspective API) to demonstrate the effectiveness of adversarial attacks using real human-written perturbations.

### Main Contributions

1. **Introduction of the novel benchmark dataset NoisyHate**: This dataset contains online human-written perturbations for toxic speech detection models.
2. **Testing the NoisyHate dataset with various spell checkers**: Demonstrates the importance of developing better normalization tools for these online human-written perturbations.
3. **Evaluation of state-of-the-art language models and commercial toxic detection APIs**: Reveals that these models still have room for improvement when predicting texts containing human-written perturbations.
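
To make the evasion and normalization ideas above concrete, here is a minimal Python sketch. It is not the paper's pipeline: the word list, the substitution table, the sample sentences, and the helper names (`normalize`, `is_toxic`, `detection_rate`) are all hypothetical, and a toy lexicon lookup stands in for a real classifier such as BERT or Perspective API. It only illustrates how spelling perturbations can slip past a detector and how a simple normalizer recovers some, but not all, of them.

```python
import re

# Hypothetical substitution table for common character swaps (an assumption,
# not taken from the paper).
LEET_MAP = str.maketrans({"1": "i", "3": "e", "4": "a", "@": "a", "0": "o", "$": "s"})

# Toy lexicon standing in for a real toxicity classifier.
TOXIC_WORDS = {"idiot", "stupid"}


def normalize(text: str) -> str:
    """Rule-based normalizer: undo a few common spelling perturbations."""
    text = text.lower().translate(LEET_MAP)
    # Drop separators inserted inside a word, e.g. "i-d-i-o-t" -> "idiot".
    text = re.sub(r"(?<=\w)[.\-_*](?=\w)", "", text)
    # Collapse runs of three or more identical characters, e.g. "stuuupid" -> "stupid".
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return text


def is_toxic(text: str) -> bool:
    """Toy detector: flag a text if any alphabetic token is in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(tok in TOXIC_WORDS for tok in tokens)


def detection_rate(samples, preprocess=None):
    """Fraction of samples flagged as toxic, optionally after preprocessing."""
    flagged = 0
    for sample in samples:
        if preprocess is not None:
            sample = preprocess(sample)
        flagged += is_toxic(sample)
    return flagged / len(samples)


# Illustrative samples only; these are not drawn from NoisyHate.
clean = ["you are an idiot", "that is stupid"]
perturbed = ["you are an id1ot", "that is stuuupid", "what an i-d-i-o-t", "so st00pid"]

print("clean:", detection_rate(clean))
print("perturbed:", detection_rate(perturbed))
print("perturbed + normalize:", detection_rate(perturbed, normalize))
```

In this toy setup the detector catches every clean sample and none of the perturbed ones. Normalization recovers three of the four perturbed samples, while "st00pid" becomes "stoopid" and still escapes, which loosely mirrors the summary's point that better normalization tools are needed.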
