NoisyHate: Benchmarking Content Moderation Machine Learning Models with Human-Written Perturbations Online

Yiran Ye, Thai Le, Dongwon Lee
2023-03-18
Abstract: Online texts with toxic content are a threat on social media and can lead to cyber harassment. Although many platforms have applied measures, such as machine learning-based hate-speech detection systems, to diminish their effect, publishers of toxic content can still evade these systems by modifying the spelling of toxic words. These modified words are also known as human-written text perturbations. Many research works have developed techniques to generate adversarial samples that help machine learning models learn to recognize those perturbations. However, there is still a gap between machine-generated perturbations and human-written perturbations. In this paper, we introduce a benchmark test set containing online human-written perturbations for toxic speech detection models. We also recruited a group of workers to evaluate the quality of this test set and dropped low-quality samples. To check whether our perturbations can be normalized to their clean versions, we applied spell-corrector algorithms to this dataset. Finally, we test this data on state-of-the-art language models, such as BERT and RoBERTa, and on black-box APIs, such as Perspective API, to demonstrate that adversarial attacks with real human-written perturbations are still effective.
Machine Learning, Artificial Intelligence, Computation and Language, Computers and Society
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the detection of toxic content (such as hate speech) on social media platforms. Although many platforms have implemented measures such as machine learning-based hate speech detection systems to reduce the impact of toxic content, malicious users circumvent these systems by modifying the spelling of toxic words (i.e., human-written text perturbations). To help build detection systems that recognize these perturbations, existing methods generate diverse adversarial examples algorithmically. However, machine-generated perturbations do not necessarily capture all the characteristics of human-written ones.

To close this gap, the paper introduces a benchmark dataset called **NoisyHate**, which contains real human-written perturbations collected from various social platforms, to help develop better toxic speech detection models. The paper also tests state-of-the-art language models (such as BERT and RoBERTa) and commercial toxic content detection APIs (such as Perspective API) to demonstrate that adversarial attacks using real human-written perturbations remain effective.

### Main Contributions

1. **Introduction of the novel benchmark dataset NoisyHate**: This dataset contains online human-written perturbations for toxic speech detection models.
2. **Testing the NoisyHate dataset with various spell checkers**: Demonstrates the importance of developing better normalization tools for these online human-written perturbations.
3. **Evaluation of state-of-the-art language models and commercial toxic detection APIs**: Reveals that these models still have room for improvement when predicting texts containing human-written perturbations.
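
To make the evasion problem concrete, the following is a minimal sketch of how a spelling perturbation can slip past a naive detector and how a simple normalization step recovers some, but not all, cases. It is not the paper's method: the blocklist, the substitution table, and the helper names `normalize` and `is_flagged` are assumptions chosen for illustration, and the example strings are invented rather than taken from NoisyHate.

```python
import re

# Hypothetical character substitutions, chosen for illustration only;
# this table is NOT taken from the NoisyHate dataset.
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

# Toy blocklist standing in for a real toxicity classifier.
BLOCKLIST = {"idiot", "stupid"}


def normalize(text: str) -> str:
    """Undo a few common character swaps and collapse stretched letters."""
    text = text.lower().translate(SUBSTITUTIONS)
    # Collapse runs of three or more identical letters to a single letter.
    return re.sub(r"([a-z])\1{2,}", r"\1", text)


def is_flagged(text: str) -> bool:
    """Naive detector: flag the text if any alphabetic token is blocklisted."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in BLOCKLIST for token in tokens)


if __name__ == "__main__":
    for sample in ["you are an 1d10t", "so stuuupid", "so st*pid"]:
        print(sample, is_flagged(sample), is_flagged(normalize(sample)))
```

In this sketch the first two samples evade the raw detector and are caught after normalization, while the masked spelling `st*pid` evades both. Fixed rules of this kind cover only the perturbation patterns someone anticipated, which is one way to read the paper's argument for benchmarking against perturbations that people actually wrote.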