Abstract: Developing effective toxicity detection models is crucial for addressing toxic language in online discussions. This work tackles imbalanced datasets in toxicity detection by introducing a novel approach to augmenting toxic language data. We create a balanced dataset through instruction fine-tuning of Large Language Models (LLMs) combined with Reinforcement Learning from Human Feedback (RLHF). Because it is difficult to collect enough toxic samples from social media platforms to build a balanced dataset, our methodology performs sentence-level text data augmentation by paraphrasing existing samples with optimized generative LLMs. Specifically, we first fine-tune an LLM on an instruction dataset tailored to paraphrasing while maintaining semantic consistency. Next, we apply Proximal Policy Optimization (PPO) as the RL algorithm, together with a reward function, to further fine-tune (optimize) the instruction-tuned LLM. This RL process guides the model toward generating toxic responses. We use the Google Perspective API as a toxicity evaluator to score generated responses and assign rewards or penalties accordingly. Guided by PPO and the reward function, the LLM transforms minority-class samples into augmented versions. The primary goal of our methodology is to create a balanced and diverse dataset that improves the accuracy and performance of classifiers in identifying instances of the minority class. Using two publicly available toxic datasets, we compared various techniques with our proposed method for generating toxic samples; our approach produced more toxic samples than every technique we compared against.
Starting from an initial 16,225 toxic prompts, our method generated 122,951 toxic samples with a toxicity score exceeding 30%. We then trained several classifiers on the generated balanced datasets and, separately, applied a cost-sensitive learning approach to the original imbalanced dataset. Classifiers trained on data generated with our proposed method performed best. These results highlight the importance of employing RL and a data-agnostic model as a reward mechanism for augmenting toxic data, thereby enhancing the robustness of toxicity detection models.
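The scoring and filtering steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the reward formula, so `reward_from_toxicity` (score minus threshold) is an assumed shaping, and `scorer` is a hypothetical callable standing in for a toxicity evaluator such as the Perspective API, returning a score in [0, 1]. Only the 30% threshold comes from the abstract. The inverse-frequency class weights show one common form of cost-sensitive learning, which may differ from the scheme used in the paper.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence

# From the abstract: generated samples count as toxic above a 30% score.
TOXICITY_THRESHOLD = 0.30


def reward_from_toxicity(score: float, threshold: float = TOXICITY_THRESHOLD) -> float:
    """Assumed reward shaping: positive above the threshold, negative below.

    The abstract only says rewards/penalties follow the toxicity score;
    the exact formula here is an illustration, not the paper's.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("toxicity score must lie in [0, 1]")
    return score - threshold


def keep_toxic(
    paraphrases: Sequence[str],
    scorer: Callable[[str], float],  # hypothetical stand-in for a toxicity evaluator
    threshold: float = TOXICITY_THRESHOLD,
) -> List[str]:
    """Keep only generated paraphrases whose score exceeds the threshold."""
    return [text for text in paraphrases if scorer(text) > threshold]


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Cost-sensitive class weights: n_samples / (n_classes * class_count).

    Rare classes get larger weights, so errors on them cost more.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
```

In a PPO loop, the value returned by `reward_from_toxicity` would be the scalar reward attached to each generated paraphrase, while `keep_toxic` corresponds to the post-generation filter that decides which samples enter the augmented dataset. The weight formula matches the "balanced" heuristic used by common libraries such as scikit-learn.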

LLM-Based Synthetic Datasets: Applications and Limitations in Toxicity Detection

AugmenToxic: Leveraging Reinforcement Learning to Optimize LLM Instruction Fine-Tuning for Data Augmentation to Enhance Toxicity Detection

Can Open-source LLMs Enhance Data Synthesis for Toxic Detection?: An Experimental Study

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Realistic Evaluation of Toxicity in Large Language Models

Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Can LLMs Recognize Toxicity? Structured Toxicity Investigation Framework and Semantic-Based Metric

Mitigating Toxic Degeneration with Empathetic Data: Exploring the Relationship Between Toxicity and Empathy

Challenges in Detoxifying Language Models

Exploring LLMs as a Source of Targeted Synthetic Textual Data to Minimize High Confidence Misclassifications

A Target-Aware Analysis of Data Augmentation for Hate Speech Detection

Unveiling the Implicit Toxicity in Large Language Models

Generative AI for Hate Speech Detection: Evaluation and Findings

Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech Detection

Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation

DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances

ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information

Under the Surface: Tracking the Artifactuality of LLM-Generated Data

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety