Abstract: Hate speech has become pervasive in today's digital age. Although considerable research has addressed detecting hate speech or generating counter speech to combat hateful views, these approaches still cannot completely eliminate the potential harmful societal consequences of hate speech: even when detected, hate speech often cannot be taken down or is not taken down often enough, and it unfortunately spreads quickly, often much faster than any generated counter speech. This paper investigates a relatively new yet simple and effective approach: suggesting a rephrasing of potential hate speech content before the post is even made. We show that Large Language Models (LLMs) perform well on this task, outperforming state-of-the-art baselines such as BART-Detox. We develop four prompts, based on task description, hate definition, few-shot demonstrations, and chain-of-thought, and conduct comprehensive experiments on open-source LLMs such as LLaMA-1, LLaMA-2 chat, and Vicuna, as well as OpenAI's GPT-3.5. We propose various evaluation metrics to measure the efficacy of the generated text and to ensure that it has reduced hate intensity without drastically changing the semantic meaning of the original text. We find that LLMs with a few-shot demonstrations prompt work best at generating acceptable hate-rephrased text whose semantic meaning is similar to that of the original text. Overall, we find that GPT-3.5 outperforms the baseline and the open-source models for all the different kinds of prompts. We also perform human evaluations and, interestingly, find that the rephrasings generated by GPT-3.5 outperform even the human-generated ground-truth rephrasings in the dataset. Finally, we conduct detailed ablation studies to investigate why LLMs work satisfactorily on this task, and a failure analysis to understand the gaps.
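As a rough illustration of the four prompt ingredients named in the abstract (task description, hate definition, few-shot demonstrations, chain-of-thought), the sketch below assembles a prompt string from optional parts. The abstract does not give the actual prompt texts or say how the parts are combined, so the wording of every constant, the `build_prompt` function, and its parameters are hypothetical placeholders rather than the authors' prompts.

```python
from typing import Iterable, Tuple

# Hypothetical wording: the abstract does not reproduce the real prompt texts.
TASK_DESCRIPTION = (
    "Rephrase the post below so that it is less hateful while keeping its meaning."
)
HATE_DEFINITION = (
    "Hate speech here means language that attacks a person or group "
    "on the basis of identity."
)
COT_CUE = "First explain which parts are hateful, then give the rephrased post."


def build_prompt(
    post: str,
    demonstrations: Iterable[Tuple[str, str]] = (),
    include_definition: bool = False,
    chain_of_thought: bool = False,
) -> str:
    """Assemble a rephrasing prompt from optional parts (assumed layout)."""
    parts = [TASK_DESCRIPTION]
    if include_definition:
        parts.append(HATE_DEFINITION)
    # Few-shot demonstrations: (original, rephrased) pairs shown before the query.
    for original, rephrased in demonstrations:
        parts.append(f"Post: {original}\nRephrased: {rephrased}")
    if chain_of_thought:
        parts.append(COT_CUE)
    # The query post comes last, leaving the rephrasing for the model to fill in.
    parts.append(f"Post: {post}\nRephrased:")
    return "\n\n".join(parts)
```

Calling `build_prompt(post)` with no extra arguments yields only the task description and the query; passing a list of `(original, rephrased)` pairs adds demonstrations in the few-shot style that the abstract reports as working best.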

HateRephrase: Zero- and Few-Shot Reduction of Hate Intensity in Online Posts using Large Language Models

Decoding Hate: Exploring Language Models' Reactions to Hate Speech

An Investigation of Large Language Models for Real-World Hate Speech Detection