Abstract: Recently, social media platforms and researchers have made efforts to detect hateful or toxic language using large language models. However, none of these works aims to use explanations, additional context, and victim community information in the detection process. We use different prompt variations and input information to evaluate large language models in a zero-shot setting (without adding any in-context examples). We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets: HateXplain, Implicit Hate and ToxicSpans. We find that, on average, including the target information in the pipeline improves model performance substantially (~20-30%) over the baseline across the datasets. Adding the rationales/explanations into the pipeline also has a considerable effect (~10-20%) over the baseline across the datasets. In addition, we provide a typology of the error cases where these large language models fail to (i) classify and (ii) explain the reason for the decisions they take. Such vulnerable points automatically constitute 'jailbreak' prompts for these models, and industry-scale safeguard techniques need to be developed to make the models robust against such prompts.

What problem does this paper attempt to address?

This paper attempts to address the problem of how to improve the detection of hate speech by leveraging large language models (LLMs) combined with explanations, additional contextual information, and information about the victimized community. Specifically, the researchers evaluated the performance of three large language models (GPT-3.5, text-davinci, and Flan-T5) on three datasets (HateXplain, Implicit Hate, and ToxicSpans) in a zero-shot setting using different prompt variants and input information. They found that including target information and explanation information can significantly improve model performance and also analyzed the types of errors these models make in classification and explanation decisions, thereby identifying the models' vulnerabilities and areas for improvement.

### Main Research Questions:

1. **How to use explanations and additional contextual information to improve the effectiveness of hate speech detection**: The researchers explored how to use explanations and additional contextual information to improve the performance of large language models in hate speech detection tasks through different prompt variants and input information.
2. **Model vulnerabilities and error types**: The researchers conducted a detailed analysis of common errors in the models' classification and explanation decisions, proposing potential "jailbreak" prompts these errors might constitute and highlighting the need to develop industry-scale safety technologies to enhance model robustness.

### Research Background:

- **Social Issue**: Hate speech and toxic content on online social media have become a persistent problem, leading to harassment, abuse, and cyberbullying of many users, and even triggering violent incidents.
- **Limitations of Existing Methods**: Existing content moderation methods often rely on manual annotation, which is resource-intensive and can cause psychological burden to annotators. Therefore, researchers have begun exploring the possibility of using large language models to automatically detect hate speech.

### Research Methods:

- **Datasets**: The researchers used three datasets: HateXplain, Implicit Hate, and ToxicSpans, which contain detailed labels and explanation information.
- **Models**: The researchers selected three large language models: GPT-3.5, text-davinci, and Flan-T5.
- **Prompt Variants**: The researchers designed various prompt variants, including those containing only hate posts, definitions, target community information, and explanations.

### Main Findings:

- **Prompts containing target information and explanation information significantly improved model performance**: On multiple datasets, prompts containing target information and explanation information significantly improved model performance, especially on the HateXplain and ToxicSpans datasets.
- **Model vulnerabilities**: The researchers found that models are prone to errors when dealing with implicit hate and non-hate content, particularly in the presence of sensitive or controversial terms, negations, and words expressing support.
- **Error types**: Common errors in the models' classification and explanation decisions include misclassifying non-hate content as implicit hate and misclassifying normal content as offensive content.

### Conclusion:

- **Importance of target information and explanation information**: The research results indicate that including target information and explanation information can significantly improve the performance of large language models in hate speech detection tasks.
- **Future Research Directions**: Future research should focus on further enhancing model robustness, particularly by developing safety technologies to address the models' vulnerabilities.
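
The prompt variants described in the summary (post only, with a definition, with target community information, with an explanation) could be assembled roughly as in the following Python sketch. This is only an illustration under stated assumptions: the function names `build_prompt` and `prompt_variants`, the field labels, the instruction wording, and the label set are all hypothetical and are not taken from the paper, which may phrase and combine its prompts differently.

```python
from typing import Dict, Optional

# Hypothetical label set; the real labels differ per dataset.
LABELS = ("hate", "not hate")


def build_prompt(
    post: str,
    definition: Optional[str] = None,
    target: Optional[str] = None,
    explanation: Optional[str] = None,
) -> str:
    """Assemble a zero-shot prompt: no in-context examples are included."""
    parts = []
    if definition:
        parts.append(f"Definition: {definition}")
    parts.append(f"Post: {post}")
    if target:
        parts.append(f"Target community: {target}")
    if explanation:
        parts.append(f"Explanation: {explanation}")
    # Hypothetical instruction wording, not the paper's prompt text.
    parts.append(f"Classify the post as one of: {', '.join(LABELS)}.")
    return "\n".join(parts)


def prompt_variants(
    post: str, definition: str, target: str, explanation: str
) -> Dict[str, str]:
    """Return one prompt per variant, each adding a single extra input."""
    return {
        "post_only": build_prompt(post),
        "with_definition": build_prompt(post, definition=definition),
        "with_target": build_prompt(post, target=target),
        "with_explanation": build_prompt(post, explanation=explanation),
    }
```

Each variant differs from the post-only baseline by one additional input, which is what makes it possible to attribute a change in model performance to that input alone.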

Probing LLMs for hate speech detection: strengths and vulnerabilities

Decoding Hate: Exploring Language Models' Reactions to Hate Speech

An Investigation of Large Language Models for Real-World Hate Speech Detection

Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales

Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models

Hate Personified: Investigating the role of LLMs in content moderation

HateRephrase: Zero- and Few-Shot Reduction of Hate Intensity in Online Posts using Large Language Models

Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech Detection

Incorporating Human Explanations for Robust Hate Speech Detection

Efficient Models for the Detection of Hate, Abuse and Profanity

Supporting Human Raters with the Detection of Harmful Content using Large Language Models

Recent Advances in Hate Speech Moderation: Multimodality and the Role of Large Models

HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection

HateTinyLLM : Hate Speech Detection Using Tiny Large Language Models

$\textit{Who Speaks Matters}$: Analysing the Influence of the Speaker's Ethnicity on Hate Classification

Exploring Transformer Based Models to Identify Hate Speech and Offensive Content in English and Indo-Aryan Languages

Investigating Annotator Bias in Large Language Models for Hate Speech Detection

Watch Your Language: Investigating Content Moderation with Large Language Models

HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models

Probing Critical Learning Dynamics of PLMs for Hate Speech Detection

HARE: Explainable Hate Speech Detection with Step-by-Step Reasoning