Abstract: Although social media platforms are a prominent arena for users to engage in interpersonal discussions and express opinions, the facade and anonymity offered by social media may allow users to spew hate speech and offensive content. Given the massive scale of such platforms, there arises a need to automatically identify and flag instances of hate speech. Although several hate speech detection methods exist, most of these black-box methods are not interpretable or explainable by design. To address the lack of interpretability, in this paper, we propose to use state-of-the-art Large Language Models (LLMs) to extract features in the form of rationales from the input text, to train a base hate speech classifier, thereby enabling faithful interpretability by design. Our framework effectively combines the textual understanding capabilities of LLMs and the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a variety of English language social media hate speech datasets demonstrates: (1) the goodness of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability. All code and data will be made available at

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses hate speech detection on social media platforms, with a focus on making existing detection methods interpretable and transparent. Although various hate speech detection methods are available, most are black-box models that are not interpretable by design. To address this shortcoming, the paper proposes a framework that uses large language models (LLMs) to extract rationales from the input text and trains a base hate speech classifier on them, thereby achieving faithful interpretability.

### Specific Problems and Solutions

1. **Problem Background**:
   - Social media platforms have become important venues for users to exchange opinions and express views, but the facade and anonymity they offer may lead users to post hate speech and offensive content.
   - Given the vast scale of these platforms, hate speech needs to be identified and flagged automatically.
   - Existing hate speech detection methods perform well but mostly lack interpretability, which is a significant drawback in sensitive tasks.
2. **Research Objectives**:
   - Propose a framework that uses state-of-the-art LLMs to extract rationales from input text, making the base hate speech detection model interpretable.
   - Evaluate the framework experimentally, covering both the quality of the extracted rationales and how well detection performance is retained once interpretability is enforced.
3. **Solutions**:
   - **SHIELD Framework**: This framework uses LLMs to extract rationales from input text and employs these rationales to train a base hate speech detection model. The specific steps are as follows:
     - **LLM Feature Extractor**: Use a pre-trained large language model (such as GPT-3.5) to extract features and rationales related to hate speech from the input text.
     - **Hate Speech Detector**: Use a pre-trained HateBERT model as the base detector to extract embedding representations of the input text.
     - **Feature Embedding Model**: Use a pre-trained BERT model to embed the extracted rationales into vector space.
     - **Embedding Fusion and Classification**: Concatenate the embedding representations from the base detector and the feature embedding model, then perform the final classification with a multi-layer perceptron (MLP).
4. **Experimental Results**:
   - Comprehensive evaluations were conducted on multiple English social media hate speech datasets, and the results show:
     - The rationales extracted by the LLM are of good quality and highly consistent with human-annotated rationales.
     - Even while ensuring interpretability, the SHIELD framework is able to maintain or even improve detection performance.

### Summary

By introducing the SHIELD framework, this paper addresses the lack of interpretability in existing hate speech detection methods. By using LLMs to extract rationales, the framework makes the model more transparent while retaining good detection performance across multiple datasets. This offers a direction for future work on interpretable hate speech detection.
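The embedding fusion and classification step listed under Solutions can be illustrated with a minimal sketch. It is not the paper's implementation: the class name `FusionMLP`, the helper functions, the tiny vector sizes, the hidden width, and the random (untrained) weights are all assumptions made for illustration. In the described framework the two input vectors would come from HateBERT (for the post) and BERT (for the LLM-extracted rationale), and the MLP would be trained on labeled data.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def concat_embeddings(detector_emb: Vector, rationale_emb: Vector) -> Vector:
    """Fuse the two views of a post by plain concatenation."""
    return detector_emb + rationale_emb


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    """Numerically stable softmax over class scores."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in: int, n_out: int,
               rng: random.Random) -> Tuple[List[Vector], Vector]:
    """Hypothetical initialisation: small random weights, zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionMLP:
    """Two-layer MLP over the concatenated embeddings (illustrative only)."""

    def __init__(self, detector_dim: int, rationale_dim: int,
                 hidden_dim: int = 8, n_classes: int = 2, seed: int = 0):
        rng = random.Random(seed)
        self.in_dim = detector_dim + rationale_dim
        self.w1, self.b1 = init_layer(self.in_dim, hidden_dim, rng)
        self.w2, self.b2 = init_layer(hidden_dim, n_classes, rng)

    def predict_proba(self, detector_emb: Vector,
                      rationale_emb: Vector) -> Vector:
        fused = concat_embeddings(detector_emb, rationale_emb)
        if len(fused) != self.in_dim:
            raise ValueError("embedding sizes do not match the model")
        hidden = relu(linear(fused, self.w1, self.b1))
        return softmax(linear(hidden, self.w2, self.b2))


# Stand-in vectors; real embeddings would have hundreds of dimensions.
detector_emb = [0.2, -0.1, 0.4, 0.0]   # assumed output of the base detector
rationale_emb = [0.3, 0.1, -0.2]       # assumed embedding of the rationale text
model = FusionMLP(detector_dim=4, rationale_dim=3)
probs = model.predict_proba(detector_emb, rationale_emb)
```

Concatenation keeps the two sources separate in the fused vector, so the classifier can weigh the post embedding and the rationale embedding independently. Because the weights here are untrained, `probs` carries no meaning beyond showing the data flow.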

Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales
