Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales

Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, Huan Liu
2024-05-08
Abstract: Although social media platforms are a prominent arena for users to engage in interpersonal discussions and express opinions, the facade and anonymity offered by social media may allow users to spew hate speech and offensive content. Given the massive scale of such platforms, there arises a need to automatically identify and flag instances of hate speech. Although several hate speech detection methods exist, most of these black-box methods are not interpretable or explainable by design. To address the lack of interpretability, in this paper, we propose to use state-of-the-art Large Language Models (LLMs) to extract features in the form of rationales from the input text, to train a base hate speech classifier, thereby enabling faithful interpretability by design. Our framework effectively combines the textual understanding capabilities of LLMs and the discriminative power of state-of-the-art hate speech classifiers to make these classifiers faithfully interpretable. Our comprehensive evaluation on a variety of English language social media hate speech datasets demonstrates: (1) the goodness of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability. All code and data will be made available at
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses hate speech detection on social media platforms, focusing on the interpretability and transparency of existing detection methods. Although many hate speech detection methods exist, most are black-box models that are not interpretable by design. To address this shortcoming, the paper proposes a framework that uses large language models (LLMs) to extract rationales from the input text and uses those rationales to train a base hate speech classifier, thereby achieving faithful interpretability by design.

### Specific Problems and Solutions

1. **Problem Background**:
   - Social media platforms have become important venues for users to exchange opinions and express views, but the facade and anonymity they offer may lead users to post hate speech and offensive content.
   - Given the vast scale of these platforms, hate speech needs to be identified and flagged automatically.
   - Existing hate speech detection methods detect well but mostly lack interpretability, which is a significant drawback in a sensitive task.

2. **Research Objectives**:
   - Propose a framework that uses state-of-the-art LLMs to extract rationales from input text, making base hate speech detection models interpretable.
   - Demonstrate the effectiveness of this framework experimentally, covering both the quality of the extracted rationales and the retention of detection performance once interpretability is ensured.

3. **Solutions**:
   - **SHIELD Framework**: This framework uses LLMs to extract rationales from input text and employs these rationales to train a base hate speech detection model. The specific steps are as follows:
     - **LLM Feature Extractor**: Use a pre-trained large language model (such as GPT-3.5) to extract features and rationales related to hate speech from the input text.
     - **Hate Speech Detector**: Use a pre-trained HateBERT model as the base detector to extract embedding representations of the input text.
     - **Feature Embedding Model**: Use a pre-trained BERT model to embed the extracted rationales into vector space.
     - **Embedding Fusion and Classification**: Concatenate the embedding representations from the base detector and the feature embedding model, then perform the final classification with a multi-layer perceptron (MLP).

4. **Experimental Results**:
   - Comprehensive evaluations were conducted on multiple English social media hate speech datasets, and the results show:
     - The rationales extracted by the LLM are of good quality and agree well with human-annotated rationales.
     - Even with interpretability ensured, the SHIELD framework retains detection performance.

### Summary

By introducing the SHIELD framework, this paper addresses the lack of interpretability in existing hate speech detection methods. By using LLMs to extract rationales, the framework makes the model more transparent while retaining detection performance across multiple datasets. This offers a direction for future research on interpretable hate speech detection.
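
The embedding fusion and classification step described above can be illustrated with a minimal sketch. This is not the authors' code: the names `concat_embeddings`, `TinyMLP`, and `classify` are hypothetical, the weights are random and untrained, and the 4-dimensional vectors are toy stand-ins for the embeddings that HateBERT (input text) and BERT (LLM-extracted rationales) would produce in practice. It only shows the shape of the computation, assuming simple concatenation followed by a one-hidden-layer MLP.

```python
import math
import random


def concat_embeddings(detector_emb, rationale_emb):
    """Fuse the two views of a post by simple concatenation."""
    return list(detector_emb) + list(rationale_emb)


class TinyMLP:
    """One-hidden-layer perceptron with a sigmoid output (illustrative only)."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Small random weights; a real model would learn these during training.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def forward(self, x):
        if len(x) != len(self.w1[0]):
            raise ValueError("input size does not match the MLP input dimension")
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Output logit squashed into a probability for the "hate" class.
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


def classify(detector_emb, rationale_emb, mlp, threshold=0.5):
    """Concatenate both embeddings and return (probability, flagged)."""
    fused = concat_embeddings(detector_emb, rationale_emb)
    prob = mlp.forward(fused)
    return prob, prob >= threshold


# Toy stand-ins (assumption): real embeddings would come from the two encoders.
detector_emb = [0.2, -0.1, 0.4, 0.0]    # embedding of the input text
rationale_emb = [0.3, 0.1, -0.2, 0.5]   # embedding of the extracted rationales

mlp = TinyMLP(in_dim=len(detector_emb) + len(rationale_emb), hidden_dim=4)
prob, flagged = classify(detector_emb, rationale_emb, mlp)
print(f"P(hate) = {prob:.3f}, flagged = {flagged}")
```

Because the classifier sees the rationale embedding alongside the text embedding, its prediction depends explicitly on the extracted rationales, which is the sense in which the summary describes the design as interpretable.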