Probing LLMs for hate speech detection: strengths and vulnerabilities

Sarthak Roy, Ashish Harshavardhan, Animesh Mukherjee, Punyajoy Saha
2023-10-28
Abstract: Recently, efforts have been made by social media platforms as well as researchers to detect hateful or toxic language using large language models. However, none of these works aim to use explanation, additional context and victim community information in the detection process. We utilise different prompt variations and input information, and evaluate large language models in a zero-shot setting (without adding any in-context examples). We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets: HateXplain, Implicit Hate and ToxicSpans. We find that on average including the target information in the pipeline improves the model performance substantially (~20-30%) over the baseline across the datasets. There is also a considerable effect of adding the rationales/explanations into the pipeline (~10-20%) over the baseline across the datasets. In addition, we provide a typology of the error cases where these large language models fail to (i) classify and (ii) explain the reason for the decisions they take. Such vulnerable points automatically constitute 'jailbreak' prompts for these models, and industry-scale safeguard techniques need to be developed to make the models robust against such prompts.
Computation and Language, Computers and Society
What problem does this paper attempt to address?
This paper addresses how to improve hate speech detection with large language models (LLMs) by supplying explanations, additional context, and information about the victim community. The researchers evaluated three LLMs (GPT-3.5, text-davinci, and Flan-T5) on three datasets (HateXplain, Implicit Hate, and ToxicSpans) in a zero-shot setting, using different prompt variants and input information. They found that including target information and explanations improves model performance. They also analyzed the types of errors the models make when classifying posts and when explaining their decisions, which identifies the models' vulnerabilities and areas for improvement.

### Main Research Questions:
1. **How to use explanations and additional contextual information to improve the effectiveness of hate speech detection**: The researchers varied the prompts and the input information given to the models to measure how explanations and additional context affect performance on hate speech detection tasks.
2. **Model vulnerabilities and error types**: The researchers analyzed common errors in the models' classification and explanation decisions, noted that such error cases can act as "jailbreak" prompts, and highlighted the need for industry-scale safeguard techniques to make the models more robust.

### Research Background:
- **Social Issue**: Hate speech and toxic content on online social media have become a persistent problem, leading to harassment, abuse, and cyberbullying of many users, and even triggering violent incidents.
- **Limitations of Existing Methods**: Existing content moderation methods often rely on manual annotation, which is resource-intensive and can place a psychological burden on annotators. Researchers have therefore begun exploring the use of large language models to detect hate speech automatically.

### Research Methods:
- **Datasets**: The researchers used three datasets: HateXplain, Implicit Hate, and ToxicSpans, which contain detailed labels and explanation information.
- **Models**: The researchers selected three large language models: GPT-3.5, text-davinci, and Flan-T5.
- **Prompt Variants**: The researchers designed several prompt variants, ranging from prompts containing only the post to prompts that add definitions, target community information, and explanations.

### Main Findings:
- **Prompts containing target information and explanation information improved model performance**: According to the abstract, adding target information improves performance by roughly 20-30% over the baseline on average, and adding rationales/explanations by roughly 10-20%. The gains were most pronounced on the HateXplain and ToxicSpans datasets.
- **Model vulnerabilities**: The models are prone to errors when dealing with implicit hate and non-hate content, particularly in the presence of sensitive or controversial terms, negations, and words expressing support.
- **Error types**: Common errors in the models' classification and explanation decisions include misclassifying non-hate content as implicit hate and misclassifying normal content as offensive content.

### Conclusion:
- **Importance of target information and explanation information**: The results indicate that including target information and explanation information can substantially improve the performance of large language models in hate speech detection tasks.
- **Future Research Directions**: Future research should focus on further enhancing model robustness, particularly by developing safeguard techniques that address the models' vulnerabilities.
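
To make the prompt variants described under Research Methods more concrete, the following is a minimal sketch of how zero-shot prompts with optional definition, target community, and explanation fields might be assembled. The names `PromptInputs` and `build_prompt`, the field labels, the prompt wording, and the example label set are all illustrative assumptions; they are not the authors' actual prompts or code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PromptInputs:
    """Hypothetical container for the pieces a prompt variant may include."""
    post: str                           # the post to classify (always present)
    definition: Optional[str] = None    # definition of the label(s), if used
    target: Optional[str] = None        # targeted community, if used
    explanation: Optional[str] = None   # rationale/explanation text, if used


def build_prompt(inputs: PromptInputs, labels: Sequence[str]) -> str:
    """Assemble a zero-shot prompt; no in-context examples are added."""
    parts = []
    if inputs.definition:
        parts.append(f"Definition: {inputs.definition}")
    parts.append(f"Post: {inputs.post}")
    if inputs.target:
        parts.append(f"Target community: {inputs.target}")
    if inputs.explanation:
        parts.append(f"Explanation: {inputs.explanation}")
    # The instruction wording below is an assumption, not the paper's wording.
    parts.append("Classify the post as one of: " + ", ".join(labels) + ".")
    parts.append("Answer:")
    return "\n".join(parts)


if __name__ == "__main__":
    # Assumed label set for illustration only.
    labels = ["hate speech", "offensive", "normal"]
    baseline = PromptInputs(post="<post text>")
    with_target = PromptInputs(post="<post text>", target="<community>")
    print(build_prompt(baseline, labels))
    print()
    print(build_prompt(with_target, labels))
```

Keeping every extra field optional means the baseline (post only) and each enriched variant come from the same function, so any difference in model output can be attributed to the added information rather than to unrelated changes in wording.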