Incorporating Human Explanations for Robust Hate Speech Detection

Jennifer L. Chen, Faisal Ladhak, Daniel Li, Noémie Elhadad
2024-11-09
Abstract: Given the black-box nature and complexity of large transformer language models (LMs), concerns about generalizability and robustness present ethical implications for domains such as hate speech (HS) detection. Using the content-rich Social Bias Frames dataset, which contains human-annotated stereotypes, intent, and targeted groups, we develop a three-stage analysis to evaluate whether LMs faithfully assess hate speech. First, we observe the need for modeling contextually grounded stereotype intents to capture implicit semantic meaning. Next, we design a new task, Stereotype Intent Entailment (SIE), which encourages a model to contextually understand stereotype presence. Finally, through ablation tests and user studies, we find that an SIE objective improves content understanding, but challenges remain in modeling implicit intent.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the limited generalizability and robustness of large language models (LMs) used for hate speech (HS) detection. Specifically, the authors focus on improving a model's contextual understanding when detecting hate speech, in particular its ability to capture implicit social bias frames such as stereotypes and their intent. They incorporate human explanations, namely the stereotype intents annotated in the Social Bias Frames dataset, to strengthen the model's semantic alignment and thereby improve its robustness and transparency. To this end, the authors propose a new task, Stereotype Intent Entailment (SIE), which encourages the model to recognize the presence of stereotypes from context. Through comparative experiments and user studies, they find that the SIE task improves the model's content understanding, especially its performance under adversarial attacks. However, modeling implicit intent remains a challenge.