Evaluating GPT-3 Generated Explanations for Hateful Content Moderation

Han Wang, Ming Shan Hee, Md Rabiul Awal, Kenny Tsu Wei Choo, Roy Ka-Wei Lee
2023-08-31
Abstract: Recent research has focused on using large language models (LLMs) to generate explanations for hate speech through fine-tuning or prompting. Despite the growing interest in this area, these generated explanations' effectiveness and potential limitations remain poorly understood. A key concern is that these explanations, generated by LLMs, may lead to erroneous judgments about the nature of flagged content by both users and content moderators. For instance, an LLM-generated explanation might inaccurately convince a content moderator that a benign piece of content is hateful. In light of this, we propose an analytical framework for examining hate speech explanations and conduct an extensive survey on evaluating such explanations. Specifically, we prompted GPT-3 to generate explanations for both hateful and non-hateful content, and a survey was conducted with 2,400 unique respondents to evaluate the generated explanations. Our findings reveal that (1) human evaluators rated the GPT-generated explanations as high quality in terms of linguistic fluency, informativeness, persuasiveness, and logical soundness, (2) the persuasive nature of these explanations, however, varied depending on the prompting strategy employed, and (3) this persuasiveness may result in incorrect judgments about the hatefulness of the content. Our study underscores the need for caution in applying LLM-generated explanations for content moderation. Code and results are available at https://github.com/Social-AI-Studio/GPT3-HateEval.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to evaluate the effectiveness and potential limitations of explanations generated by large language models (LLMs) in hate speech moderation. Specifically, the researchers focus on whether these LLM-generated explanations might lead users and content moderators to misjudge the nature of flagged content. For example, an explanation generated by an LLM might incorrectly convince a content moderator that a harmless piece of content is hate speech.

To explore this issue, the authors propose an analytical framework and conduct an extensive survey to evaluate explanations generated by GPT-3. They prompt GPT-3 to generate explanations for both hate and non-hate content and have 2,400 independent respondents assess the quality of these explanations.

The primary goal of the study is to answer the following three research questions:

1. **RQ1**: How do GPT-3-generated explanations for hate content perform in terms of fluency, informativeness, persuasiveness, and logical coherence?
2. **RQ2**: How persuasive are GPT-3-generated explanations, and do different prompting strategies affect their persuasiveness?
3. **RQ3**: Does using GPT-3-generated explanations lead to erroneous decisions in hate content moderation?

### Key Findings

1. **Explanation Quality**:
   - Human evaluators found that GPT-3-generated explanations performed well in terms of fluency, informativeness, persuasiveness, and logical coherence.
   - Different prompting strategies yield varying levels of persuasiveness. For instance, prompting GPT-3 to explain why a tweet is hate speech results in more persuasive explanations than merely asking for contextual information.
   - The length of the explanation also affects its persuasiveness.
2. **Persuasiveness**:
   - GPT-3-generated explanations may lead human evaluators to misclassify about 20% of tweets, either by incorrectly identifying non-hate tweets as hate speech or vice versa.
   - Providing both hate and non-hate explanations can mitigate the risk of misleading content moderators.
3. **Risks and Challenges**:
   - There is a risk of GPT-3-generated explanations misleading human moderators, especially when evaluating hate content.
   - The study emphasizes the need for caution when applying LLM-generated explanations in content moderation.

### Conclusion

The study reveals both the potential and limitations of GPT-3 in generating explanations for hate speech. While the generated explanations perform well in certain aspects, they can also lead to erroneous judgments. Therefore, the research suggests that these generated explanations should be used cautiously in practical applications and that combining multiple explanation strategies may help reduce the risk of misjudgment.
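
To make the misclassification finding concrete, the sketch below shows one way such a rate could be computed from survey responses. It is an illustration only and is not taken from the authors' repository: the `Judgment` record, the `misjudgment_rates` function, and the toy data are all hypothetical, and the numbers are invented for the example rather than drawn from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One hypothetical survey response (not the paper's actual data schema)."""
    gold_is_hateful: bool    # ground-truth label of the tweet
    judged_is_hateful: bool  # respondent's label after reading the explanation


def misjudgment_rates(judgments):
    """Return the share of responses that disagree with the gold label,
    split into the two error directions described in the summary."""
    total = len(judgments)
    if total == 0:
        raise ValueError("at least one judgment is required")
    # Benign tweet judged hateful after reading the explanation.
    benign_flagged = sum(
        1 for j in judgments if not j.gold_is_hateful and j.judged_is_hateful
    )
    # Hateful tweet judged benign after reading the explanation.
    hateful_missed = sum(
        1 for j in judgments if j.gold_is_hateful and not j.judged_is_hateful
    )
    return {
        "overall": (benign_flagged + hateful_missed) / total,
        "benign_judged_hateful": benign_flagged / total,
        "hateful_judged_benign": hateful_missed / total,
    }


# Toy data, invented for illustration: 10 responses with one error in each direction.
toy = (
    [Judgment(True, True)] * 4
    + [Judgment(False, False)] * 4
    + [Judgment(False, True), Judgment(True, False)]
)
rates = misjudgment_rates(toy)
print(rates)
```

Reporting the two error directions separately, rather than a single overall rate, matches the summary's distinction between non-hate tweets judged hateful and hateful tweets judged benign.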