Abstract: Toxicity annotators and content moderators often default to mental shortcuts when making decisions. This can lead to subtle toxicity being missed, and seemingly toxic but harmless content being over-detected. We introduce BiasX, a framework that enhances content moderation setups with free-text explanations of statements' implied social biases, and explore its effectiveness through a large-scale crowdsourced user study. We show that indeed, participants substantially benefit from explanations for correctly identifying subtly (non-)toxic content. The quality of explanations is critical: imperfect machine-generated explanations (+2.4% on hard toxic examples) help less compared to expert-written human explanations (+7.2%). Our results showcase the promise of using free-text explanations to encourage more thoughtful toxicity moderation.
### Problems the Paper Attempts to Solve
The paper addresses cognitive shortcuts and biases in online content moderation. Moderators often rely on mental shortcuts, cognitive biases, and heuristics when judging potentially toxic, offensive, or biased content. As a result, subtly toxic content can be overlooked, while content that seems toxic but is actually harmless is over-detected. These errors undermine fair treatment of minority groups and degrade the overall quality of online platforms.
To mitigate these problems, the authors propose a framework called BIASX. It augments the moderation process with free-text explanations of a statement's implied social biases, prompting moderators to think more carefully about potential bias and subtle toxicity. The authors evaluate BIASX in a large-scale crowdsourced user study and demonstrate the potential of free-text explanations for improving moderation quality.
### Main Research Questions
1. **When and how do free-text explanations improve content moderation quality?**
2. **Is the explanation format of BIASX effective in helping moderators think more carefully about their moderation decisions?**
3. **How does the quality of explanations affect their effectiveness?**
### Experimental Design
To answer the above questions, the authors designed a crowdsourced user study simulating a real content moderation environment. Participants were randomly assigned to one of four conditions and asked to label the toxicity of 30 online posts, comprising simple examples, difficult toxic examples, and difficult non-toxic examples. Each condition provided a different type of explanation aid:
- **NO-EXPL**: Participants do not see any explanations.
- **LIGHT-EXPL**: Only the target group is provided as an explanation.
- **MODEL-EXPL**: Machine-generated explanations are provided (which may be imperfect).
- **HUMAN-EXPL**: High-quality explanations manually written by experts are provided.
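The random assignment described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `assign_conditions`, the seed, and the choice to balance group sizes via a shuffled round-robin are all assumptions; the summary only states that participants were randomly assigned to the four conditions.

```python
import random
from collections import Counter

# Condition names as listed in the summary above.
CONDITIONS = ["NO-EXPL", "LIGHT-EXPL", "MODEL-EXPL", "HUMAN-EXPL"]

def assign_conditions(participant_ids, seed=0):
    """Randomly assign participants to conditions (hypothetical helper).

    Assumption: groups are kept near-equal in size; the source does not
    say whether the study balanced its groups.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Round-robin over the shuffled list keeps group sizes within one of each other.
    return {pid: CONDITIONS[i % len(CONDITIONS)] for i, pid in enumerate(shuffled)}

# Usage with made-up participant IDs.
assignment = assign_conditions([f"p{i}" for i in range(10)], seed=42)
sizes = Counter(assignment.values())
```

Shuffling first and then cycling through the conditions is one simple way to get a random yet balanced split; independent per-participant draws would also be random but could leave groups uneven.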
### Results and Discussion
1. **BIASX improved moderation quality, especially on difficult toxic examples.**
- Figure 2a shows that, compared to the NO-EXPL baseline, HUMAN-EXPL increased accuracy by 7.2% on difficult toxic examples, by 7.7% on difficult non-toxic examples, and by 4.7% overall. This indicates that explicitly pointing out implicit biases or prejudices in statements encourages content moderators to think more thoroughly about the toxicity of posts.
- Figure 4a presents a specific example illustrating that even imperfect explanations can significantly improve moderator performance.
2. **The designed explanation format effectively promoted more thorough decision-making.**
- Although BIASX increased the amount of text moderators needed to read and process, it did not significantly increase moderation time compared to LIGHT-EXPL, which only provided the target group. This suggests that providing detailed explanations does not substantially increase the cognitive load on moderators.
3. **The quality of explanations matters.**
- Compared to expert-written explanations, machine-generated explanations had a more mixed effect on moderator performance, mainly because they can be imperfect. Table 1 shows that 60% of machine explanations were accurate on difficult toxic examples, leading to a moderator accuracy of 56.4%, which is 7.7% lower than the accuracy under the HUMAN-EXPL condition.
- By contrast, expert-written explanations still significantly improved moderator performance, demonstrating the potential of high-quality explanations.
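The comparisons reported above (accuracy per condition and example category, and the gain over the NO-EXPL baseline) could be computed along these lines. This is a sketch under stated assumptions: the helper names `accuracy_by_cell` and `gain_over_baseline` are hypothetical, the records are made-up toy judgments rather than data from the study, and the gain is treated as a percentage-point difference, which the summary does not state explicitly.

```python
from collections import defaultdict

def accuracy_by_cell(records):
    """Return {(condition, category): accuracy} from (condition, category, correct) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, category, correct in records:
        totals[(condition, category)] += 1
        hits[(condition, category)] += int(correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}

def gain_over_baseline(acc, condition, category, baseline="NO-EXPL"):
    """Accuracy difference in percentage points relative to the baseline condition."""
    return 100 * (acc[(condition, category)] - acc[(baseline, category)])

# Toy, made-up judgments (NOT data from the study).
toy = [
    ("NO-EXPL", "hard-toxic", True), ("NO-EXPL", "hard-toxic", False),
    ("NO-EXPL", "hard-toxic", False), ("NO-EXPL", "hard-toxic", False),
    ("HUMAN-EXPL", "hard-toxic", True), ("HUMAN-EXPL", "hard-toxic", True),
    ("HUMAN-EXPL", "hard-toxic", False), ("HUMAN-EXPL", "hard-toxic", False),
]
acc = accuracy_by_cell(toy)
gain = gain_over_baseline(acc, "HUMAN-EXPL", "hard-toxic")
```

Keying accuracy by (condition, category) keeps easy, hard toxic, and hard non-toxic examples separate, so an improvement on one category is not masked by another.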
### Conclusion and Future Work
This study proposes the BIASX framework, which assists content moderators by providing free-text explanations of statements' implied social biases to help them think more thoroughly about their decisions. Experimental results show that adding explanations can significantly improve moderator performance on difficult toxic examples, particularly when the explanations are high-quality and expert-written. The framework serves as a proof of concept for future research on human-machine collaborative content moderation, emphasizing the importance of explaining task-specific difficulties (such as subtle biases) in free text.
### Limitations, Ethical Considerations, and Broader Impact
- **Language and Cultural Limitations**: The current study is limited to English and a US-centric perspective. Future work could explore extending BIASX to other languages and communities.
- **Sample Selection**: The study used 30 carefully selected examples. Future research could expand the study by constructing higher-quality datasets.
- **Differences in Moderator Backgrounds**: The political orientation of moderators