Content Moderation by LLM: From Accuracy to Legitimacy

Tao Huang
2024-09-05
Abstract: One trending application of LLM (large language model) is to use it for content moderation in online platforms. Most current studies on this application have focused on the metric of accuracy - the extent to which LLM makes correct decisions about content. This article argues that accuracy is insufficient and misleading, because it fails to grasp the distinction between easy cases and hard cases as well as the inevitable trade-offs in achieving higher accuracy. Closer examination reveals that content moderation is a constitutive part of platform governance, the key of which is to gain and enhance legitimacy. Instead of making moderation decisions correct, the chief goal of LLM is to make them legitimate. In this regard, this article proposes a paradigm shift from the single benchmark of accuracy towards a legitimacy-based framework of evaluating the performance of LLM moderators. The framework suggests that for easy cases, the key is to ensure accuracy, speed and transparency, while for hard cases, what matters is reasoned justification and user participation. Examined under this framework, LLM's real potential in moderation is not accuracy improvement. Rather, LLM can better contribute in four other aspects: to conduct screening of hard cases from easy cases, to provide quality explanations for moderation decisions, to assist human reviewers in getting more contextual information, and to facilitate user participation in a more interactive way. Using normative theories from law and social sciences to critically assess the new technological application, this article seeks to redefine LLM's role in content moderation and redirect relevant research in this field.
Computers and Society, Artificial Intelligence, Emerging Technologies, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper argues that evaluating large language models (LLMs) as content moderators on accuracy alone is inadequate, and that evaluation should instead center on legitimacy. Specifically, the author argues:

1. **Limitations of Accuracy**: Most current research focuses on the accuracy of LLMs in content moderation, i.e., the extent to which LLMs make correct moderation decisions. The author considers this single metric insufficient and misleading, because it ignores the distinction between easy and hard cases and the trade-offs that come with pursuing higher accuracy. Content moderation is not just a technical issue; it also involves balancing various rights and interests, which may be understood differently in different cultural contexts.
2. **Necessity of Legitimacy**: Content moderation is a constitutive part of platform governance, whose core goal is to gain and enhance legitimacy. Legitimacy includes not only accuracy but also transparency, speed, the reasoning behind decisions, and procedural fairness. The author therefore proposes a shift from a single accuracy metric to a legitimacy-based evaluation framework.
3. **Handling Different Case Types**: The author distinguishes between easy and hard cases and proposes different evaluation standards for each. For easy cases, the focus is on ensuring accuracy, speed, and transparency; for hard cases, what matters is reasoned justification and user participation.
4. **True Potential of LLMs**: Under the new framework, the main advantages of LLMs lie not in improving accuracy but in the following aspects:
   - Screening hard cases from easy cases
   - Providing high-quality explanations for moderation decisions
   - Assisting human reviewers in obtaining more contextual information
   - Facilitating more interactive user participation

By repositioning the role of LLMs in content moderation, the author hopes to redirect researchers and developers from merely pursuing accuracy toward enhancing the overall legitimacy of platform governance. This would make better use of the technical strengths of LLMs and address the challenges of content moderation more comprehensively.
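
The two-track idea summarized above can be illustrated with a small routing sketch. This is only an illustration under stated assumptions, not the paper's method: the paper does not specify how hard cases are detected, so the confidence threshold used for screening here is hypothetical, as are the names `Track`, `Decision`, `triage`, and `moderate`. The sketch shows only the shape of the framework: easy cases get a quick decision with a stated reason, while hard cases are escalated with a written justification, human review, and an invitation for user input.

```python
from dataclasses import dataclass
from enum import Enum


class Track(Enum):
    EASY = "easy"
    HARD = "hard"


@dataclass
class Decision:
    track: Track
    action: str                      # "remove", "keep", or "escalate"
    explanation: str                 # reason shown to the user (transparency)
    needs_human_review: bool = False
    invites_user_input: bool = False


def triage(confidence: float, threshold: float = 0.9) -> Track:
    """Hypothetical screening rule: low classifier confidence marks a hard case."""
    return Track.EASY if confidence >= threshold else Track.HARD


def moderate(label: str, confidence: float, rationale: str,
             threshold: float = 0.9) -> Decision:
    """Route one item. `label`, `confidence`, and `rationale` are assumed to
    come from an upstream LLM moderator that is not modeled here."""
    track = triage(confidence, threshold)
    if track is Track.EASY:
        # Easy case: decide immediately and attach the stated reason.
        action = "remove" if label == "violating" else "keep"
        return Decision(track, action, explanation=rationale)
    # Hard case: no automatic outcome; escalate with justification,
    # human review, and a channel for the affected user to respond.
    return Decision(track, "escalate", explanation=rationale,
                    needs_human_review=True, invites_user_input=True)
```

For example, `moderate("violating", 0.97, "Matches the spam policy")` yields an easy-track removal, while the same label at confidence `0.55` is escalated for human review. A real system would likely need a richer notion of "hard" than a single score, such as contested norms or missing context.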