Abstract: Content moderation typically combines the efforts of human moderators and machine learning models. However, these systems often rely on data where significant disagreement occurs during moderation, reflecting the subjective nature of toxicity perception. Rather than dismissing this disagreement as noise, we interpret it as a valuable signal that highlights the inherent ambiguity of the content, an insight missed when only the majority label is considered. In this work, we introduce a novel content moderation framework that emphasizes the importance of capturing annotation disagreement. Our approach uses multitask learning, where toxicity classification serves as the primary task and annotation disagreement is addressed as an auxiliary task. Additionally, we leverage uncertainty estimation techniques, specifically Conformal Prediction, to account for both the ambiguity in comment annotations and the model's inherent uncertainty in predicting toxicity and disagreement. The framework also allows moderators to adjust thresholds for annotation disagreement, offering flexibility in determining when ambiguity should trigger a review. We demonstrate that our joint approach enhances model performance, calibration, and uncertainty estimation, while offering greater parameter efficiency and improving the review process in comparison to single-task methods.
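
To make the thresholding idea concrete, here is a minimal sketch of how a conformalized disagreement estimate could trigger a review. The abstract does not say which Conformal Prediction variant is used, so this assumes plain split conformal prediction on absolute residuals. The function names (`conformal_quantile`, `needs_review`), the disagreement score in [0, 1], and the parameters `alpha` and `threshold` are all hypothetical and do not come from the paper.

```python
import math
import numpy as np

def conformal_quantile(cal_true, cal_pred, alpha=0.1):
    """Split-conformal quantile of absolute residuals on a held-out calibration set.

    Assumption: disagreement is a real-valued score and the model outputs a
    point estimate of it; neither detail is stated in the abstract.
    """
    residuals = np.sort(np.abs(np.asarray(cal_true, dtype=float)
                               - np.asarray(cal_pred, dtype=float)))
    n = len(residuals)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return float("inf")
    return float(residuals[k - 1])

def needs_review(pred_disagreement, q, threshold):
    """Flag comments whose upper disagreement bound exceeds the moderator threshold."""
    upper = np.clip(np.asarray(pred_disagreement, dtype=float) + q, 0.0, 1.0)
    return upper > threshold

# Hypothetical calibration data: true vs. predicted disagreement scores.
cal_true = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_pred = [0.0] * 9
q = conformal_quantile(cal_true, cal_pred, alpha=0.25)
flags = needs_review([0.05, 0.3], q, threshold=0.9)
```

In this sketch, lowering `threshold` sends more comments to human review and raising it sends fewer, which is the kind of moderator-controlled flexibility the abstract describes.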

Unveiling disguised toxicity: A novel pre-processing module for enhanced content moderation

Fortifying Toxic Speech Detectors Against Veiled Toxicity

Towards Robust Toxic Content Classification

DeMod: A Holistic Tool with Explainable Detection and Personalized Modification for Toxicity Censorship

RECAST: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization

RECAST: Interactive Auditing of Automatic Toxicity Detection Models

Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts

ToxiSpanSE: An Explainable Toxicity Detection in Code Review Comments

Mitigating Text Toxicity with Counterfactual Generation

Toxicity Detection for Indic Multilingual Social Media Content

Text Detoxification using Large Pre-trained Neural Models

Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks

A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement

Impact of Sentiment Detection to Recognize Toxic and Subversive Online Comments

Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes

ToXCL: A Unified Framework for Toxic Speech Detection and Explanation

ToxCCIn: Toxic Content Classification with Interpretability

Leveraging Large Language Models and Topic Modeling for Toxicity Classification

ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information

A Critical Reflection on the Use of Toxicity Detection Algorithms in Proactive Content Moderation Systems

Protecting marginalized communities by mitigating discrimination in toxic language detection