Abstract: The widespread dissemination of hate speech, harassment, harmful and sexual content, and violence across websites and media platforms presents substantial challenges and provokes concern across many sectors of society. Governments, educators, and parents are often at odds with media platforms about how to regulate, control, and limit the spread of such content. Technologies for detecting and censoring such media content are key to addressing these challenges. Techniques from natural language processing and computer vision have been widely used to automatically identify and filter out sensitive content such as offensive language, violence, nudity, and addiction in text, images, and videos, enabling platforms to enforce content policies at scale. However, existing methods still struggle to achieve high detection accuracy with few false positives and false negatives. More sophisticated algorithms that understand the context of both text and images could therefore improve content censorship and yield a more efficient censorship system. In this paper, we evaluate existing LLM-based content moderation solutions, such as the OpenAI moderation model and Llama-Guard3, and study their ability to detect sensitive content. Additionally, we explore how well recent LLMs such as GPT, Gemini, and Llama identify inappropriate content across media outlets. Various textual and visual datasets, including X tweets, Amazon reviews, news articles, human photos, cartoons, sketches, and violence videos, are used for evaluation and comparison. The results demonstrate that LLMs outperform traditional techniques by achieving higher accuracy and lower false positive and false negative rates. This highlights the potential to integrate LLMs into websites, social media platforms, and video-sharing services for regulatory and content moderation purposes.

Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos