Abstract: Video generation models (VGMs) have demonstrated the capability to synthesize high-quality output. It is important to understand their potential to produce unsafe content, such as violent or terrifying videos. In this work, we provide a comprehensive understanding of unsafe video generation. First, to confirm that these models can indeed generate unsafe videos, we select unsafe content generation prompts collected from 4chan and Lexica, and use three open-source SOTA VGMs to generate videos from them. After filtering out duplicates and poorly generated content, we obtain an initial set of 2112 unsafe videos from an original pool of 5607 videos. Through clustering and thematic coding analysis of these generated videos, we identify 5 unsafe video categories: Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, and Political. With IRB approval, we then recruit online participants to help label the generated videos. Based on the annotations submitted by 403 participants, we identify 937 unsafe videos from the initial video set. With the labeled information and the corresponding prompts, we create the first dataset of unsafe videos generated by VGMs. We then study possible defense mechanisms to prevent the generation of unsafe videos. Existing defense methods in image generation focus on filtering either the input prompt or the output results. We propose a new approach called Latent Variable Defense (LVD), which works within the model's internal sampling process. LVD can achieve 0.90 defense accuracy while reducing time and computing resources by 10x when sampling a large number of unsafe prompts.

What problem does this paper attempt to address?

The paper examines how readily Video Generation Models (VGMs) produce unsafe content, such as violent, terrifying, or pornographic videos, and proposes a defense mechanism to prevent the generation of such content.

The researchers first collected unsafe content generation prompts from 4chan and Lexica and ran them through three open-source state-of-the-art VGMs to build an initial collection of unsafe videos. After filtering and analysis, they identified five categories of unsafe videos: distorted/weird, terrifying, pornographic, violent/bloody, and political. Online participants then labeled these videos to confirm whether they were unsafe, yielding a labeled dataset of 937 unsafe videos.

To address the generation of unsafe videos, the paper proposes the Latent Variable Defense (LVD) method. Unlike existing defenses that filter the input prompt or the output result, LVD works during the model's internal sampling process: it leverages the deterministic properties of diffusion models and analyzes intermediate results to detect whether the video being generated is unsafe. Experiments show that LVD reaches a defense accuracy of 0.90 when faced with a large number of unsafe prompts, while reducing time and computational resource consumption by a factor of 10. LVD also achieved near-perfect accuracy in tests against adversarial prompts and on image-to-video diffusion models, which suggests that it generalizes across settings and can work alongside other defense mechanisms.

In summary, the paper demonstrates that VGMs can generate unsafe content and proposes a defense framework, LVD, that aims to stop such content during generation, contributing to the safe development of artificial intelligence.
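The idea of checking intermediate results during sampling, rather than the prompt or the finished video, can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the detector, the checkpoints, or the thresholds that LVD uses, so `sample_with_latent_check`, `make_toy_denoiser`, `toy_unsafe_score`, the checkpoint steps, and the threshold below are all hypothetical stand-ins. The toy denoiser is deterministic, and the loop stops as soon as an intermediate latent is scored as unsafe, which is where the savings in sampling steps would come from.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Latent = List[float]


def sample_with_latent_check(
    latent: Latent,
    denoise_step: Callable[[Latent, int], Latent],
    unsafe_score: Callable[[Latent], float],
    total_steps: int,
    check_steps: Sequence[int],
    threshold: float,
) -> Tuple[Optional[Latent], int]:
    """Run a denoising loop and abort early if an intermediate latent looks unsafe.

    Returns (final_latent, steps_executed); final_latent is None when blocked.
    """
    checkpoints = set(check_steps)
    for step in range(1, total_steps + 1):
        latent = denoise_step(latent, step)
        # Hypothetical checkpoint rule: score the latent only at chosen steps.
        if step in checkpoints and unsafe_score(latent) >= threshold:
            return None, step
    return latent, total_steps


def make_toy_denoiser(target: Latent, total_steps: int) -> Callable[[Latent, int], Latent]:
    """Toy deterministic denoiser (assumption): moves linearly toward `target`."""
    def step_fn(latent: Latent, step: int) -> Latent:
        alpha = 1.0 / (total_steps - step + 1)
        return [x + alpha * (t - x) for x, t in zip(latent, target)]
    return step_fn


def toy_unsafe_score(latent: Latent) -> float:
    """Toy detector (assumption): mean latent value clipped to [0, 1]."""
    mean = sum(latent) / len(latent)
    return min(1.0, max(0.0, mean))


if __name__ == "__main__":
    steps = 50
    start = [0.0, 0.0, 0.0, 0.0]
    for name, target in [("unsafe-like", [1.0] * 4), ("safe-like", [-1.0] * 4)]:
        result, used = sample_with_latent_check(
            start, make_toy_denoiser(target, steps), toy_unsafe_score,
            total_steps=steps, check_steps=[5, 10, 20], threshold=0.15,
        )
        status = "blocked" if result is None else "completed"
        print(f"{name}: {status} after {used} of {steps} steps")
```

In this toy setup the unsafe-like trajectory is stopped at an early checkpoint while the safe-like one runs to completion. A real system would replace the toy scorer with a trained classifier over the model's actual latents and would need to choose checkpoints that balance detection accuracy against the steps saved.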

Towards Understanding Unsafe Video Generation
