Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos

Qingyuan Liu, Pengyuan Shi, Yun-Yun Tsai, Chengzhi Mao, Junfeng Yang

2024-06-14

Abstract: The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Recent work on combating Deepfake videos has produced detectors that are highly accurate at identifying GAN-generated samples. However, the robustness of these detectors on diffusion-generated videos produced by video creation tools (e.g., SORA by OpenAI, Runway Gen-2, and Pika) is still unexplored. In this paper, we propose a novel framework for detecting videos synthesized by multiple state-of-the-art (SOTA) generative models, such as Stable Video Diffusion. We find that the SOTA methods for detecting diffusion-generated images lack robustness in identifying diffusion-generated videos. Our analysis reveals that the effectiveness of these detectors diminishes when applied to out-of-domain videos, primarily because they struggle to track the temporal features and dynamic variations between frames. To address this challenge, we collect a new benchmark dataset of diffusion-generated videos using SOTA video creation tools. We extract representations based on explicit knowledge from the diffusion model for video frames and train our detector with a CNN + LSTM architecture. The evaluation shows that our framework captures the temporal features between frames well, achieves 93.7% detection accuracy for in-domain videos, and improves the accuracy on out-of-domain videos by up to 16 points.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the limitations of current detectors for AI-generated videos, especially videos generated by diffusion models. Although existing detection methods can effectively identify Deepfake videos produced by Generative Adversarial Networks (GANs), their robustness on diffusion-generated videos from tools like OpenAI's SORA, Runway Gen-2, and Pika has not been explored, and the paper finds that SOTA detectors for diffusion-generated images do not transfer robustly to such videos. The paper points out that these detectors lose performance on out-of-domain videos because they struggle to capture temporal features and dynamic changes between frames. To address this issue, the paper proposes a new framework called DIVID for detecting videos synthesized by various advanced generative models, such as Stable Video Diffusion. DIVID uses explicit knowledge from diffusion models to extract the Diffusion Reconstruction Error (DIRE) of video frames and captures temporal features between frames with a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture. Experimental results show that DIVID achieves an accuracy of 93.7% on in-domain videos and improves out-of-domain detection accuracy by up to 16 percentage points. Additionally, the paper creates a new benchmark video dataset containing videos generated by different video generation tools to facilitate research on diffusion-generated video detection. Through an analysis of different diffusion steps and DDIM steps, the authors examine how these settings affect the detector's generalization ability.
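The per-frame reconstruction-error idea can be illustrated with a minimal sketch. It assumes the error is the per-pixel absolute difference between a frame and its reconstruction, and it is not the authors' implementation. The names `dire`, `video_dire`, `mean_error`, and `toy_reconstruct` are hypothetical. In a real pipeline the reconstruction would presumably come from a pretrained diffusion model (inverting the frame and denoising it again with DDIM); here a toy row-averaging function stands in for that step so the example runs without any model.

```python
from typing import Callable, List, Sequence

# A grayscale frame as rows of pixel values in [0, 1] (toy representation).
Frame = List[List[float]]


def dire(frame: Frame, reconstruct: Callable[[Frame], Frame]) -> Frame:
    """Per-pixel absolute error between a frame and its reconstruction."""
    recon = reconstruct(frame)
    return [
        [abs(p - q) for p, q in zip(row, recon_row)]
        for row, recon_row in zip(frame, recon)
    ]


def video_dire(frames: Sequence[Frame],
               reconstruct: Callable[[Frame], Frame]) -> List[Frame]:
    """Error maps for every frame, kept in temporal order so a sequence
    model (for example a CNN followed by an LSTM) could consume them."""
    return [dire(frame, reconstruct) for frame in frames]


def mean_error(error_map: Frame) -> float:
    """Average error over all pixels of one map."""
    total = sum(sum(row) for row in error_map)
    count = sum(len(row) for row in error_map)
    return total / count


def toy_reconstruct(frame: Frame) -> Frame:
    """ASSUMPTION: stand-in for diffusion inversion plus denoising.
    Replaces every pixel with the mean of its row."""
    return [[sum(row) / len(row)] * len(row) for row in frame]


if __name__ == "__main__":
    frames = [
        [[0.0, 1.0], [0.5, 0.5]],    # high-contrast first row
        [[0.25, 0.25], [0.0, 0.0]],  # flat frame, reconstructed exactly
    ]
    for index, error_map in enumerate(video_dire(frames, toy_reconstruct)):
        print(index, mean_error(error_map))
```

A frame that the reconstruction step reproduces closely yields a small error map, and one it reproduces poorly yields a large one. A detector would learn from these maps across consecutive frames rather than from a single averaged number; `mean_error` is only there to make the toy output easy to read.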

What Matters in Detecting AI-Generated Videos like Sora?

Distinguish Any Fake Videos: Unleashing the Power of Large-scale Data and Motion Features

Exposing AI-generated Videos: A Benchmark Dataset and a Local-and-Global Temporal Defect Based Detection Method

The Tug-of-War Between Deepfake Generation and Detection

On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection

Time Step Generating: A Universal Synthesized Deepfake Image Detector

Diffusion Deepfake

Beyond Deepfake Images: Detecting AI-Generated Videos

AI-Generated Video Detection via Spatio-Temporal Anomaly Learning

Undercover Deepfakes: Detecting Fake Segments in Videos

AI-Generated Video Content Detection Using Vision Language Models

Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection

Let Real Images be as a Judger, Spotting Fake Images Synthesized with Generative Models

DistilDIRE: A Small, Fast, Cheap and Lightweight Diffusion Synthesized Deepfake Detection

Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis

Robustness and Generalizability of Deepfake Detection: A Study with Diffusion Models

Detecting images generated by diffusers

Towards the Detection of AI-Synthesized Human Face Images

Detecting Deepfake by Creating Spatio-Temporal Regularity Disruption

StealthDiffusion: Towards Evading Diffusion Forensic Detection through Diffusion Model