Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos

Qingyuan Liu, Pengyuan Shi, Yun-Yun Tsai, Chengzhi Mao, Junfeng Yang
2024-06-14
Abstract: The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Recent works to combat Deepfake videos have developed detectors that are highly accurate at identifying GAN-generated samples. However, the robustness of these detectors on diffusion-generated videos from video creation tools (e.g., SORA by OpenAI, Runway Gen-2, and Pika) is still unexplored. In this paper, we propose a novel framework for detecting videos synthesized from multiple state-of-the-art (SOTA) generative models, such as Stable Video Diffusion. We find that the SOTA methods for detecting diffusion-generated images lack robustness in identifying diffusion-generated videos. Our analysis reveals that the effectiveness of these detectors diminishes when applied to out-of-domain videos, primarily because they struggle to track the temporal features and dynamic variations between frames. To address this challenge, we collect a new benchmark video dataset for diffusion-generated videos using SOTA video creation tools. We extract representations of video frames using explicit knowledge from the diffusion model and train our detector with a CNN + LSTM architecture. The evaluation shows that our framework captures the temporal features between frames well, achieves 93.7% detection accuracy for in-domain videos, and improves the accuracy on out-of-domain videos by up to 16 points.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limitations of existing detectors when applied to AI-generated videos, especially videos produced by diffusion models. Although existing detection methods can effectively identify Deepfake videos created with Generative Adversarial Networks (GANs), they perform poorly on diffusion-generated videos produced by tools such as OpenAI's SORA, Runway Gen-2, and Pika. The paper shows that the performance of these detectors declines on out-of-domain videos because they fail to capture temporal features and dynamic changes between frames. To address this, the paper proposes a new framework called DIVID for detecting videos synthesized by various advanced generative models, such as Stable Video Diffusion. DIVID uses explicit knowledge from diffusion models to extract the diffusion reconstruction error (DIRE) of each video frame, and captures temporal features between frames with a combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture. Experimental results show that DIVID achieves 93.7% accuracy on in-domain videos and improves accuracy on out-of-domain videos by up to 16 percentage points. The paper also contributes a new benchmark dataset of videos generated by different video generation tools to facilitate research on diffusion-generated video detection. Finally, the authors analyze how the choice of diffusion steps and DDIM steps affects detection, in support of DIVID's improved generalization.
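
To make the per-frame reconstruction-error step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `dire_map`, `video_dire_sequence`, `mean_error`, and `toy_reconstruct` are hypothetical, frames are small nested lists rather than image tensors, and `toy_reconstruct` is a stand-in for the diffusion model's inversion and reconstruction pass, which is not reproduced here. The sketch only shows the shape of the computation: one absolute-error map per frame, kept in temporal order so that a sequence model such as the CNN + LSTM described above could consume it.

```python
from typing import Callable, List

# One grayscale frame as rows of pixel values in [0, 1] (assumed layout).
Frame = List[List[float]]


def dire_map(frame: Frame, reconstruct: Callable[[Frame], Frame]) -> Frame:
    """Per-pixel absolute error between a frame and its reconstruction."""
    recon = reconstruct(frame)
    return [
        [abs(p - q) for p, q in zip(row, recon_row)]
        for row, recon_row in zip(frame, recon)
    ]


def video_dire_sequence(
    frames: List[Frame], reconstruct: Callable[[Frame], Frame]
) -> List[Frame]:
    """Compute one error map per frame, preserving temporal order."""
    return [dire_map(frame, reconstruct) for frame in frames]


def mean_error(dire: Frame) -> float:
    """Average reconstruction error of one frame (a crude scalar summary)."""
    values = [v for row in dire for v in row]
    return sum(values) / len(values)


def toy_reconstruct(frame: Frame) -> Frame:
    """Hypothetical stand-in for the diffusion model: it 'reconstructs'
    a frame as its own mean intensity, so flat frames come back exactly."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [[mean for _ in row] for row in frame]


video = [
    [[0.5, 0.5], [0.5, 0.5]],  # flat frame: reconstructed exactly
    [[0.0, 1.0], [1.0, 0.0]],  # high-contrast frame: large error
]
sequence = video_dire_sequence(video, toy_reconstruct)
per_frame_error = [mean_error(d) for d in sequence]
print(per_frame_error)  # [0.0, 0.5]
```

In a real pipeline the error maps would be full-resolution image tensors and the reconstruction would come from an actual diffusion model; the ordered `sequence` is what a temporal classifier would take as input.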