On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection

Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu

2024-10-31

Abstract: Large numbers of synthesized videos from diffusion models pose threats to information security and authenticity, leading to an increasing demand for generated content detection. However, existing video-level detection algorithms primarily focus on detecting facial forgeries and often fail to identify diffusion-generated content with a diverse range of semantics. To advance the field of video forensics, we propose an innovative algorithm named Multi-Modal Detection (MM-Det) for detecting diffusion-generated videos. MM-Det utilizes the profound perceptual and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR) from the LMM's multi-modal space, enhancing its ability to detect unseen forgery content. Besides, MM-Det leverages an In-and-Across Frame Attention (IAFA) mechanism for feature augmentation in the spatio-temporal domain. A dynamic fusion strategy helps refine forgery representations for the fusion. Moreover, we construct a comprehensive diffusion video dataset, called Diffusion Video Forensics (DVF), across a wide range of forgery videos. MM-Det achieves state-of-the-art performance in DVF, demonstrating the effectiveness of our algorithm. Both source code and DVF are available at https://github.com/SparkleXFantasy/MM-Det.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to detect videos generated by diffusion models, in order to counter the threats these synthetic videos pose to information security and authenticity. Existing video-level detection algorithms mainly target facial forgeries and often fail to identify diffusion-generated content with diverse semantics. The authors therefore propose a multi-modal detection algorithm, Multi-Modal Detection (MM-Det), aimed at improving the detection of unseen forged content. The main contributions of the paper are:

1. **Proposing a new detection method**: MM-Det uses a Multi-Modal Forgery Representation (MMFR) generated from the multi-modal space of Large Multi-modal Models (LMMs) to detect diffusion-generated videos and to generalize to unseen forgery content.
2. **Introducing an In-and-Across Frame Attention (IAFA) mechanism**: By aggregating local and global patterns in forged videos, IAFA improves the detection of spatial artifacts and temporal inconsistencies.
3. **Constructing a large-scale dataset**: The Diffusion Video Forensics (DVF) dataset contains high-quality forged videos generated by various diffusion models, covers videos of different resolutions and durations, and can serve as an open-world benchmark for video forgery detection.
4. **Achieving state-of-the-art detection performance**: On the DVF dataset, MM-Det reaches state-of-the-art detection performance, and a detailed analysis demonstrates the effectiveness of multi-modal representations in detecting forged content.

Together, these contributions address the shortcomings of current detection algorithms on diffusion-generated videos and open new directions for multimedia forensics research.
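To make the idea of dynamically fusing several forgery representations more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`softmax`, `dynamic_fusion`), the scalar gating scheme, and the `gate_weights` parameter are all assumptions introduced for illustration. The sketch only shows one common way to weight feature vectors (for instance, an LMM-derived feature and a spatio-temporal feature) by input-dependent scores before summing them.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_fusion(
    features: Sequence[Sequence[float]],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse equally sized feature vectors with input-dependent weights.

    Hypothetical scheme (an assumption, not taken from the paper):
    each feature vector gets a scalar relevance score from a dot
    product with its own gate vector; the scores are normalized with
    a softmax and used as weights in a weighted sum of the features.
    """
    scores = [
        sum(w * x for w, x in zip(gate, feat))
        for gate, feat in zip(gate_weights, features)
    ]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(weights[k] * features[k][i] for k in range(len(features)))
        for i in range(dim)
    ]
    return fused, weights


# Toy usage: two 2-dimensional features with all-zero gates, so both
# scores are equal and the fusion reduces to a plain average.
toy_features = [[1.0, 0.0], [0.0, 1.0]]
toy_gates = [[0.0, 0.0], [0.0, 0.0]]
toy_fused, toy_weights = dynamic_fusion(toy_features, toy_gates)
```

In a learned model the gate vectors would be trainable parameters, so the weighting can shift toward whichever representation is more informative for a given video.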

Unified Video and Image Representation for Boosted Video Face Forgery Detection

Self-supervised Multi-Modal Video Forgery Attack Detection

MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection

Turns Out I'm Not Real: Towards Robust Detection of AI-Generated Videos

Identity-Driven Multimedia Forgery Detection via Reference Assistance

Real-Time Video Forgery Detection Via Vision-WiFi Silhouette Correspondence

Effective and efficient pixel-level detection for diverse video copy-move forgery types

MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection

Trinity Detector: text-assisted and attention mechanisms based spectral fusion for diffusion generation image detection

Multi-level feature disentanglement network for cross-dataset face forgery detection

AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

Research on video face forgery detection model based on multiple feature fusion network

Diffusion Facial Forgery Detection

Unveiling Universal Forensics of Diffusion Models with Adversarial Perturbations

UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization

On the detection of synthetic images generated by diffusion models

Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss

Exploring varying color spaces through representative forgery learning to improve deepfake detection

Federated Face Forgery Detection Learning with Personalized Representation