On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection

Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu
2024-10-31
Abstract: Large numbers of synthesized videos from diffusion models pose threats to information security and authenticity, leading to an increasing demand for generated content detection. However, existing video-level detection algorithms primarily focus on detecting facial forgeries and often fail to identify diffusion-generated content with a diverse range of semantics. To advance the field of video forensics, we propose an innovative algorithm named Multi-Modal Detection (MM-Det) for detecting diffusion-generated videos. MM-Det utilizes the profound perceptual and comprehensive abilities of Large Multi-modal Models (LMMs) by generating a Multi-Modal Forgery Representation (MMFR) from LMM's multi-modal space, enhancing its ability to detect unseen forgery content. Besides, MM-Det leverages an In-and-Across Frame Attention (IAFA) mechanism for feature augmentation in the spatio-temporal domain. A dynamic fusion strategy helps refine forgery representations for the fusion. Moreover, we construct a comprehensive diffusion video dataset, called Diffusion Video Forensics (DVF), across a wide range of forgery videos. MM-Det achieves state-of-the-art performance in DVF, demonstrating the effectiveness of our algorithm. Both source code and DVF are available at https://github.com/SparkleXFantasy/MM-Det.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of detecting videos generated by diffusion models, which threaten information security and authenticity. Existing video-level detection algorithms mainly target facial forgeries and often fail on diffusion-generated content with diverse semantics. The authors therefore propose a multi-modal detection algorithm, Multi-Modal Detection (MM-Det), aimed at improving detection of unseen forged content. The main contributions are:

1. **Proposing a new detection method**: MM-Det derives a Multi-Modal Forgery Representation (MMFR) from the multi-modal space of Large Multi-modal Models (LMMs), which improves detection of diffusion-generated videos and generalization to unseen forgery content.
2. **Introducing the In-and-Across Frame Attention (IAFA) mechanism**: By aggregating local and global patterns in forged videos, IAFA strengthens the detection of spatial artifacts and temporal inconsistencies.
3. **Constructing a large-scale dataset**: The Diffusion Video Forensics (DVF) dataset contains high-quality forged videos generated by various diffusion models, covers different resolutions and durations, and can serve as an open-world benchmark for video forgery detection.
4. **Achieving state-of-the-art detection performance**: MM-Det reaches state-of-the-art performance on DVF, and a detailed analysis shows that multi-modal representations are effective for detecting forged content, which opens directions for future multimedia forensics research.

Through these contributions, the paper addresses the shortcomings of current detection algorithms on diffusion-generated videos and advances video forgery detection.
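To make the idea of attending both within and across frames more concrete, here is a minimal NumPy sketch. It is not the authors' IAFA implementation: the token layout `(T, N, D)`, the absence of learned projections, the single attention head, and the function names `self_attention`, `in_and_across_frame_attention`, and `gated_fusion` are all assumptions made for illustration. The gated fusion at the end is likewise only one plausible reading of a "dynamic fusion strategy", with hypothetical gate logits standing in for whatever the model actually learns.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Single-head scaled dot-product self-attention over the
    # second-to-last axis. Assumption: queries, keys and values are the
    # tokens themselves (no learned projection matrices).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def in_and_across_frame_attention(video_tokens):
    # video_tokens: (T, N, D) = frames, patch tokens per frame, channels.
    # In-frame step: the N patches of each frame attend to one another,
    # which targets spatial patterns.
    in_frame = self_attention(video_tokens)            # (T, N, D)
    # Across-frame step: each patch position attends over the T frames,
    # which targets temporal patterns.
    per_patch = np.swapaxes(in_frame, 0, 1)            # (N, T, D)
    across = self_attention(per_patch)                 # (N, T, D)
    return np.swapaxes(across, 0, 1)                   # (T, N, D)


def gated_fusion(spatiotemporal_feat, multimodal_feat, gate_logits):
    # Hypothetical fusion: softmax over two gate logits yields convex
    # weights for the two feature vectors.
    w = softmax(np.asarray(gate_logits, dtype=float))
    return w[0] * spatiotemporal_feat + w[1] * multimodal_feat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 6, 8))   # 4 frames, 6 patches, 8 channels
    attended = in_and_across_frame_attention(tokens)
    video_feat = attended.mean(axis=(0, 1))            # pooled (D,) feature
    mmfr_feat = rng.normal(size=8)                     # stand-in for an MMFR vector
    fused = gated_fusion(video_feat, mmfr_feat, [0.0, 0.0])
    print(attended.shape, fused.shape)
```

Splitting attention into a spatial pass and a temporal pass keeps each attention matrix small (N×N or T×T rather than TN×TN), which is a common reason to factorize video attention this way; whether MM-Det factorizes it in the same manner is not stated in the abstract.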