DIP: Diffusion Learning of Inconsistency Pattern for General DeepFake Detection

Fan Nie, Jiangqun Ni, Jian Zhang, Bin Zhang, Weizhe Zhang
2024-10-31
Abstract: With the advancement of deepfake generation techniques, the importance of deepfake detection in protecting multimedia content integrity has become increasingly obvious. Recently, temporal inconsistency clues have been explored to improve the generalizability of deepfake video detection. According to our observation, the temporal artifacts of forged videos in terms of motion information usually exhibit quite distinct inconsistency patterns along the horizontal and vertical directions, which can be leveraged to improve the generalizability of detectors. In this paper, a transformer-based framework for Diffusion Learning of Inconsistency Pattern (DIP) is proposed, which exploits directional inconsistencies for deepfake video detection. Specifically, DIP begins with a spatiotemporal encoder to represent spatiotemporal information. A directional inconsistency decoder is adopted accordingly, where direction-aware attention and inconsistency diffusion are incorporated to explore potential inconsistency patterns and jointly learn the inherent relationships. In addition, the SpatioTemporal Invariant Loss (STI Loss) is introduced to contrast spatiotemporally augmented sample pairs and prevent the model from overfitting to nonessential forgery artifacts. Extensive experiments on several public datasets demonstrate that our method can effectively identify directional forgery clues and achieve state-of-the-art performance.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
This paper asks how deepfake video detection can be made more general and more accurate as generation techniques improve, so that the authenticity and integrity of multimedia content can be protected. To that end, it proposes Diffusion Learning of Inconsistency Pattern (DIP), a framework for general deepfake video detection.

### Main Problems

1. **Insufficient Generalization Ability in Deepfake Detection**: The performance of existing detection methods drops significantly in cross-dataset evaluation, especially on unseen facial manipulation techniques and data distributions.
2. **Insufficient Utilization of Spatio-Temporal Inconsistency Cues**: Previous studies do not fully combine spatial and temporal forgery cues, so their models capture forgery features incompletely, which limits generalization.
3. **Lack of Modeling of Directional Motion Inconsistency**: Existing methods do not fully exploit the motion inconsistencies along the horizontal and vertical directions of fake videos, even though these inconsistencies provide important forgery cues.

### Solutions

To address these problems, the paper proposes the following:

1. **DIP Framework**: A deepfake video detection framework that improves generalization by capturing the spatio-temporal inconsistencies along the horizontal and vertical directions of fake videos.
2. **Directional Cross-Attention (DiCA) Module**: Models the temporal inconsistencies along the horizontal and vertical directions of fake videos, and introduces directional interaction and diffusion learning to capture subtler forgery cues.
3. **Inconsistency Diffusion Module (IDM)**: Represents how temporal artifacts diffuse along the horizontal and vertical directions of fake videos, helping the model learn better spatio-temporal representations.
4. **Spatio-Temporal Invariant Loss (STI Loss)**: A loss function that, combined with spatio-temporal data augmentation, contrasts augmented sample pairs so that the model learns more representative forgery features instead of overfitting to nonessential artifacts, improving generalization and robustness.

### Method Overview

- **Spatio-Temporal Encoder (STE)**: Extracts the spatial and temporal features of the video and splits them into horizontal and vertical feature sequences through directional pooling.
- **Joint Directional Inconsistency Decoder (DID)**: Integrates the directional features and learns cross-directional inconsistency representations through the DiCA and IDM modules.
- **Multi-Directional Classifier (MDC)**: Uses the learned directional inconsistency features for the final real/fake classification.

With these components, DIP identifies fake videos effectively on several public datasets and, according to the authors, achieves state-of-the-art performance.
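The directional pooling step attributed to the Spatio-Temporal Encoder can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say which pooling operator is used, so average pooling is an assumption here, as are the function name `directional_pool` and the toy `(T, H, W)` layout without a channel dimension.

```python
from statistics import fmean


def directional_pool(feat):
    """Split a (T, H, W) feature volume into two directional sequences.

    feat is a nested list indexed as feat[t][h][w].
    Average pooling is an assumption; the source only says "directional pooling".

    Returns (horizontal, vertical) where
      horizontal[t][w] is the mean over rows h (one token per column), and
      vertical[t][h] is the mean over columns w (one token per row).
    """
    # zip(*frame) transposes a frame so that each item is one column.
    horizontal = [[fmean(col) for col in zip(*frame)] for frame in feat]
    vertical = [[fmean(row) for row in frame] for frame in feat]
    return horizontal, vertical


# Toy volume: T=2 frames, H=2 rows, W=3 columns.
feat = [
    [[1.0, 2.0, 3.0],
     [3.0, 4.0, 5.0]],
    [[0.0, 0.0, 0.0],
     [2.0, 4.0, 6.0]],
]
horizontal, vertical = directional_pool(feat)
```

Each frame yields one sequence of `W` tokens and one of `H` tokens. In a real model every token would be a feature vector, and the two sequences would then be passed to the decoder, where the DiCA and IDM modules relate them.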