Abstract: With the rapid development of face forgery techniques, existing frame-based deepfake video detection methods face a dilemma: they may fail when encountering extremely realistic images. To overcome this problem, many approaches model the spatio-temporal inconsistency of videos to distinguish real videos from fake ones. However, existing works model spatio-temporal inconsistency by combining intra-frame and inter-frame information while ignoring the disturbance caused by facial motions, which limits further improvement in detection performance. To address this issue, we investigate long-range and short-range inter-frame motions and propose a novel dynamic difference learning method that distinguishes the inter-frame differences caused by face manipulation from those caused by facial motions, in order to model precise spatio-temporal inconsistency for deepfake video detection. Moreover, we design a dynamic fine-grained difference capture module (DFDC-module) and a multi-scale spatio-temporal aggregation module (MSA-module) that collaboratively model spatio-temporal inconsistency. Specifically, the DFDC-module applies a self-attention mechanism and a fine-grained denoising operation to eliminate the differences caused by facial motions and generates long-range difference attention maps. The MSA-module aggregates multi-direction and multi-scale temporal information to model spatio-temporal inconsistency. Existing 2D CNNs can be extended into dynamic spatio-temporal inconsistency capture networks by integrating the two proposed modules. Extensive experimental results demonstrate that the proposed algorithm consistently outperforms state-of-the-art methods by a clear margin on different benchmark datasets.
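To make the idea of short-range and long-range inter-frame differences concrete, the following is a minimal toy sketch and not the authors' DFDC-module. It assumes grayscale frames stored as nested Python lists, and it substitutes a fixed threshold for the learned self-attention and denoising that the abstract mentions but does not specify. The function names (`frame_difference`, `suppress_small`, `attention_map`), the `step` parameter, and the threshold value are all hypothetical.

```python
def frame_difference(frames, step):
    """Absolute per-pixel difference between frames that are `step` apart.

    step=1 gives short-range differences; a larger step gives long-range ones.
    """
    return [
        [[abs(b - a) for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(frames[t], frames[t + step])]
        for t in range(len(frames) - step)
    ]


def suppress_small(diffs, threshold):
    """Zero out differences below `threshold`.

    Assumption: a crude stand-in for the learned denoising step; the real
    module is not described at this level of detail in the abstract.
    """
    return [[[v if v >= threshold else 0.0 for v in row] for row in frame]
            for frame in diffs]


def attention_map(diffs):
    """Average the differences over time and rescale them to [0, 1]."""
    n = len(diffs)
    height, width = len(diffs[0]), len(diffs[0][0])
    mean = [[sum(d[i][j] for d in diffs) / n for j in range(width)]
            for i in range(height)]
    peak = max(max(row) for row in mean)
    if peak == 0:
        return mean  # no surviving differences, so the map is all zeros
    return [[v / peak for v in row] for row in mean]


# Four hypothetical 2x2 frames: pixel (0, 0) drifts slowly,
# pixel (1, 1) flickers between 0.0 and 0.5 on alternate frames.
frames = [[[0.10 + 0.01 * t, 0.2], [0.3, 0.5 * (t % 2)]] for t in range(4)]

short_range = suppress_small(frame_difference(frames, step=1), threshold=0.05)
long_range = suppress_small(frame_difference(frames, step=2), threshold=0.05)
short_attention = attention_map(short_range)
```

In this toy data the slow drift at pixel (0, 0) falls below the threshold and is discarded, while the flicker at pixel (1, 1) dominates the short-range map. A fixed threshold cannot separate real facial motion from manipulation artifacts in practice, which is the gap the learned modules described above are meant to address.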

Dynamic Difference Learning with Spatio-temporal Correlation for Deepfake Video Detection

Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection

Towards Spatio-temporal Collaborative Learning: An End-to-End Deepfake Video Detection Framework

Dynamic Inconsistency-aware DeepFake Video Detection

Spatiotemporal Inconsistency Learning for DeepFake Video Detection

Exploiting Complementary Dynamic Incoherence for DeepFake Video Detection

Detection of Deepfake Videos Using Long-Distance Attention

Refining Localized Attention Features with Multi-Scale Relationships for Enhanced Deepfake Detection in Spatial-Frequency Domain

A Temporal Consistency Learning Framework for Face Forgery Detection

Unearthing Common Inconsistency for Generalisable Deepfake Detection

Dual-Modality Co-Learning for Unveiling Deepfake in Spatio-Temporal Space

Learning to Detect Deepfakes via Adaptive Attention and Constrained Difference

Temporal Consistency Based Deep Face Forgery Detection Network

Double-Stream Segmentation Network with Temporal Self-attention for Deepfake Video Detection

Deepfake Video Detection Via Predictive Representation Learning

Face Forgery Detection Based on Fine-grained Clues and Noise Inconsistency

Detecting Deepfake Videos Based on Spatiotemporal Attention and Convolutional LSTM

Exposing Deepfake Videos with Spatial, Frequency and Multi-scale Temporal Artifacts

Hierarchical Contrastive Inconsistency Learning for Deepfake Video Detection

Analyzing temporal coherence for deepfake video detection

Spatio-temporal Features for Generalized Detection of Deepfake Videos