Abstract: The proliferation of fake images generated by deepfake techniques has significantly threatened the trustworthiness of digital information, creating a pressing need for face forgery detection. However, because human face images are highly similar to one another and forgery artifacts are subtle, most deep face forgery detection methods suffer from incomplete extraction of artifact information, limited performance on low-quality forgeries, and insufficient generalization across datasets. To address these issues, this paper proposes a novel noise-aware multi-scale deepfake detection model. First, a progressive spatial attention module is introduced, which learns two types of spatial feature weights: a boosting weight and a suppression weight. The boosting weight highlights salient regions, while the suppression weight masks those regions so that the model can capture more subtle artifact information elsewhere. Through multiple boosting-suppression stages, the proposed model progressively focuses on different facial regions and extracts multi-scale RGB features. Second, a noise-aware two-stream network is introduced, which leverages frequency-domain features and fuses image noise with the multi-scale RGB features. This integration enhances the model's ability to handle post-processed images. Third, the model learns global features from the multi-modal features through multiple convolutional layers and combines them with local similarity features for deepfake detection, thereby improving the model's robustness. Experimental results on several benchmark databases demonstrate the superiority of the proposed method over state-of-the-art techniques. Our contributions lie in the progressive spatial attention module, which effectively addresses overfitting in CNNs, and in the integration of noise-aware features with multi-scale RGB features. These innovations lead to enhanced accuracy and generalization performance in face forgery detection.
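The boosting-suppression idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract states that both weights are learned, whereas here the attention map is a hand-crafted channel mean, and the function names (`spatial_attention`, `boost_and_suppress`, `progressive_stages`) and the parameters `threshold` and `alpha` are hypothetical choices made only for illustration. The sketch also keeps a single spatial resolution, so it shows the progressive shift of focus rather than the multi-scale feature extraction.

```python
import numpy as np


def spatial_attention(features):
    """Collapse a (C, H, W) feature map into an (H, W) map scaled to [0, 1].

    Assumption: a channel mean stands in for the learned attention
    described in the abstract.
    """
    energy = features.mean(axis=0)
    lo, hi = energy.min(), energy.max()
    return (energy - lo) / (hi - lo + 1e-8)


def boost_and_suppress(features, threshold=0.5, alpha=0.5):
    """Return (boosted, suppressed) versions of a (C, H, W) feature map.

    threshold and alpha are hypothetical hyperparameters of this sketch.
    """
    attn = spatial_attention(features)
    # Boosting weight: amplify the currently salient positions.
    boost_w = 1.0 + alpha * attn
    # Suppression weight: zero out salient positions so that the next
    # stage has to look at other regions.
    suppress_w = np.where(attn > threshold, 0.0, 1.0)
    return features * boost_w, features * suppress_w


def progressive_stages(features, num_stages=3):
    """Apply several boosting-suppression stages and collect boosted maps."""
    outputs = []
    current = features
    for _ in range(num_stages):
        boosted, current = boost_and_suppress(current)
        outputs.append(boosted)
    return outputs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.random((4, 8, 8)) + 0.1  # toy positive-valued feature map
    stages = progressive_stages(feats, num_stages=3)
    print(len(stages), stages[0].shape)
```

Because each stage zeroes the positions that the previous stage found most salient, the peak of the attention map moves to a different location from one stage to the next, which is the behavior the abstract attributes to the progressive module.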

M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer

Adt: anti-deepfake transformer

Refining Localized Attention Features with Multi-Scale Relationships for Enhanced Deepfake Detection in Spatial-Frequency Domain

Deep Convolutional Pooling Transformer for Deepfake Detection

DeepFake detection algorithm based on improved vision transformer

Transformer-based cascade networks with spatial and channel reconstruction convolution for deepfake detection

Spatial-temporal Transformer Network for Protecting Person-of-interest from Deepfaking

DeepFake detection method based on multi-scale interactive dual-stream network

Multi-attentional Deepfake Detection

MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection

Multi-feature fusion based face forgery detection with local and global characteristics

Noise-aware progressive multi-scale deepfake detection

SRTNet: a spatial and residual based two-stream neural network for deepfakes detection

AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies

Delving into Sequential Patches for Deepfake Detection

Deepfake Video Detection with Spatiotemporal Dropout Transformer

$\textit{X}^2$-DFD: A framework for e${X}$plainable and e${X}$tendable Deepfake Detection

F2Trans: High-Frequency Fine-Grained Transformer for Face Forgery Detection

ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection