Abstract: This paper presents the dual‐stream frequency‐spatial fusion network for deepfake detection, which integrates spatial‐domain and frequency‐domain features to improve detection accuracy and robustness. The network comprises a spatial forgery feature extraction module, a frequency forgery feature extraction module, and a spatial‐frequency feature fusion module, and uses attention mechanisms to extract and fuse features. Extensive experiments demonstrate that the dual‐stream frequency‐spatial fusion network outperforms existing methods, offering superior generalization and robustness across various deepfake datasets.

In recent years, face forgery detection has gained significant attention and advanced considerably. However, most existing methods rely on CNNs to extract artefacts from the spatial domain and overlook the frequency‐domain artefacts that pervade deepfake content, which makes robust and generalized detection difficult. To address these issues, we propose the dual‐stream frequency‐spatial fusion network for deepfake detection. The network consists of three components: the spatial forgery feature extraction module, the frequency forgery feature extraction module, and the spatial‐frequency feature fusion module. The spatial forgery feature extraction module employs spatial‐channel attention to extract features that capture artefacts in the spatial domain. The frequency forgery feature extraction module uses focused linear attention to detect frequency‐domain anomalies in internal regions, enabling the identification of generated content. The spatial‐frequency feature fusion module then fuses the forgery features extracted from both domains, facilitating accurate detection of both splicing artefacts and internally generated forgeries. This design enables the model to capture forgery characteristics more accurately. Extensive experiments on several widely used benchmarks demonstrate that the proposed network exhibits superior generalization and robustness, significantly improving deepfake detection performance.
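
As a rough illustration of the three-module layout described in the abstract, the NumPy sketch below runs a spatial stream with channel and spatial gating, a frequency stream that applies linear attention to the log-amplitude spectrum, and a fusion step. It is not the authors' implementation. The function names, the parameter-free sigmoid gates, the random projection matrices, the focusing exponent `p = 3`, and the concatenation-based fusion are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x):
    """Gate each channel of x (C, H, W) by a sigmoid of its global average.
    Assumption: parameter-free gate standing in for a learned one."""
    return x * _sigmoid(x.mean(axis=(1, 2)))[:, None, None]


def spatial_attention(x):
    """Gate each location of x (C, H, W) by a sigmoid of its channel mean."""
    return x * _sigmoid(x.mean(axis=0))[None, :, :]


def log_amplitude_spectrum(x):
    """Per-channel 2D FFT, centred, returned as log(1 + |F|)."""
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(-2, -1)), axes=(-2, -1))
    return np.log1p(np.abs(spec))


def focus(x, p=3, eps=1e-6):
    """Power-based focusing map: sharpen non-negative features while
    keeping each row's norm unchanged (illustrative choice of kernel)."""
    x = np.maximum(x, 0.0) + eps
    xp = x ** p
    scale = np.linalg.norm(x, axis=-1, keepdims=True) / np.linalg.norm(
        xp, axis=-1, keepdims=True
    )
    return xp * scale


def linear_attention(q, k, v, p=3):
    """Linear-complexity attention: phi(Q) (phi(K)^T V), row-normalised.
    q, k: (N, d); v: (N, d_v). No N x N matrix is ever formed."""
    q, k = focus(q, p), focus(k, p)
    kv = k.T @ v                  # (d, d_v) summary of keys and values
    z = q @ k.sum(axis=0)         # (N,) normaliser, strictly positive
    return (q @ kv) / z[:, None]


def dual_stream_features(img, rng):
    """img: (C, H, W). Returns a fused feature vector of length 2 * C."""
    c, h, w = img.shape

    # Spatial stream: channel gating followed by spatial gating, then pooling.
    spatial_vec = spatial_attention(channel_attention(img)).mean(axis=(1, 2))

    # Frequency stream: one token per frequency bin, attended linearly.
    tokens = log_amplitude_spectrum(img).reshape(c, h * w).T   # (N, C)
    # Assumption: random projections stand in for learned Q/K/V weights.
    wq, wk, wv = (rng.standard_normal((c, c)) for _ in range(3))
    freq_vec = linear_attention(tokens @ wq, tokens @ wk, tokens @ wv).mean(axis=0)

    # Fusion: plain concatenation as a placeholder for a learned fusion module.
    return np.concatenate([spatial_vec, freq_vec])
```

Calling `dual_stream_features(img, np.random.default_rng(0))` on a `(3, 8, 8)` array returns a vector of length 6. In a trained model the gates, projections, and fusion step would be learned, and a classifier head would map the fused vector to a real/fake score.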

Augmented Multi-Scale Spatiotemporal Inconsistency Magnifier for Generalized DeepFake Detection

Mining Generalized Multi-timescale Inconsistency for Detecting Deepfake Videos

Spatiotemporal Inconsistency Learning for DeepFake Video Detection

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Unearthing Common Inconsistency for Generalisable Deepfake Detection

Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies

Dynamic Difference Learning with Spatio-temporal Correlation for Deepfake Video Detection

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection

Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Detecting Deepfake by Creating Spatio-Temporal Regularity Disruption

Dynamic Inconsistency-aware DeepFake Video Detection

Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection

Video Detection Method Based on Temporal and Spatial Foundations for Accurate Verification of Authenticity

Learning spatial‐frequency interaction for generalizable deepfake detection

Spatio-temporal Features for Generalized Detection of Deepfake Videos

Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss

AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection

MMNet: Multi-Collaboration and Multi-Supervision Network for Sequential Deepfake Detection