Abstract: Rapid advances in deep learning and large-scale AI models have made it easier to create deepfakes, which generate, edit, and replace faces in images and videos. This growing ease of use has turned the malicious use of forged faces into a significant threat and has made deepfake detection harder. Current deepfake detection methods, which predominantly rely on data-driven CNN classification models, have achieved notable success, but they generalize poorly and lack robustness to novel data unseen during training. To address these challenges, this paper introduces a novel detection framework, ReLAF-Net. The framework employs a restricted self-attention mechanism that flexibly applies self-attention to deep CNN features, enabling it to learn local relationships and inter-regional dependencies at both fine-grained and global levels. The attention mechanism is modular and can be integrated into CNN networks to improve overall detection performance. Additionally, we propose an adaptive local frequency feature extraction algorithm that decomposes RGB images into fine-grained frequency domains in a data-driven manner, effectively isolating forgery indicators in the frequency space. Moreover, we develop an attention-based channel fusion strategy that combines RGB and frequency information into a comprehensive facial representation. On the high-quality version of the FaceForensics++ dataset, our method attains a detection accuracy of 97.92%, outperforming the compared approaches. Cross-dataset validation on Celeb-DF, DFDC, and DFD confirms its robust generalizability, offering a new solution for detecting high-quality deepfake videos.
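To make the idea of restricting self-attention to local regions of a CNN feature map more concrete, the following is a minimal sketch, not ReLAF-Net's actual mechanism. It assumes non-overlapping square windows, identity query/key/value projections, and a feature map given as nested lists; the function name `windowed_self_attention` and the `window` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_self_attention(fmap, window):
    """Scaled dot-product self-attention restricted to local windows.

    fmap   : H x W x C feature map as nested lists (e.g. deep CNN features).
    window : side length of each non-overlapping square window (assumption).

    Simplification: queries, keys, and values are the features themselves
    (identity projections); a trained model would use learned projections.
    """
    H, W, C = len(fmap), len(fmap[0]), len(fmap[0][0])
    if H % window or W % window:
        raise ValueError("H and W must be divisible by window")
    out = [[None] * W for _ in range(H)]
    scale = 1.0 / math.sqrt(C)
    for top in range(0, H, window):
        for left in range(0, W, window):
            # Collect the positions (tokens) that belong to this window.
            coords = [(top + i, left + j)
                      for i in range(window) for j in range(window)]
            tokens = [fmap[r][c] for r, c in coords]
            for (r, c), q in zip(coords, tokens):
                # Attention scores of this position against its window only.
                scores = [scale * sum(a * b for a, b in zip(q, k))
                          for k in tokens]
                weights = softmax(scores)
                # Output is the attention-weighted average of window tokens.
                out[r][c] = [sum(w * v[d] for w, v in zip(weights, tokens))
                             for d in range(C)]
    return out
```

Under these assumptions, a small `window` limits each position to fine-grained local relationships, while a window that spans the whole feature map reduces to ordinary global self-attention. The output has the same shape as the input, which is what would let such a block be placed between existing CNN layers.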

Towards generalisable and calibrated synthetic speech detection with self-supervised representations

Ghost-in-Wave: How Speaker-Irrelative Features Interfere DeepFake Voice Detectors

Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

Transferring Audio Deepfake Detection Capability Across Languages

Does Audio Deepfake Detection Generalize?

Deepfake audio detection by speaker verification

Characterizing the temporal dynamics of universal speech representations for generalizable deepfake detection

Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection

Towards generalizing deep-audio fake detection networks

I Can Hear You: Selective Robust Training for Deepfake Audio Detection

FakeSound: Deepfake General Audio Detection

Voice Deepfake Detection Using the Self-Supervised Pre-Training Model HuBERT

Refining Localized Attention Features with Multi-Scale Relationships for Enhanced Deepfake Detection in Spatial-Frequency Domain

AVForensics: Audio-driven Deepfake Video Detection with Masking Strategy in Self-supervision

How Generalizable are Deepfake Image Detectors? An Empirical Study

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

Self-supervised Transformer for Deepfake Detection

Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

A robust audio deepfake detection system via multi-view feature

Efficient Deepfake Audio Detection Using Spectro-Temporal Analysis and Deep Learning