Abstract: Previous deepfake detection methods mostly depend on low-level textural features, which are vulnerable to perturbations, and fall short of detecting unseen forgery methods. In contrast, high-level semantic features are less susceptible to perturbations and are not limited to forgery-specific artifacts, so they generalize better. Motivated by this, we propose a detection method that uses high-level semantic features of faces to identify inconsistencies in the temporal domain. We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video classification network, initialized with a meta-functional face encoder for enriched facial representation. In this way, we take advantage of both a powerful spatio-temporal model and the high-level semantic information of faces. Furthermore, to leverage easily accessible real face data and guide the model to focus on spatio-temporal features, we design a Dynamic Video Self-Blending (DVSB) method that efficiently generates training samples with diverse spatio-temporal forgery traces from real facial videos. Based on this, we train our framework in two stages. The first stage employs a novel self-supervised contrastive learning scheme, in which we encourage the network to focus on forgery traces by pushing videos generated by the same forgery process toward similar representations. Building on the representation learned in the first stage, the second stage fine-tunes the network on a face forgery detection dataset to build a deepfake detector. Extensive experiments validate that UniForensics outperforms existing face forgery detection methods in generalization ability and robustness. In particular, our method achieves 95.3% and 77.2% cross-dataset AUC on the challenging Celeb-DFv2 and DFDC datasets, respectively.
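The first-stage objective described in the abstract, where videos generated by the same forgery process are pushed toward similar representations, can be illustrated with a small sketch. The abstract does not give the exact loss, so the code below assumes a generic supervised-contrastive (SupCon-style) formulation rather than the authors' own. The function name `same_process_contrastive_loss`, the `process_ids` labels, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def same_process_contrastive_loss(embeddings, process_ids, temperature=0.1):
    """SupCon-style loss (an assumption, not the paper's stated loss).

    embeddings:  one feature vector per video clip.
    process_ids: hypothetical label per clip naming the forgery process
                 that produced it; clips sharing an id are positives.
    """
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and process_ids[j] == process_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other clip.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all non-anchor clips.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(s - peak) for s in logits.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings: the first two clips lie close together, as do the last two.
clips = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = same_process_contrastive_loss(clips, [0, 0, 1, 1])
misaligned = same_process_contrastive_loss(clips, [0, 1, 0, 1])
```

In this toy case the loss is lower when the process labels agree with the geometry of the embeddings (`aligned`) than when they cut across it (`misaligned`), which is the behavior such an objective is meant to reward.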

ADT: Anti-Deepfake Transformer

Deep Convolutional Pooling Transformer for Deepfake Detection

DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer

Hybrid Transformer Network for Deepfake Detection

Refining Localized Attention Features with Multi-Scale Relationships for Enhanced Deepfake Detection in Spatial-Frequency Domain

Multi-feature fusion based face forgery detection with local and global characteristics

Self-Supervised Graph Transformer for Deepfake Detection

Multi-attentional Deepfake Detection

DeepFake detection algorithm based on improved vision transformer

Deepfake Detection Scheme Based on Vision Transformer and Distillation

Transformer-based cascade networks with spatial and channel reconstruction convolution for deepfake detection

Deepfake Detection with Deep Learning: Convolutional Neural Networks versus Transformers

Deepfake Video Detection Using Convolutional Vision Transformer

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

Self-supervised Transformer for Deepfake Detection

UniForensics: Face Forgery Detection via General Facial Representation

Deepfake Video Detection with Spatiotemporal Dropout Transformer

Common Forgery Artifact Driven Deepfake Face Detection

DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection

Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection