Abstract: Recent advances in generative models and the availability of large-scale benchmarks have made deepfake video generation and manipulation easier. The number of new hyper-realistic deepfake videos used for malicious purposes is increasing dramatically, creating the need for effective deepfake detection methods. Although many existing deepfake detection approaches, particularly CNN-based methods, show promising results, they generalize poorly to unseen or new deepfake generation methods. The main reason is that CNN-based methods focus on local spatial artifacts, which are unique to each manipulation method. It is therefore hard to learn the general forgery traces shared by different manipulation methods without considering dependencies that extend beyond the local receptive field. To address this problem, this paper proposes a framework that combines a CNN with a Vision Transformer (ViT) to improve detection accuracy and enhance generalizability. Our method, named HCiT, exploits the ability of CNNs to extract meaningful local features, together with the ViT's self-attention mechanism, which explicitly learns discriminative global contextual dependencies within a frame-level image. In this hybrid architecture, the high-level feature maps extracted by the CNN are fed into the ViT model, which determines whether a specific video is fake or real. Experiments were performed on the FaceForensics++, DeepFake Detection Challenge preview, and Celeb datasets, and the results show that the proposed method significantly outperforms state-of-the-art methods. In addition, HCiT shows a strong capacity for generalization on datasets covering various deepfake generation techniques. The source code is available at: https://github.com/KADDAR-Bachir/HCiT
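The hybrid data flow described in the abstract can be illustrated with a toy, standard-library-only sketch. It is not the authors' implementation (see the repository above for that): the feature map is random stand-in data rather than real CNN output, the attention layer uses identity query/key/value projections with no positional embeddings, and the names `feature_map_to_tokens`, `self_attention`, and `classify_frame` are hypothetical. It only shows how a C x H x W feature map becomes a token sequence whose class token can attend to every spatial position at once.

```python
import math
import random


def feature_map_to_tokens(fmap):
    """Flatten a C x H x W feature map into H*W tokens of dimension C."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption for brevity: query = key = value = the token itself
    (a trained ViT would use learned projection matrices).
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(attn[j] * tokens[j][d] for j in range(len(tokens)))
                        for d in range(dim)])
    return outputs, weights


def classify_frame(fmap, cls_token, head_weights, head_bias):
    """Prepend a class token, apply attention, and score the class token."""
    tokens = [cls_token] + feature_map_to_tokens(fmap)
    outputs, weights = self_attention(tokens)
    logit = sum(w * x for w, x in zip(head_weights, outputs[0])) + head_bias
    prob_fake = 1.0 / (1.0 + math.exp(-logit))
    return prob_fake, weights[0]


# Random stand-in for a high-level CNN feature map (4 channels, 3 x 3 grid).
rng = random.Random(0)
C, H, W = 4, 3, 3
fmap = [[[rng.uniform(-1, 1) for _ in range(W)] for _ in range(H)]
        for _ in range(C)]
cls_token = [0.1] * C                                   # untrained placeholder
head_weights = [rng.uniform(-1, 1) for _ in range(C)]   # untrained placeholder

prob_fake, cls_attention = classify_frame(fmap, cls_token, head_weights, 0.0)
print(f"tokens: {H * W} patches + 1 class token")
print(f"frame-level fake probability (untrained): {prob_fake:.3f}")
```

The class token's attention row (`cls_attention`) has one weight per token, so a single layer already mixes information from every spatial location, in contrast to a convolution's local receptive field. How frame-level scores are combined into a video-level decision is not stated in the abstract and is left out of this sketch.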
