Deepfake Detection Using Spatiotemporal Transformer

Bachir Kaddar, Sid Ahmed Fezza, Zahid Akhtar, Wassim Hamidouche, Abdenour Hadid, Joan Serra-Sagristà

DOI: https://doi.org/10.1145/3643030

2024-01-23

Abstract: Recent advances in generative models and the availability of large-scale benchmarks have made deepfake video generation and manipulation easier. The number of new hyper-realistic deepfake videos used for malicious purposes is increasing dramatically, creating the need for effective deepfake detection methods. Although many existing deepfake detection approaches, particularly CNN-based methods, show promising results, they suffer from several drawbacks. In particular, they generalize poorly to unseen/new deepfake generation methods. The main reason is that CNN-based methods focus on local spatial artifacts, which are unique to each manipulation method. It is therefore hard to learn the general forgery traces of different manipulation methods without considering the dependencies that extend beyond the local receptive field. To address this problem, this paper proposes a framework that combines a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) to improve detection accuracy and enhance generalizability. Our method, named HCiT, exploits the advantages of CNNs to extract meaningful local features, as well as the ViT's self-attention mechanism to explicitly learn discriminative global contextual dependencies in a frame-level image. In this hybrid architecture, the high-level feature maps extracted from the CNN are fed into the ViT model, which determines whether a specific video is fake or real. Experiments were performed on the FaceForensics++, DeepFake Detection Challenge preview, and Celeb datasets, and the results show that the proposed method significantly outperforms state-of-the-art methods. In addition, the HCiT method shows a great capacity for generalization on datasets covering various techniques of deepfake generation. The source code is available at: https://github.com/KADDAR-Bachir/HCiT
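The data flow described in the abstract (CNN feature map, then tokens, then self-attention) can be illustrated with a small sketch. This is not the authors' implementation: the toy feature map, the function names, and the use of identity query/key/value projections with a single head are all assumptions made to keep the example in plain Python. A real ViT would use learned projections, multiple heads, positional embeddings, and a classification head.

```python
import math

def feature_map_to_tokens(feature_map):
    """Flatten an H x W x C feature map (nested lists) into H*W tokens of size C."""
    return [list(cell) for row in feature_map for cell in row]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q, K and V are the tokens themselves (identity projections),
    which a trained ViT would replace with learned linear maps.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this token to every token in the frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all tokens: each output mixes global context.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, tokens)) for c in range(d)])
    return outputs, weights

# Hypothetical 2 x 2 feature map with 3 channels, standing in for CNN output.
feature_map = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
tokens = feature_map_to_tokens(feature_map)
attended, weights = self_attention(tokens)
print(len(tokens), len(attended[0]))
```

Every attention weight is strictly positive, so each output token depends on all spatial positions of the feature map, which is the global dependency the abstract contrasts with a CNN's local receptive field.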

computer science, information systems, theory & methods, software engineering

What problem does this paper attempt to address?

This paper aims to address the issue of deepfake video detection. Specifically:

1. **Background**:
   - With the development of generative models (such as autoencoders and generative adversarial networks) and the availability of large-scale public datasets, the generation of deepfake videos has become increasingly easy and realistic.
   - The number of deepfake videos has surged and is being used for negative purposes, thus there is an urgent need for effective deepfake detection methods.
2. **Limitations of existing methods**:
   - Existing deepfake detection methods, especially those based on convolutional neural networks (CNNs), have poor generalization capabilities when faced with unseen deepfake generation techniques.
   - This is because CNNs mainly focus on local spatial artifacts while ignoring the global dependencies across the entire input sequence.
3. **Proposed method**:
   - To overcome these limitations, the authors propose a hybrid architecture that combines convolutional neural networks (CNNs) and vision transformers (ViTs), called HCiT.
   - This method utilizes CNNs to extract meaningful local features and employs the self-attention mechanism of ViTs to learn discriminative global contextual dependencies in frame-level images.
4. **Objective**:
   - To improve detection accuracy and enhance generalization capabilities, making it perform well on datasets involving various deepfake generation techniques.

Through the above methods, the paper aims to develop a new technology capable of effectively detecting highly realistic deepfake videos.
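The contrast between local receptive fields and global dependencies can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: a 3x3 mean filter stands in for a single convolution layer, and a uniform average over all positions stands in for attention with equal weights. Neither is taken from the paper, and the function names are hypothetical.

```python
def conv3x3_mean(grid, r, c):
    """Output of a 3x3 mean filter at (r, c) with zero padding.

    Only the eight neighbours and the centre contribute, so the result
    depends on a local window of the input.
    """
    h, w = len(grid), len(grid[0])
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                total += grid[rr][cc]
    return total / 9.0

def global_mix(grid):
    """Uniform average over all positions (attention with equal weights)."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)

# A 5 x 5 input of zeros, and a copy with one change in the far corner.
grid = [[0.0] * 5 for _ in range(5)]
perturbed = [row[:] for row in grid]
perturbed[4][4] = 1.0

# Effect of the far-corner change on the output at position (0, 0).
local_change = conv3x3_mean(perturbed, 0, 0) - conv3x3_mean(grid, 0, 0)
global_change = global_mix(perturbed) - global_mix(grid)
print(local_change, global_change)
```

The local filter at the top-left corner does not see the change at the opposite corner, while the global mix does. Stacking convolution layers enlarges the receptive field, but a self-attention layer connects all positions in a single step.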

Deepfake Video Detection Using Convolutional Vision Transformer

DeepFake detection algorithm based on improved vision transformer

Deepfake Video Detection with Spatiotemporal Dropout Transformer

FakeFormer: Efficient Vulnerability-Driven Transformers for Generalisable Deepfake Detection

ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection

DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer

Spatio-temporal Features for Generalized Detection of Deepfake Videos

Deepfake Detection Scheme Based on Vision Transformer and Distillation

Deepfake detection: Enhancing performance with spatiotemporal texture and deep learning feature fusion

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

Adt: anti-deepfake transformer

Deepfake detection using convolutional vision transformers and convolutional neural networks

Cross-Forgery Analysis of Vision Transformers and CNNs for Deepfake Image Detection

Hybrid Transformer Network for Deepfake Detection

Multiclass AI-Generated Deepfake Face Detection Using Patch-Wise Deep Learning Model

Transformer-based cascade networks with spatial and channel reconstruction convolution for deepfake detection

Improving Video Vision Transformer for Deepfake Video Detection Using Facial Landmark, Depthwise Separable Convolution and Self Attention

Tex-ViT: A Generalizable, Robust, Texture-based dual-branch cross-attention deepfake detector

Combining EfficientNet and Vision Transformers for Video Deepfake Detection

DeepFake detection method based on multi-scale interactive dual-stream network