Abstract: Facial manipulation technologies have advanced rapidly, driven in particular by Generative Adversarial Networks (GANs) and Stable Diffusion-based methods, and this paper addresses the resulting problem of deepfake content creation. As these tools become more accessible, robust detection methods are needed to curb potential misuse. This paper therefore investigates Vision Transformers (ViTs) for deepfake image detection, leveraging their capacity to extract global features.

Objective: The primary goal of this study is to assess the viability of ViTs for multiclass deepfake image detection compared with traditional Convolutional Neural Network (CNN)-based models. Framing deepfake detection as a multiclass task brings in the challenges posed by both Stable Diffusion and StyleGAN2, with the aim of improving the understanding and efficacy of manipulated-content detection in a multiclass context.

Novelty: This research distinguishes itself by approaching deepfake detection as a multiclass task, which introduces new challenges associated with Stable Diffusion and StyleGAN2. It pioneers the exploration of ViTs in this domain, emphasizing their potential to extract global features for improved detection accuracy, and it addresses the evolving landscape of deepfake creation and manipulation.

Results and Conclusion: In extensive experiments, the proposed method proves highly effective, achieving high detection accuracy, precision, and recall, and an F1 score of 99.90% on a dataset prepared for multiclass classification. The proposed model also outperforms the state-of-the-art CNN-based models it was compared with, namely ResNet-50 and VGG-16. These results underscore the potential of ViTs to contribute to a more secure digital landscape by robustly detecting deepfake content, particularly content generated with Stable Diffusion and StyleGAN2.
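
For readers less familiar with the metrics reported above, the following is a minimal sketch of how accuracy, precision, recall, and F1 can be computed for a multiclass detector. The three class labels and the choice of macro averaging are assumptions made for illustration; the abstract does not state the exact label set or the averaging scheme used in the paper.

```python
from collections import Counter

# Hypothetical label set: the abstract names Stable Diffusion and StyleGAN2
# as generators; the "real" class is assumed here for illustration.
CLASSES = ["real", "stable_diffusion", "stylegan2"]


def macro_metrics(y_true, y_pred, classes=CLASSES):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging (an assumption, not confirmed by the paper) computes
    each metric per class and then takes the unweighted mean over classes.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class receives a false positive
            fn[truth] += 1   # true class receives a false negative

    precisions, recalls, f1s = [], [], []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        r = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    y_true = ["real", "real", "stable_diffusion",
              "stable_diffusion", "stylegan2", "stylegan2"]
    y_pred = ["real", "real", "stable_diffusion",
              "stylegan2", "stylegan2", "stylegan2"]
    print(macro_metrics(y_true, y_pred))
```

With macro averaging, every class contributes equally regardless of how many images it contains, so a detector that fails on one generator cannot hide behind strong results on the others.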