Abstract: Deepfake detection automatically recognizes manipulated media by analyzing the differences between manipulated and unaltered videos. It is natural to ask which existing deepfake detection approaches perform best, both to identify promising research directions and to provide practical guidance. Unfortunately, it is difficult to conduct a sound comparison of existing detection approaches using results from the literature, because evaluation conditions are inconsistent across studies. Our objective is to establish a comprehensive and consistent benchmark, develop a repeatable evaluation procedure, and measure the performance of a range of detection approaches so that the results can be compared soundly. A challenging dataset of manipulated samples generated by more than 13 different methods has been collected, and 11 popular detection approaches (9 algorithms) from the existing literature have been implemented and evaluated with 6 fair and practical evaluation metrics. In total, 92 models have been trained and 644 experiments have been performed for the evaluation. The results, along with the shared data and evaluation methodology, constitute a benchmark for comparing deepfake detection approaches and measuring progress.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is the lack of a fair, comprehensive, and consistent benchmark for current deepfake detection methods. Specifically, the authors point out the following key issues:

1. **Unfair comparison**: Existing deepfake detection methods are usually trained on different training datasets but evaluated on the same test dataset. This makes comparisons unfair, because a detection model's performance depends heavily on its training data.
2. **Overfitting and poor generalization ability**: Many existing detection methods are trained and evaluated on datasets with the same distribution, so their performance declines significantly on different distributions or in real-world scenarios. This indicates that these methods overfit and generalize poorly.
3. **Incomplete evaluation metrics**: Commonly used evaluation metrics such as AUC (Area Under the Curve) and accuracy are not sufficient to reflect the actual performance of detection methods. In particular, these metrics ignore factors that matter in practical applications, such as time complexity and space complexity.

To address these issues, the paper proposes a fair, comprehensive, and strict benchmarking framework. This framework includes the following aspects:

- **Standard datasets**: Integrates 7 popular datasets to ensure data diversity and representativeness.
- **Imperceptible and Diverse Test (ID) test set**: Constructs a high-quality test set that contains imperceptible forged samples to simulate real-world challenges.
- **Multiple evaluation metrics**: In addition to the traditional AUC and accuracy, four supplementary metrics are introduced to evaluate the robustness, efficiency, and practicality of the model.
- **Re-implementation and evaluation of existing methods**: Re-implements 11 popular deepfake detection methods and evaluates them in a unified experimental environment.

Through these measures, the paper aims to provide a reliable benchmark in the field of deepfake detection, helping to identify current best practices and guide future research directions.
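
To make the idea of a unified evaluation concrete, the following is a minimal sketch, not the paper's actual code, of scoring any detector with the same metrics under the same conditions. It computes AUC and accuracy and, as a stand-in for the efficiency-style metrics mentioned above, the average inference time per sample. The names `roc_auc`, `accuracy`, `evaluate`, and the toy detector are hypothetical, and the 0.5 decision threshold and the label convention (1 = fake, 0 = real) are assumptions.

```python
import time


def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one real and one fake sample")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at an assumed fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def evaluate(detector, samples, labels):
    """Score one detector on one test set with identical metrics and timing."""
    start = time.perf_counter()
    scores = [detector(x) for x in samples]
    elapsed = time.perf_counter() - start
    return {
        "auc": roc_auc(labels, scores),
        "accuracy": accuracy(labels, scores),
        "seconds_per_sample": elapsed / len(samples),
    }


if __name__ == "__main__":
    # Hypothetical detector: each "sample" is already a fake-probability score.
    toy_detector = lambda x: x
    print(evaluate(toy_detector, [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Running every detector through one function like `evaluate` on the same test set is what keeps the resulting numbers comparable; a real benchmark would also fix the training data and preprocessing, which this sketch leaves out.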

Towards Benchmarking and Evaluating Deepfake Detection
