Towards Benchmarking and Evaluating Deepfake Detection

Chenhao Lin, Jingyi Deng, Pengbin Hu, Chao Shen, Qian Wang, Qi Li
2024-03-13
Abstract: Deepfake detection automatically recognizes manipulated media by analyzing the differences between manipulated and non-altered videos. It is natural to ask which are the top performers among the existing deepfake detection approaches to identify promising research directions and provide practical guidance. Unfortunately, it's difficult to conduct a sound benchmarking comparison of existing detection approaches using the results in the literature because evaluation conditions are inconsistent across studies. Our objective is to establish a comprehensive and consistent benchmark, to develop a repeatable evaluation procedure, and to measure the performance of a range of detection approaches so that the results can be compared soundly. A challenging dataset consisting of manipulated samples generated by more than 13 different methods has been collected, and 11 popular detection approaches (9 algorithms) from the existing literature have been implemented and evaluated with 6 fair-minded and practical evaluation metrics. Finally, 92 models have been trained and 644 experiments have been performed for the evaluation. The results, along with the shared data and evaluation methodology, constitute a benchmark for comparing deepfake detection approaches and measuring progress.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper attempts to solve is the lack of fair, comprehensive, and consistent benchmarks for current deepfake detection methods. Specifically, the authors point out the following key issues:

1. **Unfair comparison**: Existing deepfake detection methods are usually trained on different training datasets but evaluated on the same test dataset. This makes comparisons unfair, because the performance of a detection model depends heavily on its training data.
2. **Overfitting and poor generalization ability**: Many existing detection methods are trained and evaluated on datasets with the same distribution, so their performance declines significantly on different distributions or in real-world scenarios. This indicates that these methods overfit and generalize poorly.
3. **Incomplete evaluation metrics**: Commonly used evaluation metrics such as AUC (Area Under the Curve) and accuracy are not sufficient to comprehensively reflect the actual performance of detection methods. In particular, these metrics ignore factors that matter in practical applications, such as time complexity and space complexity.

To address these issues, the paper proposes a fair, comprehensive, and strict benchmark framework. This framework includes the following aspects:

- **Standard datasets**: Integrates 7 popular datasets to ensure data diversity and representativeness.
- **Imperceptible and Diverse Test (ID) test set**: Constructs a high-quality test set that contains imperceptible forged samples to simulate real-world challenges.
- **Multiple evaluation metrics**: In addition to the traditional AUC and accuracy, four supplementary metrics are introduced to evaluate the robustness, efficiency, and practicality of the model.
- **Re-implementation and evaluation of existing methods**: Re-implements 11 popular deepfake detection methods and evaluates them in a unified experimental environment.

Through these measures, the paper aims to provide a reliable benchmark in the field of Deepfake detection, helping to identify current best practices and guide future research directions.
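
To make the metrics mentioned above concrete, here is a minimal sketch of how accuracy and AUC can be computed from a detector's scores, together with a simple per-sample latency measurement as one possible way to look at time cost. It is not the paper's implementation. The function names (`accuracy`, `auc`, `mean_latency`), the 0.5 threshold, the label convention (1 = fake, 0 = real), and the toy scores are all assumptions made for illustration, and the latency measurement is not necessarily one of the paper's supplementary metrics.

```python
import time
from typing import Callable, Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label.

    Assumed convention: label 1 = fake, 0 = real; score >= threshold => fake.
    """
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random fake outscores a random real.

    Ties between a fake and a real score count as half a win.
    This pairwise form is O(n_fake * n_real), fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one fake and one real sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_latency(predict: Callable[[object], object],
                 samples: Sequence[object], repeats: int = 3) -> float:
    """Average wall-clock seconds per prediction (hypothetical efficiency probe)."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in samples:
            predict(x)
    return (time.perf_counter() - start) / (repeats * len(samples))


if __name__ == "__main__":
    # Toy data, invented for this example only.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.1]
    print("accuracy:", accuracy(labels, scores))
    print("AUC:", auc(labels, scores))
    print("latency (s):", mean_latency(lambda x: x, labels))
```

Accuracy depends on the chosen threshold while AUC does not, which is one reason the two can disagree on the same set of scores. Neither says anything about inference cost, which is why a separate measurement such as the latency probe is needed to capture efficiency.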