A Hitchhiker's Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning

Niki Maria Foteinopoulou, Enjie Ghorbel, Djamila Aouada
2024-10-31
Abstract: Explainability in artificial intelligence is crucial for restoring trust, particularly in areas like face forgery detection, where viewers often struggle to distinguish between real and fabricated content. Vision and Large Language Models (VLLM) bridge computer vision and natural language, offering numerous applications driven by strong common-sense reasoning. Despite their success in various tasks, the potential of vision and language remains underexplored in face forgery detection, where they hold promise for enhancing explainability by leveraging the intrinsic reasoning capabilities of language to analyse fine-grained manipulation areas. As such, there is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task to systematically and fairly evaluate these capabilities. Previous efforts for unified benchmarks in deepfake detection have focused on the simpler binary task, overlooking evaluation protocols for fine-grained detection and text-generative models. We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap. In the first stage, we assess the models' performance on the binary task and their sensitivity to given instructions using several prompts. In the second stage, we delve deeper into fine-grained detection by identifying areas of manipulation in a multiple-choice VQA setting. In the third stage, we convert the fine-grained detection to an open-ended question and compare several matching strategies for the multi-label classification task. Finally, we qualitatively evaluate the fine-grained responses of the VLLMs included in the benchmark. We apply our benchmark to several popular models, providing a detailed comparison of binary, multiple-choice, and open-ended VQA evaluation across seven datasets.
https://nickyfot.github.io/hitchhickersguide.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to make deepfake detection more interpretable by using Vision and Large Language Models (VLLMs). Specifically, it focuses on the following points:

1. **Improving the interpretability of deepfake detection**: Traditional deepfake detection methods rely mainly on deep binary classifiers, which are often black boxes whose predictions are difficult to interpret. The paper proposes using the natural language generation capabilities of VLLMs to explain detection results through Visual Question Answering (VQA) tasks, thereby making the model more transparent and credible.
2. **Exploring the application of VLLMs in deepfake detection**: Although VLLMs perform well on other tasks, their use in deepfake detection is still limited. The paper evaluates VLLMs in this field by recasting deepfake detection as a VQA task, with particular attention to their potential for fine-grained detection.
3. **Establishing a systematic evaluation framework**: Existing evaluation methods focus mainly on the simple binary classification task and lack a comprehensive evaluation for multi-label fine-grained detection. The paper proposes a multi-stage evaluation protocol, covering binary classification, multiple-choice VQA, and open-ended VQA, to systematically evaluate different VLLM architectures on deepfake detection.
4. **Addressing the limitations of existing benchmarks**: Current benchmarks are mostly built on binary or multi-class classification tasks and are not suited to multi-label fine-grained detection. The benchmark proposed in the paper is intended to evaluate VLLMs on deepfake detection fairly and comprehensively without requiring extensive manual annotation.

Through these efforts, the paper aims to advance research in the field of Deepfake detection, particularly in improving the interpretability and generalization capabilities of models.
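
To make the multi-stage protocol more concrete, the following is a minimal Python sketch of how the three question types and a multi-label matching step might look. Everything in it is an illustrative assumption rather than the authors' implementation: the label set `AREAS`, the prompt wording in `PROMPTS`, and the helper names `match_labels` and `sample_f1` are hypothetical, and whole-word keyword matching is only one simple stand-in for the "several matching strategies" the abstract mentions.

```python
import re

# Hypothetical label set of manipulation areas; the benchmark's real labels may differ.
AREAS = ["eyes", "nose", "mouth", "hair"]

# Hypothetical prompt wording for the three stages described in the abstract.
PROMPTS = {
    "binary": "Is this face image real or manipulated? Answer 'real' or 'fake'.",
    "multiple_choice": "Which areas of the face are manipulated? Options: "
    + ", ".join(AREAS),
    "open_ended": "Describe which areas of the face, if any, have been manipulated.",
}


def match_labels(response):
    """Map a free-text answer to a set of labels by whole-word keyword matching."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    return {area for area in AREAS if area in words}


def sample_f1(predicted, truth):
    """Per-sample F1 between predicted and ground-truth label sets."""
    if not predicted and not truth:
        return 1.0  # both empty: treat as perfect agreement
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    answer = "The eyes and mouth look blurred."  # stand-in for a model response
    predicted = match_labels(answer)
    print(sorted(predicted), sample_f1(predicted, {"eyes", "nose"}))
```

In this sketch, an open-ended answer is reduced to a label set so that it can be scored with the same multi-label metric as a multiple-choice answer. Exact keyword matching misses synonyms and paraphrases, which is one reason a benchmark might compare several matching strategies.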