Abstract: Explainability in artificial intelligence is crucial for restoring trust, particularly in areas like face forgery detection, where viewers often struggle to distinguish between real and fabricated content. Vision and Large Language Models (VLLM) bridge computer vision and natural language, offering numerous applications driven by strong common-sense reasoning. Despite their success in various tasks, the potential of vision and language remains underexplored in face forgery detection, where they hold promise for enhancing explainability by leveraging the intrinsic reasoning capabilities of language to analyse fine-grained manipulation areas. As such, there is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task to systematically and fairly evaluate these capabilities. Previous efforts for unified benchmarks in deepfake detection have focused on the simpler binary task, overlooking evaluation protocols for fine-grained detection and text-generative models. We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap. In the first stage, we assess the models' performance on the binary task and their sensitivity to given instructions using several prompts. In the second stage, we delve deeper into fine-grained detection by identifying areas of manipulation in a multiple-choice VQA setting. In the third stage, we convert the fine-grained detection to an open-ended question and compare several matching strategies for the multi-label classification task. Finally, we qualitatively evaluate the fine-grained responses of the VLLMs included in the benchmark. We apply our benchmark to several popular models, providing a detailed comparison of binary, multiple-choice, and open-ended VQA evaluation across seven datasets.
https://nickyfot.github.io/hitchhickersguide.github.io/

What problem does this paper attempt to address?

The paper aims to make deepfake detection more interpretable by using Vision and Large Language Models (VLLMs). Specifically, it focuses on the following points:

1. **Improving the interpretability of deepfake detection**: Traditional deepfake detection methods mainly rely on deep binary classifiers, which are often black boxes whose predictions are difficult to interpret. The paper proposes using the natural language generation capabilities of VLLMs to explain detection results through Visual Question Answering (VQA) tasks, thereby making the model more transparent and credible.
2. **Exploring the application of VLLMs in deepfake detection**: Although VLLMs perform well in other tasks, their application to deepfake detection is still limited. The paper evaluates VLLMs in this field by transforming deepfake detection into a VQA task, with particular attention to their potential in fine-grained detection.
3. **Establishing a systematic evaluation framework**: Existing evaluation methods mainly focus on the simple binary classification task and lack a comprehensive evaluation of multi-label fine-grained detection. The paper proposes a multi-stage evaluation protocol, covering binary classification, multiple-choice VQA, and open-ended VQA, to systematically evaluate different VLLM architectures on deepfake detection.
4. **Addressing the limitations of existing benchmarks**: Current benchmarks are mostly based on binary or multi-class classification and are not suited to multi-label fine-grained detection. The proposed benchmark is intended to evaluate VLLMs on deepfake detection fairly and comprehensively without requiring extensive manual annotation.

Through these efforts, the paper aims to advance research in the field of deepfake detection, particularly in improving the interpretability and generalization capabilities of models.
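
The multi-stage protocol described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the region labels in `REGIONS`, the prompt wording, the parsing rules, and the keyword-based matching are all assumptions chosen for illustration, and the paper compares several matching strategies that are not reproduced here. The sketch only shows how a binary question, a multiple-choice question, and an open-ended answer could each be mapped to labels and scored as a multi-label task.

```python
import re

# Hypothetical label set of facial regions; the benchmark's real labels may differ.
REGIONS = ["eyes", "nose", "mouth", "eyebrows", "skin"]


def binary_prompt():
    """Stage 1: an assumed wording for the binary real/fake question."""
    return "Is this image a manipulated (fake) face? Answer yes or no."


def multiple_choice_prompt(regions=REGIONS):
    """Stage 2: list the candidate regions as lettered options."""
    options = "\n".join(
        f"({chr(ord('A') + i)}) {name}" for i, name in enumerate(regions)
    )
    return "Which facial areas appear manipulated? Select all that apply.\n" + options


def parse_binary(answer):
    """Return True (fake), False (real), or None if the answer is unparseable."""
    words = re.findall(r"[a-z]+", answer.lower())
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


def parse_multiple_choice(answer, regions=REGIONS):
    """Read parenthesised option letters such as '(A)' into a 0/1 vector."""
    letters = set(re.findall(r"\(([A-Z])\)", answer))
    return [int(chr(ord("A") + i) in letters) for i in range(len(regions))]


def match_open_ended(answer, regions=REGIONS):
    """Stage 3: naive whole-word keyword matching of a free-text answer."""
    text = answer.lower()
    return [
        int(re.search(rf"\b{re.escape(name)}\b", text) is not None)
        for name in regions
    ]


def f1(pred, gold):
    """F1 score between two 0/1 label vectors for a single sample."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [1, 0, 1, 1, 0]  # made-up ground truth: eyes, mouth, eyebrows
    pred = match_open_ended("The mouth and eyes look blended.")
    print(pred, f1(pred, gold))
```

Keyword matching is the simplest possible choice and misses synonyms and singular forms (for example "eye" or "lips"), which is one reason a comparison of matching strategies for open-ended answers is useful.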

A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning
