Abstract: In recent years, deepfakes (DFs) have been used for malicious purposes, such as individual impersonation, misinformation spreading, and imitation of artists' styles, raising ethical and security concerns. However, existing surveys have focused on the detection accuracy of passive DF detection approaches for single modalities, such as image, video, or audio. This comprehensive survey explores passive approaches across multiple modalities, including image, video, audio, and multi-modal domains, and extends the discussion beyond detection accuracy to include generalization, robustness, attribution, and interpretability. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and different levels of adversary knowledge and capabilities. We also highlight current challenges in DF detection, including the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. Finally, we propose future research directions that address these unexplored and emerging issues in the field of passive DF detection, such as adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.

What problem does this paper attempt to address?

The main problem this paper addresses is the passive detection of deepfakes (DFs) across multiple modalities: image, video, audio, and multi-modal content. Specifically:

1. **Ethical and Security Issues**: In recent years, deepfake technology has been used for malicious purposes, such as impersonating individuals, spreading false information, and imitating artists' styles without authorization, which raises serious ethical and security concerns. Effective detection methods are therefore needed to identify such synthetic content.
2. **Limitations of Existing Research**: Existing surveys mainly cover passive DF detection methods for a single modality (such as image, video, or audio), and most of them consider only detection accuracy. This survey instead covers multiple modalities and evaluates detection methods on generalization, robustness, attribution, and interpretability as well.
3. **Discussion of Threat Models**: The paper also discusses threat models for passive DF detection, including potential adversarial strategies and different levels of adversary knowledge and capabilities. This helps clarify the security and limitations of current detection methods.
4. **Current Challenges and Future Directions**: The paper identifies the main challenges currently facing DF detection, such as the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. The authors also propose future research directions, such as adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.

### Specific Objectives

- **Cross-Modal Detection**: Explore and summarize passive DF detection methods in the image, video, audio, and multi-modal domains.
- **Beyond Accuracy**: Go beyond detection accuracy to examine generalization, robustness, attribution, and interpretability.
- **Threat Modeling**: Define and analyze threat models for passive DF detection, and provide insight into security considerations.
- **Addressing Challenges**: Propose future research directions that address the current challenges in DF detection.

Through these efforts, the paper aims to give academia and industry a comprehensive perspective for understanding and responding to the challenges posed by deepfakes.

Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey

Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights

Deep Learning for Deepfakes Creation and Detection: A Survey

A Contemporary Survey on Deepfake Detection: Datasets, Algorithms, and Challenges

Evolving from Single-modal to Multi-modal Facial Deepfake Detection: A Survey

Combating deepfakes: a comprehensive multilayer deepfake video detection framework

Deepfake Attacks: Generation, Detection, Datasets, Challenges, and Research Directions

A Comprehensive Survey on Deepfake Methods: Generation, Detection, and Applications

Deepfake forensics: a survey of digital forensic methods for multimodal deepfake identification on social media

SoK: Facial Deepfake Detectors

The Tug-of-War Between Deepfake Generation and Detection

A Multimodal Framework for Deepfake Detection

Deepfake Detection: A Comprehensive Survey from the Reliability Perspective

A Survey of Deepfake Detection Methods: Innovations, Accuracy, and Future Directions

Deep Learning Technology for Face Forgery Detection: A Survey

Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward

Deepfake video detection: challenges and opportunities

Multi-feature fusion based face forgery detection with local and global characteristics

DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer

Deepfake Generation and Detection: A Benchmark and Survey

Multimodal Deepfake Detection for Short Videos