Abstract: In recent years, deepfakes (DFs) have been utilized for malicious purposes, such as individual impersonation, misinformation spreading, and artists' style imitation, raising ethical and security concerns. However, existing surveys have focused on the accuracy of passive DF detection approaches for single modalities, such as image, video, or audio. This comprehensive survey explores passive approaches across multiple modalities, including image, video, audio, and multi-modal domains, and extends the discussion beyond detection accuracy to include generalization, robustness, attribution, and interpretability. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and different levels of adversary knowledge and capabilities. We also highlight current challenges in DF detection, including the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. Finally, we propose future research directions that address these unexplored and emerging issues in the field of passive DF detection, such as adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.
What problem does this paper attempt to address?
The main problem this paper addresses is the passive detection of deepfakes (DFs) across multiple modalities: image, video, audio, and multi-modal content. Specifically:
1. **Ethical and Security Issues**: In recent years, deepfake technology has been used for malicious purposes, such as impersonating individuals, spreading misinformation, and imitating artists' styles, which raises serious ethical and security concerns. Effective detection methods are therefore needed to identify such synthetic content.
2. **Limitations of Existing Research**: Existing surveys mainly cover passive DF detection methods for a single modality (such as image, video, or audio), and most of them consider only detection accuracy. This survey instead spans multiple modalities and extends the evaluation to generalization, robustness, attribution, and interpretability.
3. **Discussion of Threat Models**: The paper also discusses threat models for passive DF detection, including potential adversarial strategies and different levels of attacker knowledge and capabilities. This helps to understand the security and limitations of current detection methods.
4. **Current Challenges and Future Directions**: The paper identifies the main challenges currently facing DF detection: the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. The authors also propose future research directions, such as adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.
### Specific Objectives
- **Cross-Modal Detection**: Explore and summarize passive DF detection methods for image, video, audio, and multi-modal content.
- **Beyond Accuracy**: Look past detection accuracy to examine generalization, robustness, attribution, and interpretability.
- **Threat Modeling**: Define and analyze threat models for passive DF detection, and provide insights into security considerations.
- **Addressing Challenges**: Propose future research directions that address the current challenges in DF detection.
Through these efforts, the paper aims to give academia and industry a comprehensive perspective for understanding and addressing the challenges posed by deepfakes.