MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs

Xuannan Liu, Zekun Li, Peipei Li, Shuhan Xia, Xing Cui, Linzhi Huang, Huaibo Huang, Weihong Deng, Zhaofeng He
2024-08-21
Abstract: Current multimodal misinformation detection (MMD) methods often assume a single source and type of forgery for each sample, which is insufficient for real-world scenarios where multiple forgery sources coexist. The lack of a benchmark for mixed-source misinformation has hindered progress in this field. To address this, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench includes 3 critical sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, along with 12 sub-categories of misinformation forgery types. We further conduct an extensive evaluation of 6 prevalent detection methods and 15 large vision-language models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting. Additionally, we propose an innovative unified framework, which integrates rationales, actions, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization. We believe this study will catalyze future research into more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
The paper addresses a limitation of current multimodal misinformation detection (MMD) methods: they typically assume that each sample has a single forgery source and type, which does not hold in the real world, where multiple forgery sources often coexist. This single-source assumption limits existing methods in complex, realistic scenarios, and the lack of a benchmark for mixed-source misinformation has further hindered progress in the field.

To tackle these challenges, the authors propose **MMFakeBench**, the first comprehensive benchmark for mixed-source MMD. It covers three key sources of misinformation (textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion) and 12 sub-categories of forgery types. Using this benchmark, the authors ran zero-shot evaluations of 6 existing detection methods and 15 large vision-language models (LVLMs), and found that current methods perform poorly on the mixed-source task.

The authors also propose a unified framework, **MMD-Agent**, which integrates the reasoning, action, and tool-use capabilities of LVLM agents and significantly improves detection accuracy and generalization. MMD-Agent decomposes mixed-source detection into three stages: a textual veracity check, a visual veracity check, and cross-modal consistency reasoning, so that each source of distortion is examined systematically.

In summary, the main contributions of the paper are:

1. Introducing mixed-source multimodal misinformation detection, which removes the single-source limitation and moves the task closer to practical use.
2. Developing MMFakeBench, the first benchmark for evaluating mixed-source MMD.
3. Benchmarking 6 popular detection methods and 15 LVLMs on the newly collected dataset.
4. Proposing the MMD-Agent framework, which significantly outperforms existing methods and LVLMs and provides new baselines for future research.
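
The three-stage decomposition described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` type, the `detect` function, the stub checkers, and the early-exit ordering (stop at the first stage that flags a distortion) are all assumptions made for illustration. In a real system each checker would presumably wrap an LVLM prompt or an external tool call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Sample:
    """Hypothetical image-text pair; field names are illustrative only."""
    text: str
    image_path: str


# A checker returns True when it finds a distortion at its stage.
Checker = Callable[[Sample], bool]

# Labels follow the three sources named in the summary, in the stated order.
STAGE_LABELS = (
    "textual veracity distortion",
    "visual veracity distortion",
    "cross-modal consistency distortion",
)


def detect(sample: Sample, checkers: Sequence[Checker]) -> str:
    """Run the three checks in order and report the first stage that fires.

    The early exit is an assumption of this sketch, not a detail
    confirmed by the summary.
    """
    if len(checkers) != len(STAGE_LABELS):
        raise ValueError("expected one checker per stage")
    for label, check in zip(STAGE_LABELS, checkers):
        if check(sample):
            return label
    return "real"


# Stub checkers standing in for LVLM or tool calls (purely hypothetical).
def text_is_false(sample: Sample) -> bool:
    return "miracle cure" in sample.text.lower()


def image_is_manipulated(sample: Sample) -> bool:
    return sample.image_path.endswith("_edited.jpg")


def pair_is_inconsistent(sample: Sample) -> bool:
    return False


if __name__ == "__main__":
    stubs = (text_is_false, image_is_manipulated, pair_is_inconsistent)
    print(detect(Sample("City opens new park", "park.jpg"), stubs))
    print(detect(Sample("City opens new park", "park_edited.jpg"), stubs))
```

Keeping the stages as interchangeable callables means each check can be swapped or evaluated in isolation, which mirrors the per-source breakdown the benchmark reports.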