Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Jingxuan Wei,Cheng Tan,Zhangyang Gao,Linzhuang Sun,Siyuan Li,Bihui Yu,Ruifeng Guo,Stan Z. Li
DOI: https://doi.org/10.1007/s00521-024-10310-2
2024-08-18
Neural Computing and Applications
Abstract:Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, primarily focused on multimodal scientific questions and explanations from elementary and high school textbooks, exhibits limitations in providing a comprehensive evaluation across a broader spectrum of open-domain questions. To address this gap, we introduce the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a comprehensive collection of open-ended questions, rationales, and answers derived from the COCO dataset. Unlike previous datasets that rely on multiple-choice questions, our dataset utilizes open-ended questions to more effectively challenge and assess CoT models' reasoning capabilities. Through comprehensive evaluations and detailed analyses, we demonstrate that our multihop cross-modal attention and sentence-level contrastive learning modules, designed to simulate human thought processes, significantly enhance model comprehension abilities. Experiments confirm the proposed dataset and techniques, showing their potential to advance multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR.
computer science, artificial intelligence
What problem does this paper attempt to address?