Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

Jingxuan Wei, Cheng Tan, Zhangyang Gao, Linzhuang Sun, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li
2023-09-25
Abstract: Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches. To address this gap, we present COCO Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions, rationales, and answers derived from the large object dataset COCO. Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the context of multimodal CoT, introducing a more challenging problem that effectively assesses the reasoning capability of CoT models. Through comprehensive evaluations and detailed analyses, we provide valuable insights and propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders. Extensive experiments demonstrate the efficacy of the proposed dataset and techniques, offering novel perspectives for advancing multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR.
Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses several limitations in multimodal reasoning research and proposes a new dataset and framework to advance the area. Specifically:

1. **Problems with existing datasets**: The existing ScienceQA dataset focuses on multimodal scientific questions and their explanations, but it is small in scale, relies on multiple-choice questions, and is confined to scientific reasoning.
2. **New dataset COCO-MMR**: To overcome these limitations, the researchers built the COCO-MMR dataset, which is roughly three times the size of ScienceQA and contains approximately 62,000 open-ended questions, rationales, and answers. The data are generated from the large object dataset COCO and cover everyday scenes rather than only scientific questions.
3. **New framework Enigma-COT**: The paper also proposes a framework called Enigma-COT, which introduces two techniques: a multi-hop cross-modal attention mechanism and sentence-level contrastive learning. Both aim to strengthen the image and text encoders and thereby improve performance on multimodal reasoning tasks.

Through extensive experiments, the researchers demonstrate the effectiveness of the new dataset and methods, offering insights and techniques for further work on multimodal reasoning.
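To make the idea of sentence-level contrastive learning more concrete, here is a minimal sketch of a generic InfoNCE-style loss over sentence embeddings. This summary does not give the paper's exact formulation, so everything here is an assumption for illustration: the function names `cosine` and `info_nce`, the cosine similarity, and the `temperature` value are hypothetical and are not taken from Enigma-COT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of sentence embeddings.

    anchors[i] is expected to match positives[i]; every other row of
    `positives` acts as an in-batch negative for anchors[i].
    NOTE: a generic contrastive objective, assumed for illustration only.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching sentence.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: aligned pairs should give a lower loss than swapped pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned_loss` is close to zero and `swapped_loss` is large, which is the behavior a contrastive objective is meant to reward: matching sentence pairs are pulled together and mismatched pairs are pushed apart.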