Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework

Jingxuan Wei, Cheng Tan, Zhangyang Gao, Linzhuang Sun, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li

2023-09-25

Abstract: Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence, especially when tackling complex tasks. While the chain-of-thought (CoT) technique has gained considerable attention, the existing ScienceQA dataset, which focuses on multimodal scientific questions and explanations from elementary and high school textbooks, lacks a comprehensive evaluation of diverse approaches. To address this gap, we present the COCO Multi-Modal Reasoning (COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions, rationales, and answers derived from the large object dataset COCO. Unlike previous datasets that rely on multiple-choice questions, our dataset pioneers the use of open-ended questions in the context of multimodal CoT, introducing a more challenging problem that effectively assesses the reasoning capability of CoT models. Through comprehensive evaluations and detailed analyses, we provide valuable insights and propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders. Extensive experiments demonstrate the efficacy of the proposed dataset and techniques, offering novel perspectives for advancing multimodal reasoning. The data and code are available at https://github.com/weijingxuan/COCO-MMR.

Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address several key issues in the field of multimodal reasoning and proposes a new dataset and framework to advance this area. Specifically:

1. **Problems with existing datasets**: The existing ScienceQA dataset, while focusing on multimodal scientific questions and their explanations, has limitations such as small scale, reliance on multiple-choice questions, and being confined to scientific question reasoning.
2. **New dataset COCO-MMR**: To overcome these limitations, the researchers developed the COCO-MMR dataset, which is three times larger than the ScienceQA dataset, containing approximately 62,000 open-ended questions, rationales, and answers. These data are generated from the large object dataset COCO, covering everyday life scenes rather than just scientific questions.
3. **New framework Enigma-COT**: In addition to creating a new dataset, the paper also proposes a new framework called Enigma-COT, which includes two innovative techniques: a multi-hop cross-modal attention mechanism and sentence-level contrastive learning. These techniques aim to enhance the capabilities of image and text encoders, thereby improving the model's performance on multimodal reasoning tasks.

Through extensive experimental validation, the researchers demonstrated the effectiveness of the new dataset and methods, providing valuable insights and technical support for the further development of the multimodal reasoning field.
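The two techniques named in the summary can be illustrated with a small, dependency-free sketch. This is not the paper's implementation: the summary gives no architectural details, so everything below is an assumption made for illustration. "Multi-hop cross-modal attention" is taken to mean repeated scaled dot-product attention from text vectors to image vectors with a residual update, with learned projections omitted. "Sentence-level contrastive learning" is taken to mean an InfoNCE-style loss over paired sentence embeddings. All function names and parameters (`hops`, `temperature`) are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(text, image):
    # One hop: each text vector (query) attends over all image vectors
    # (used as both keys and values). Assumes text and image vectors share
    # the same dimension; learned projections are omitted (assumption).
    scale = math.sqrt(len(image[0]))
    updated = []
    for q in text:
        weights = softmax([dot(q, k) / scale for k in image])
        attended = [sum(w * v[i] for w, v in zip(weights, image))
                    for i in range(len(q))]
        # Residual update: keep the text signal, add visual context.
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated

def multi_hop_cross_attention(text, image, hops=2):
    # Repeat the hop so later queries are already image-conditioned.
    for _ in range(hops):
        text = cross_attention(text, image)
    return text

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    # InfoNCE-style loss: anchors[i] should match positives[i]; every
    # other positive in the batch serves as a negative.
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        probs = softmax([dot(ai, pj) / temperature for pj in p])
        loss -= math.log(probs[i])
    return loss / len(a)

if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]               # two toy token vectors
    image = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]  # three toy patch vectors
    fused = multi_hop_cross_attention(text, image, hops=2)
    print(len(fused), len(fused[0]))
    print(info_nce(text, text) < info_nce(text, list(reversed(text))))
```

In this sketch the fused output keeps the shape of the text input, and the contrastive loss is lower when each sentence embedding is paired with its own match than when the pairs are swapped.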

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Towards a Unified Multimodal Reasoning Framework

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

MEmoR: A Dataset for Multimodal Emotion Reasoning in Videos

Multimodal Chain-of-Thought Reasoning in Language Models

MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps

M^3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

A Survey on Interpretable Cross-modal Reasoning

Towards Robust Multi-Modal Reasoning via Model Selection

VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark