MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue

2024-12-07

Abstract: Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.

Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the limited reasoning ability of existing open-source multimodal large language models (MLLMs). Current multimodal instruction-tuning datasets are mostly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets typically target simple tasks and provide only phrase-level answers, with no intermediate reasoning steps. As a result, they do little to elicit chain-of-thought (CoT) reasoning, which limits model performance on tasks that require complex reasoning. To address this, the authors propose a simple, scalable, and cost-effective method for constructing a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, they create a dataset of 12 million instruction-response pairs that covers diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments show that MLLMs trained on this dataset improve significantly on reasoning benchmarks, reaching state-of-the-art performance on MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). The model also improves by up to 4% on non-reasoning benchmarks.

mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models

MAmmoTH2: Scaling Instructions from the Web

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Dual Instruction Tuning with Large Language Models for Mathematical Reasoning

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Multimodal Chain-of-Thought Reasoning in Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

System-2 Mathematical Reasoning via Enriched Instruction Tuning

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Improve Vision Language Model Chain-of-thought Reasoning

SVIT: Scaling up Visual Instruction Tuning