MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
2024-12-07
Abstract: Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited reasoning ability of existing open-source multimodal large language models (MLLMs). Current multimodal instruction-tuning datasets are mostly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simple tasks and provide only phrase-level answers, with no intermediate rationales. As a result, they do little to elicit chain-of-thought (CoT) reasoning, which limits model performance on tasks that require complex reasoning. To solve this problem, the authors propose a scalable and cost-effective method for constructing a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, they create a dataset of 12 million instruction-response pairs that covers diverse, reasoning-intensive tasks and provides detailed and faithful rationales. Experiments show that MLLMs trained on this dataset significantly improve their reasoning ability, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). The model also improves by up to 4% on non-reasoning benchmarks. Ablation studies indicate that rewriting and self-filtering are key components of the dataset construction process.