MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, Zhi Tang
2024-08-30
Abstract: The rapid development of large language models (LLMs) has spurred extensive research into their domain-specific capabilities, particularly mathematical reasoning. However, most open-source LLMs focus solely on text-based mathematical reasoning and neglect the integration of visual information, even though many mathematical tasks rely on visual inputs such as geometric diagrams, charts, and function plots. To fill this gap, we introduce MultiMath-7B, a multimodal large language model that bridges math and vision. MultiMath-7B is trained through a four-stage process, focusing on vision-language alignment, visual and math instruction-tuning, and process-supervised reinforcement learning. We also construct a novel, diverse, and comprehensive multimodal mathematical dataset, MultiMath-300K, which spans K-12 levels with image captions and step-wise solutions. MultiMath-7B achieves state-of-the-art (SOTA) performance among open-source models on existing multimodal mathematical benchmarks and also excels on text-only mathematical benchmarks. Our model and dataset are available at https://github.com/pengshuai-rin/MultiMath.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### The Problem the Paper Aims to Solve

This paper addresses the limitations of current large language models (LLMs) in mathematical reasoning, particularly their weak performance when a task depends on visual inputs. Specifically:

1. **Insufficient Multimodal Mathematical Reasoning Ability**: Most open-source LLMs focus on text-only mathematical reasoning and neglect visual inputs, even though many mathematical tasks depend on visual information such as geometric diagrams, charts, and function plots.
2. **Limitations of Existing Models**: Specialized mathematical LLMs (e.g., G-LLaVA) perform well on geometric problems but poorly on other multimodal mathematical reasoning tasks, while other models (e.g., Math-LLaVA) show unsatisfactory performance on text-only mathematical reasoning tasks.

To address these issues, the researchers propose MultiMath-7B, a multimodal large language model designed to handle a broad range of mathematical reasoning tasks. The paper also constructs a new multimodal mathematical dataset, MultiMath-300K, for training and evaluating the model. MultiMath-7B achieves state-of-the-art performance among open-source models on multimodal mathematical benchmarks and also performs strongly on text-only mathematical benchmarks.