MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, Zhi Tang

2024-08-30

Abstract: The rapid development of large language models (LLMs) has spurred extensive research into their domain-specific capabilities, particularly mathematical reasoning. However, most open-source LLMs focus solely on text-based mathematical reasoning and neglect visual input, even though many mathematical tasks rely on visual inputs such as geometric diagrams, charts, and function plots. To fill this gap, we introduce MultiMath-7B, a multimodal large language model that bridges math and vision. MultiMath-7B is trained through a four-stage process: vision-language alignment, visual instruction-tuning, math instruction-tuning, and process-supervised reinforcement learning. We also construct a novel, diverse, and comprehensive multimodal mathematical dataset, MultiMath-300K, which spans K-12 levels and includes image captions and step-wise solutions. MultiMath-7B achieves state-of-the-art (SOTA) performance among open-source models on existing multimodal mathematical benchmarks and also excels on text-only mathematical benchmarks. Our model and dataset are available at https://github.com/pengshuai-rin/MultiMath.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

### The Problem the Paper Aims to Solve

This paper addresses the limitations of current large language models (LLMs) on mathematical reasoning tasks, particularly their weak performance when visual inputs must be integrated. Specifically:

1. **Insufficient Multimodal Mathematical Reasoning Ability**: Most open-source LLMs focus on text-only mathematical reasoning and neglect visual inputs, even though many mathematical tasks depend on visual information such as geometric figures, charts, and function plots.
2. **Limitations of Existing Models**: Specialized mathematical LLMs (e.g., G-LLaVA) perform well on geometric problems but poorly on other multimodal mathematical reasoning tasks; other models (e.g., Math-LLaVA) show unsatisfactory performance on text-only mathematical reasoning tasks.

To address these issues, the researchers propose MultiMath-7B, a multimodal large language model designed to handle a range of mathematical reasoning tasks in both multimodal and text-only settings. The paper also constructs a new multimodal mathematical dataset, MultiMath-300K, for training and evaluating the model. The authors report that MultiMath-7B achieves state-of-the-art performance among open-source models on existing multimodal mathematical benchmarks and also performs strongly on text-only mathematical benchmarks.

MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Maths: Multimodal Transformer-Based Human-Readable Solver

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models