Abstract: Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems and ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we construct a fine-tuning dataset named MathVL and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with backbones of various parameter scales. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and on our curated MathVL-test, which consists of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.

What problem does this paper attempt to address?

The main problem this paper attempts to address is the insufficient utilization of visual information in current Multimodal Large Language Models (MLLMs) for solving mathematical problems, especially for non-geometric mathematical problems. Specifically:

1. **Insufficient diversity of visual information**: Existing MLLMs specialized in mathematics mainly focus on solving geometric problems, neglecting the diversity and importance of visual information in other areas of mathematics (such as arithmetic, statistics, algebra, and word problems).
2. **Limitations of datasets**: The current fine-tuning datasets used for specialized mathematical MLLMs usually come from a few public datasets, which have limitations in diversity and complexity, restricting the model's ability to solve a broader range of mathematical problems.
3. **Inadequate ability to handle multiple image inputs**: Existing specialized mathematical MLLMs are primarily designed to handle single image inputs, lacking the ability to process multiple images simultaneously, which limits their capability to solve complex problems that require integrating information from multiple visual sources.

To address these issues, the paper proposes the following solutions:

- **Constructed a fine-tuning dataset named MathVL**: This dataset includes not only open-source data but also Chinese data specifically collected from the K12 education level in China, covering various types and difficulty levels of mathematical problems, aiming to enhance the model's ability to process visual and textual information.
- **Developed a series of specialized mathematical multimodal large language models named MathGLM-Vision**: By performing supervised fine-tuning (SFT) on the MathVL dataset, these models are optimized based on backbone models of different parameter scales to improve their ability to solve complex mathematical problems that include visual elements.
- **Conducted extensive experimental evaluations**: Experiments were conducted on multiple public benchmark datasets and the self-built MathVL-test dataset to verify the effectiveness of MathGLM-Vision. The experimental results show that MathGLM-Vision significantly outperforms existing models in solving mathematical problems that include visual information, especially excelling in the fields of geometry and statistics.

In summary, this paper aims to enhance the performance and diversity of models in solving complex mathematical problems that include visual information by constructing high-quality multimodal datasets and developing specialized mathematical multimodal models.

MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
