MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Yuxiao Dong, Jie Tang
2024-09-10
Abstract: Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems but ignore the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we aim to construct a fine-tuning dataset named MathVL, and develop a series of specialized mathematical MLLMs termed MathGLM-Vision by conducting Supervised Fine-Tuning (SFT) on MathVL with various parameter-scale backbones. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and our curated MathVL-test consisting of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements compared with some existing models, including backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the limited use of visual information by current multimodal large language models (MLLMs) when solving mathematical problems, especially non-geometric ones. Specifically:

1. **Insufficient diversity of visual information**: Existing MLLMs specialized in mathematics focus mainly on geometric problems and neglect the visual information found in other areas of mathematics (such as arithmetic, statistics, algebra, and word problems).
2. **Limitations of datasets**: The fine-tuning datasets used for specialized mathematical MLLMs usually come from a few public datasets that are limited in diversity and complexity, which restricts the models' ability to solve a broader range of mathematical problems.
3. **Inadequate ability to handle multiple image inputs**: Existing specialized mathematical MLLMs are designed primarily for single-image inputs and cannot process multiple images at once, which limits them on complex problems that require integrating information from several visual sources.

To address these issues, the paper makes the following contributions:

- **A fine-tuning dataset named MathVL**: The dataset combines open-source data with Chinese-language data collected from K12 education in China. It covers mathematical problems of various types and difficulty levels, with the aim of improving the model's ability to process visual and textual information together.
- **A series of specialized mathematical MLLMs named MathGLM-Vision**: The models are obtained by supervised fine-tuning (SFT) on MathVL, starting from backbone models of different parameter scales, to improve their ability to solve complex mathematical problems that include visual elements.
- **Extensive experimental evaluation**: Experiments on several public benchmarks and on the curated MathVL-test set of 2,000 problems are used to verify the effectiveness of MathGLM-Vision. The results show that MathGLM-Vision achieves significant improvements over some existing models, including its backbone models and open-source mathematical MLLMs, on mathematical problems that include visual information, and that it performs especially well in geometry and statistics.

In summary, the paper aims to improve model performance on a more diverse range of complex mathematical problems that include visual information, by constructing a high-quality multimodal dataset and developing specialized mathematical multimodal models.