Abstract: Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems, but they struggle with mathematical diagrams since they are primarily trained on natural scene images. For humans, visual aids generally enhance problem-solving, but MLLMs perform worse as information shifts from the textual to the visual modality. This decline is mainly due to their shortcomings in aligning images and text. To tackle these challenges, we propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment. This approach is designed to improve the mathematical reasoning skills of MLLMs through a three-stage training process, with the second stage being the critical alignment stage. We first enhance the language model's mathematical reasoning capabilities with an extensive set of textual mathematical problems. We then construct a multimodal dataset with varying degrees of textual and visual information, creating data pairs by presenting each problem in at least two forms. By leveraging the Kullback-Leibler (KL) divergence of next-token prediction distributions to align visual and textual modalities, we ensure consistent problem-solving abilities. Finally, we utilize multimodal instruction tuning for MLLMs with high-quality multimodal data. Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities. The code and data are available at: https://github.com/wwzhuang01/Math-PUMA.

What problem does this paper attempt to address?

This paper addresses the poor performance of multimodal large language models (MLLMs) on math problems, especially when images are involved. MLLMs solve text-based math problems well but struggle with math diagrams, because they are trained mainly on natural-scene images. As a result, they align image and text information poorly, which hurts their problem-solving ability.

### Main problem summary:

1. **Modal difference**: MLLMs show a significant performance gap between math problems presented in different modalities (text vs. image). Performance declines as information shifts from the text to the visual modality.
2. **Alignment difficulty**: Existing models have difficulty aligning image and text information effectively, which limits their performance on multimodal math reasoning tasks.
3. **Domain difference**: Because the training data consists mainly of natural-scene images rather than abstract math graphics, the models handle math diagrams poorly.

To solve these problems, the authors propose Math-PUMA (Progressive Upward Multimodal Alignment), a method that enhances the math reasoning ability of MLLMs through a three-stage training process and, in particular, narrows the performance gap between modalities.

### Core ideas of Math-PUMA:

- **Stage 1**: Train the language model on a large set of text-based math problems to enhance its math reasoning ability.
- **Stage 2**: Construct data pairs that present each problem in different modal forms, and align the next-token prediction distributions of the visual and text modalities using Kullback-Leibler (KL) divergence to ensure consistent problem-solving ability.
- **Stage 3**: Apply instruction fine-tuning with high-quality multimodal data to further improve performance on multimodal math reasoning tasks.

With this method, Math-PUMA improves the performance of MLLMs on multimodal math reasoning tasks and narrows the performance gap between modalities.
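
The Stage 2 alignment idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy logits, and the choice of treating the text-rich form as the reference distribution in KL(P‖Q) are all assumptions made for illustration. It computes the KL divergence between next-token distributions produced for two forms of the same problem and averages it over token positions.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def alignment_loss(text_logits, visual_logits):
    """Mean per-position KL between next-token distributions.

    text_logits / visual_logits: one list of vocabulary logits per token
    position, obtained for the text-rich and vision-rich forms of the same
    problem. Using the text-rich form as the reference is an assumption here.
    """
    if not text_logits or len(text_logits) != len(visual_logits):
        raise ValueError("both inputs need the same non-zero number of positions")
    per_position = [
        kl_divergence(softmax(t), softmax(v))
        for t, v in zip(text_logits, visual_logits)
    ]
    return sum(per_position) / len(per_position)


# Toy logits: 2 token positions, vocabulary of 3 tokens (hypothetical values).
text_rich = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
vision_rich = [[1.0, 1.0, -0.5], [0.0, 1.2, 0.9]]

loss = alignment_loss(text_rich, vision_rich)
identical = alignment_loss(text_rich, text_rich)
print(f"alignment loss: {loss:.4f}, identical inputs: {identical:.4f}")
```

The loss is zero when both forms yield the same next-token distributions and grows as they diverge, so minimizing it pushes the model toward answering a problem the same way regardless of how much of it is shown visually. In a real training setup the logits would come from the model's forward passes and the loss would be computed with a tensor library so that gradients can flow.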

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

Maths: Multimodal Transformer-Based Human-Readable Solver

MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation