Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Wenwen Zhuang,Xin Huang,Xiantao Zhang,Jin Zeng
2024-09-25
Abstract: Multimodal Large Language Models (MLLMs) excel in solving text-based mathematical problems, but they struggle with mathematical diagrams since they are primarily trained on natural scene images. For humans, visual aids generally enhance problem-solving, but MLLMs perform worse as information shifts from textual to visual modality. This decline is mainly due to their shortcomings in aligning images and text. To tackle the aforementioned challenges, we propose Math-PUMA, a methodology focused on Progressive Upward Multimodal Alignment. This approach is designed to improve the mathematical reasoning skills of MLLMs through a three-stage training process, with the second stage being the critical alignment stage. We first enhance the language model's mathematical reasoning capabilities with an extensive set of textual mathematical problems. We then construct a multimodal dataset with varying degrees of textual and visual information, creating data pairs by presenting each problem in at least two forms. By leveraging the Kullback-Leibler (KL) divergence of next-token prediction distributions to align visual and textual modalities, consistent problem-solving abilities are ensured. Finally, we utilize multimodal instruction tuning for MLLMs with high-quality multimodal data. Experimental results on multiple mathematical reasoning benchmarks demonstrate that the MLLMs trained with Math-PUMA surpass most open-source MLLMs. Our approach effectively narrows the performance gap for problems presented in different modalities. The code and data are available at: https://github.com/wwzhuang01/Math-PUMA.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the poor performance of multimodal large language models (MLLMs) on math problems that involve images. MLLMs solve text-based math problems well, but they struggle with mathematical diagrams because they are trained mainly on natural-scene images. As a result, they align image and text information poorly, which weakens their problem-solving ability.

### Main problem summary:

1. **Modality gap**: MLLMs show a significant performance gap between math problems presented as text and those presented as images. Performance declines as information shifts from the textual to the visual modality.
2. **Alignment difficulty**: Existing models have difficulty aligning image and text information effectively, which limits their performance on multimodal math reasoning tasks.
3. **Domain difference**: Because the training data consists mainly of natural-scene images rather than abstract mathematical graphics, the models perform poorly on math diagrams.

To solve these problems, the authors propose Math-PUMA (Progressive Upward Multimodal Alignment), a three-stage training process that improves the math reasoning ability of MLLMs and, in particular, narrows the performance gap between modalities.

### Core ideas of Math-PUMA:

- **Stage 1**: Train the language model on a large set of text-based math problems to strengthen its math reasoning ability.
- **Stage 2**: Construct data pairs that present each problem in at least two forms with different mixes of textual and visual information, and align the next-token prediction distributions of the visual and textual modalities using the Kullback-Leibler (KL) divergence, so that problem-solving ability is consistent across forms.
- **Stage 3**: Apply instruction fine-tuning with high-quality multimodal data to further improve performance on multimodal math reasoning tasks.

According to the reported experiments, models trained with Math-PUMA surpass most open-source MLLMs on multiple mathematical reasoning benchmarks, and the method narrows the performance gap between problems presented in different modalities.
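
The Stage 2 alignment idea can be illustrated with a minimal sketch in plain Python. It computes the KL divergence between the next-token distributions that a model might produce for the text-rich and vision-rich forms of the same problem, averaged over token positions. The function names, the direction KL(text || visual), the temperature parameter, and the averaging are assumptions made for illustration; the abstract states only that KL divergence of next-token prediction distributions is used, and this is not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert a list of logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def alignment_loss(text_logits, visual_logits, temperature=1.0):
    """Mean per-position KL(text || visual).

    Hypothetical helper: each argument is a list of per-position logit
    vectors for the same answer tokens, produced from the text-rich and
    vision-rich form of one problem. The KL direction is an assumption.
    """
    if len(text_logits) != len(visual_logits):
        raise ValueError("both inputs must cover the same token positions")
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(v, temperature))
        for t, v in zip(text_logits, visual_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: a 3-token vocabulary and one token position.
same = alignment_loss([[2.0, 0.0, -1.0]], [[2.0, 0.0, -1.0]])
different = alignment_loss([[2.0, 0.0, -1.0]], [[0.0, 2.0, -1.0]])
```

In this toy example `same` is zero because the two distributions coincide, while `different` is positive because the two forms disagree on the most likely next token. Minimizing such a loss pushes the model's predictions on the vision-rich form toward its predictions on the text-rich form.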