Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li

2024-02-23

Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs. Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development. The project is available at

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, History and Overview

What problem does this paper attempt to address?

This paper addresses how to measure the gap between current Large Multimodal Models (LMMs) and humans in mathematical reasoning over visual contexts. The authors find that existing benchmarks are limited in both the diversity of question types and the range of mathematical subjects they cover. In response, they propose a new dataset, MATH-Vision (MATH-V), consisting of 3,040 high-quality problems drawn from real math competitions, spanning 16 mathematical disciplines and 5 difficulty levels, and intended to evaluate the mathematical reasoning abilities of LMMs comprehensively. Experiments show that although LMMs approach human-level performance on existing benchmarks such as MathVista, they still lag significantly behind humans on MATH-V. The paper's detailed categorization also supports an error analysis, which points to deficiencies of current state-of-the-art LMMs in handling the invariance properties of geometric objects, for example under continuous deformations. Together, these results indicate that the mathematical reasoning of LMMs has not yet reached human level and that considerable room for improvement remains. The paper's main contributions are thus a more comprehensive and challenging benchmark for multimodal mathematical reasoning, an identification of the limitations of existing models, and guidance for future research and development.
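To make the role of the subject and difficulty labels concrete, here is a minimal sketch of how per-category accuracy might be computed from model outputs. It is not the authors' evaluation code: the record fields (`subject`, `level`, `answer`, `prediction`), the exact-match scoring, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group records by records[i][key] and return {group: accuracy}.

    A prediction counts as correct only on exact string match with the
    reference answer (an assumed scoring rule, chosen for simplicity).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {group: correct[group] / total[group] for group in total}


# Toy records with hypothetical field names and placeholder values;
# these are not taken from the dataset.
records = [
    {"subject": "algebra", "level": 1, "answer": "A", "prediction": "A"},
    {"subject": "algebra", "level": 3, "answer": "12", "prediction": "10"},
    {"subject": "topology", "level": 5, "answer": "B", "prediction": "C"},
    {"subject": "topology", "level": 1, "answer": "D", "prediction": "D"},
]

by_subject = accuracy_by(records, "subject")
by_level = accuracy_by(records, "level")
print(by_subject)
print(by_level)
```

Breaking accuracy down along both axes is what lets a benchmark of this kind show where a model fails (a particular discipline, or only the harder levels) rather than reporting a single aggregate score.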

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning

MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Improving Multimodal LLMs Ability In Geometry Problem Solving, Reasoning, And Multistep Scoring