Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, Hongsheng Li
2024-02-23
Abstract: Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models approaching human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs. Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development. The project is available at
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning, History and Overview
What problem does this paper attempt to address?
This paper addresses the gap between the mathematical reasoning of current Large Multimodal Models (LMMs) and that of humans, a gap that existing benchmarks do not fully expose. The authors find that existing benchmarks such as MathVista are limited in the diversity of their questions and in the breadth of mathematical subjects they cover. To address this, they propose a new dataset, MATH-Vision (MATH-V), consisting of 3,040 high-quality problems with visual contexts drawn from real math competitions, spanning 16 mathematical disciplines and graded across 5 difficulty levels. The dataset is intended to evaluate the mathematical reasoning abilities of LMMs comprehensively. Experiments show that although LMMs approach human-level performance on existing benchmarks, they lag notably behind humans on MATH-V. The detailed categorization also supports an error analysis, which points to deficiencies of current state-of-the-art LMMs in handling invariance properties of geometric objects, such as properties preserved under continuous deformations. These results indicate that the mathematical reasoning of LMMs has not yet reached human level and that considerable room for improvement remains. In summary, the paper's main contributions are a more comprehensive and challenging benchmark for multimodal mathematical reasoning, an identification of the limitations of existing models, and guidance for future research and development.
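As an illustration of how a categorization by subject and difficulty level can support error analysis, the following minimal Python sketch breaks accuracy down along either axis. The record fields (`subject`, `level`, `correct`) and the sample rows are hypothetical toy data chosen for illustration; they are not the dataset's actual schema and not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return accuracy per distinct value of `key` (e.g. subject or level)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical graded model outputs (toy data, not results from the paper).
records = [
    {"subject": "topology", "level": 5, "correct": False},
    {"subject": "topology", "level": 3, "correct": True},
    {"subject": "algebra", "level": 1, "correct": True},
    {"subject": "algebra", "level": 5, "correct": False},
]

by_subject = accuracy_by(records, "subject")
by_level = accuracy_by(records, "level")
```

Grouping the same graded records by different keys makes it possible to see whether errors concentrate in particular disciplines or at higher difficulty levels.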