Abstract: Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus on result-oriented performance but neglect the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively alleviated via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts yet fail to answer the sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
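As an illustration of how the four-dimensional metric could be applied to one composite problem and its sub-problems, the following is a minimal sketch, not the benchmark's released evaluation code. The abstract only spells out Rote Memorization (composite solved, sub-problems failed); the conditions used here for IK, IG, and CM are an assumed reading based on the category names, and `Outcome` and `classify` are hypothetical names introduced for this example.

```python
from enum import Enum


class Outcome(Enum):
    IK = "Insufficient Knowledge"
    IG = "Inadequate Generalization"
    CM = "Complete Mastery"
    RM = "Rote Memorization"


def classify(sub_correct, composite_correct):
    """Assign one of the four categories to a composite problem.

    sub_correct: list of booleans, one per decomposed sub-problem.
    composite_correct: whether the model solved the composite problem.

    Assumed reading (only RM is described in the abstract):
      - CM: every sub-problem and the composite are solved.
      - RM: the composite is solved but some sub-problem is not.
      - IG: every sub-problem is solved but the composite is not.
      - IK: the composite and at least one sub-problem are not solved.
    """
    if not sub_correct:
        raise ValueError("at least one sub-problem result is required")
    all_subs_solved = all(sub_correct)
    if composite_correct:
        return Outcome.CM if all_subs_solved else Outcome.RM
    return Outcome.IG if all_subs_solved else Outcome.IK


if __name__ == "__main__":
    # A model that answers the composite problem but misses a sub-problem.
    print(classify([True, False], composite_correct=True).value)
```

Under this reading, counting the outcomes over all composite problems gives the per-category rates that the abstract compares across models.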

Maths: Multimodal Transformer-Based Human-Readable Solver

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Specialized Mathematical Solving by a Step-By-Step Expression Chain Generation

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark

Solving Math Word Problems by Combining Language Models With Symbolic Solvers

Describe-then-Reason: Improving Multimodal Mathematical Reasoning through Visual Comprehension Training

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning

Multi-tool Integration Application for Math Reasoning Using Large Language Model

Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations

Measuring Mathematical Problem Solving With the MATH Dataset

AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs