Abstract: Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus on result-oriented performance but neglect the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively mitigated via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts yet fail to answer the sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
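
The four-dimensional metric can be illustrated with a small sketch. The abstract only spells out the Rote Memorization case (composite problem solved, sub-problems failed); the conditions used here for IK, IG, and CM are assumptions inferred from the category names, and `Outcome` and `classify` are hypothetical names, not part of the released evaluation code.

```python
from enum import Enum


class Outcome(Enum):
    IK = "Insufficient Knowledge"
    IG = "Inadequate Generalization"
    CM = "Complete Mastery"
    RM = "Rote Memorization"


def classify(sub_correct, composite_correct):
    """Assign one composite problem to a category.

    sub_correct: list of booleans, one per decomposed sub-problem.
    composite_correct: whether the composite problem was answered correctly.
    Only the RM rule is stated in the abstract; the rest is assumed.
    """
    if not sub_correct:
        raise ValueError("at least one sub-problem result is required")
    all_subs_right = all(sub_correct)
    if composite_correct:
        # Composite solved: mastery if every sub-problem is also solved,
        # otherwise the answer is treated as rote memorization.
        return Outcome.CM if all_subs_right else Outcome.RM
    # Composite failed: knowledge is present but does not generalize (IG),
    # or some of the required knowledge is missing (IK).
    return Outcome.IG if all_subs_right else Outcome.IK
```

Applied over a whole benchmark, the per-problem labels could be tallied (for instance with `collections.Counter`) to obtain the share of problems in each category.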

PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems

A Mathematical Benchmark for Inductive Theorem Provers

HARP: A challenging human-annotated math reasoning benchmark

TheoremQA: A Theorem-driven Question Answering dataset

ProcessBench: Identifying Process Errors in Mathematical Reasoning

CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts

Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

miniCodeProps: a Minimal Benchmark for Proving Code Properties

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

Benchmarking Large Language Models for Math Reasoning Tasks

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

TuringQ: Benchmarking AI Comprehension in Theory of Computation

Can Language Models Solve Olympiad Programming?

Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming