Abstract: Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short of providing a holistic assessment of LLMs' math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model's mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs' mathematical abilities, providing a nuanced view of their levels of knowledge understanding and their problem-solving skills in a bilingual context. The project is released at

What problem does this paper attempt to address?

The paper addresses shortcomings in how the mathematical capabilities of large language models (LLMs) are currently evaluated. Existing benchmarks such as GSM8k focus on a single dimension of problem-solving ability, so they cannot show how well a model understands and applies mathematical knowledge across different difficulty levels and subject areas.

To close this gap, the paper introduces MathBench, a new bilingual benchmark designed to systematically evaluate LLMs' theoretical understanding and practical problem-solving abilities across a wide range of mathematical disciplines. MathBench is organized into five stages, from basic arithmetic to college-level mathematics, and each stage contains both theoretical and applied problems.

Specifically, the main contributions of MathBench include:

1. **Introducing a five-level difficulty mechanism** that incorporates a multi-tiered knowledge system.
2. **Covering various types of questions from basic mathematical concepts to real-world application scenarios**.
3. **Conducting extensive experiments** to identify the current bottlenecks of LLMs in solving diverse and complex mathematical problems, and providing new research directions to enhance their mathematical capabilities.

Through these designs, MathBench evaluates the mathematical capabilities of LLMs more comprehensively, covering both the understanding of theoretical knowledge and the ability to apply it in real-world scenarios, and so provides a valuable resource for researchers and developers.
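To make the stage-by-question-type structure concrete, the sketch below shows one way results from such a benchmark could be aggregated into a per-stage, per-type accuracy grid. It is an illustration only and not MathBench's actual code or data format: the `Result` record, its field names, the numeric stage indices 1 to 5, and the `accuracy_by_cell` helper are all hypothetical, and the sample results are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded model answer (hypothetical record, not MathBench's schema)."""
    stage: int     # assumed index: 1 = basic arithmetic ... 5 = college mathematics
    kind: str      # assumed labels: "theory" or "application"
    correct: bool  # whether the model's answer was judged correct


def accuracy_by_cell(results):
    """Return {(stage, kind): accuracy} for every cell that has at least one result."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r.stage, r.kind)
        totals[key] += 1
        hits[key] += r.correct  # True counts as 1, False as 0
    return {key: hits[key] / totals[key] for key in totals}


# Made-up sample data, purely to exercise the function.
results = [
    Result(1, "theory", True),
    Result(1, "application", True),
    Result(5, "theory", True),
    Result(5, "theory", False),
    Result(5, "application", False),
]
scores = accuracy_by_cell(results)
for (stage, kind), acc in sorted(scores.items()):
    print(f"stage {stage} {kind:<12} {acc:.2f}")
```

Keeping theory and application scores in separate cells, rather than averaging them into one number per stage, is what lets a reader see whether a model knows a concept but fails to apply it.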

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
