MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, Kai Chen
2024-05-21
Abstract: Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short in providing a holistic assessment of the LLMs' math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model's mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs' mathematical abilities, providing a nuanced view of their knowledge understanding levels and problem-solving skills in a bilingual context. The project is released at
Computation and Language
What problem does this paper attempt to address?
The paper addresses shortcomings in how the mathematical capabilities of large language models (LLMs) are currently evaluated. Existing benchmarks such as GSM8k focus on a single dimension of problem-solving ability, so they cannot show how well a model understands and applies mathematical knowledge across difficulty levels and subject areas. To close this gap, the paper introduces MathBench, a new bilingual benchmark designed to systematically evaluate both the theoretical understanding and the practical problem-solving abilities of LLMs across a wide range of mathematical disciplines. MathBench is organized into five stages, from basic arithmetic to college-level mathematics, and each stage contains both theoretical questions and application problems.

The main contributions of MathBench are:

1. **Introducing a five-level difficulty mechanism** that incorporates a multi-tiered knowledge system.
2. **Covering various types of questions from basic mathematical concepts to real-world application scenarios**.
3. **Conducting extensive experiments** to identify the current bottlenecks of LLMs in solving diverse and complex mathematical problems, and to suggest research directions for improving their mathematical capabilities.

Together, these design choices let MathBench evaluate not only a model's understanding of theoretical knowledge but also its ability to apply that knowledge in practical scenarios, which makes it a useful resource for researchers and developers.
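To make the stage-by-question-type structure concrete, the following is a minimal sketch of how results from a benchmark with five stages and two question types could be aggregated into per-cell accuracy. It is not MathBench's actual evaluation code: the `Result` record, the integer stage index, the `"theory"`/`"application"` labels, and the function name are all assumptions made for illustration, and the sample records are invented toy data rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded benchmark item (hypothetical schema, not from the paper)."""
    stage: int      # assumed index: 1 (basic arithmetic) .. 5 (college mathematics)
    kind: str       # assumed label: "theory" or "application"
    correct: bool   # whether the model answered this item correctly


def accuracy_by_stage_and_kind(results):
    """Return {(stage, kind): accuracy} over the given results."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r.stage, r.kind)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy, made-up records purely for illustration.
toy = [
    Result(1, "theory", True),
    Result(1, "application", True),
    Result(1, "application", False),
    Result(5, "theory", True),
    Result(5, "application", False),
]

scores = accuracy_by_stage_and_kind(toy)
for (stage, kind), acc in sorted(scores.items()):
    print(f"stage {stage} {kind:<12} accuracy = {acc:.2f}")
```

Reporting one score per (stage, question type) cell, rather than a single aggregate number, is what allows a gap between theoretical understanding and applied problem solving to become visible at each difficulty level.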