FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

Yan Liu, Renren Jin, Ling Shi, Zheng Yao, Deyi Xiong

2024-09-06

Abstract: To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath covers the key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of the mathematical reasoning abilities of LLMs. All 17 categories of math word problems are manually annotated with difficulty levels according to the number of reasoning steps required to solve them. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvement in the mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis of the evaluation process and evaluation methods, two factors that have previously been overlooked. Both significantly influence model results and our understanding of the models' mathematical reasoning capabilities. The dataset will be publicly available soon.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses shortcomings in how the mathematical reasoning abilities of large language models (LLMs) are currently evaluated. Specifically:

1. **Lack of detailed evaluation criteria**: Existing evaluation datasets typically report only overall accuracy, which cannot fully reflect an LLM's mastery of different mathematical concepts and skills.
2. **Limitations of evaluation datasets**: Current Chinese mathematical evaluation datasets are mainly categorized by grade level and lack a detailed classification of problem difficulty, making it hard to analyze LLM performance at different difficulty levels.
3. **Imperfect evaluation methods**: Existing evaluation methods often overlook important factors in the evaluation process, such as the choice of prompts and the form of evaluation, which can significantly affect a model's measured performance.

To address these issues, the paper proposes FineMath, a fine-grained mathematical evaluation benchmark dataset for assessing the mathematical reasoning abilities of Chinese LLMs. FineMath covers the main concepts of elementary school mathematics and subdivides them into 17 categories of math word problems. Each category is further divided into three difficulty levels based on the number of reasoning steps required to solve the problems. This lets FineMath evaluate LLM performance separately for each mathematical concept and each difficulty level.
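The per-concept, per-difficulty breakdown described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record fields (`category`, `level`, `prediction`, `answer`), the category names, and the exact-match scoring are all assumptions made for illustration, and the source does not state how reasoning-step counts map to the three levels.

```python
from collections import defaultdict


def fine_grained_accuracy(records):
    """Return accuracy keyed by (category, difficulty level).

    Each record is a dict with hypothetical fields: 'category', 'level',
    'prediction', and 'answer'. Scoring is plain exact match, which is an
    assumption rather than the benchmark's documented procedure.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = (record["category"], record["level"])
        totals[key] += 1
        correct[key] += int(record["prediction"] == record["answer"])
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical model outputs; category names and levels are illustrative only.
records = [
    {"category": "percentage", "level": 1, "prediction": "12", "answer": "12"},
    {"category": "percentage", "level": 1, "prediction": "7", "answer": "8"},
    {"category": "percentage", "level": 3, "prediction": "40", "answer": "40"},
    {"category": "geometry", "level": 2, "prediction": "9", "answer": "10"},
]

for key, accuracy in sorted(fine_grained_accuracy(records).items()):
    print(key, accuracy)
```

Reporting one number per (category, level) pair, rather than a single overall accuracy, is what makes it possible to see which concepts or difficulty levels a model struggles with.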

ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities

CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Evaluating Mathematical Reasoning Beyond Accuracy

MARIO Eval: Evaluate Your Math LLM with your Math LLM--A mathematical dataset evaluation toolkit

Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula

MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

Benchmarking Large Language Models for Math Reasoning Tasks

Large Language Models for Mathematical Reasoning: Progresses and Challenges

CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?

Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist