FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

Yan Liu, Renren Jin, Ling Shi, Zheng Yao, Deyi Xiong
2024-09-06
Abstract: To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of the mathematical reasoning abilities of LLMs. All 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve them. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvement in the mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis of the evaluation process and evaluation methods, which have previously been overlooked. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. The dataset will be publicly available soon.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper attempts to address shortcomings in how the mathematical reasoning abilities of large language models (LLMs) are currently evaluated. Specifically:

1. **Lack of detailed evaluation criteria**: Existing evaluation datasets typically report only overall accuracy, which cannot show how well an LLM has mastered individual mathematical concepts and skills.
2. **Limitations of evaluation datasets**: Current Chinese mathematical evaluation datasets are mainly organized by grade level and do not classify problems by difficulty, which makes it hard to analyze LLM performance at different difficulty levels.
3. **Imperfect evaluation methods**: Existing evaluation methods often overlook important factors in the evaluation process, such as the choice of prompts and the form of evaluation, which can significantly affect a model's measured performance.

To address these issues, the paper proposes FineMath, a fine-grained mathematical evaluation benchmark dataset for assessing the mathematical reasoning abilities of Chinese LLMs. FineMath covers the main concepts of elementary school mathematics and subdivides them into 17 categories of math word problems. Each category is further divided into three difficulty levels based on the number of reasoning steps required to solve a problem. This structure allows LLM performance to be evaluated separately for each mathematical concept and each difficulty level.
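
The fine-grained scoring described above can be sketched in a few lines of Python. This is only an illustration, not the authors' evaluation code: the record fields (`category`, `num_steps`, `correct`), the category names, and the step thresholds (one step, two steps, three or more steps) are assumptions made for the example, since the summary states only that there are three levels based on the number of reasoning steps.

```python
from collections import defaultdict


def difficulty_level(num_steps: int) -> int:
    """Map a reasoning-step count to one of three levels.

    Assumed thresholds (not taken from the summary):
    1 step -> level 1, 2 steps -> level 2, 3 or more steps -> level 3.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return min(num_steps, 3)


def fine_grained_accuracy(records):
    """Return accuracy per (category, difficulty level) pair.

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'num_steps', and 'correct'.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = (record["category"], difficulty_level(record["num_steps"]))
        totals[key] += 1
        hits[key] += int(record["correct"])
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical graded model outputs; the category names are placeholders.
sample_records = [
    {"category": "fractions", "num_steps": 1, "correct": True},
    {"category": "fractions", "num_steps": 1, "correct": False},
    {"category": "fractions", "num_steps": 4, "correct": True},
    {"category": "percentages", "num_steps": 2, "correct": True},
]

for key, accuracy in sorted(fine_grained_accuracy(sample_records).items()):
    print(key, accuracy)
```

Reporting one accuracy per category and level, rather than a single overall score, is what lets a benchmark of this kind show where a model fails, for example on multi-step problems within a single concept.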