Abstract: The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking the dual perspective of the examiner regarding error identification and correction. From the examiner perspective, we define four evaluation tasks for error identification and correction along with a new dataset with annotated error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all models, while open-source model LLaMA-2-7B demonstrates comparable abilities to closed-source models GPT-3.5 and Gemini Pro. Notably, calculation error proves the most challenging error type. Moreover, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. These results reveal potential directions for developing the mathematical reasoning abilities of LLMs. Our code and dataset are available at https://github.com/LittleCirc1e/EIC.

What problem does this paper attempt to address?

This paper evaluates the mathematical reasoning capabilities of large language models (LLMs) from the examiner's perspective, focusing on error identification and correction. Existing evaluations mainly assess the problem-solving abilities of LLMs from the examinee's perspective, that is, the correctness of answers and the consistency of intermediate reasoning steps, and rarely address error identification and correction. The paper therefore defines four tasks to evaluate the error identification and correction capabilities of LLMs in mathematical reasoning, and constructs a new dataset with annotated error types and steps to evaluate the performance of 11 representative LLMs.

### Four evaluation tasks:

1. **Error-Presence Identification (EP)**: Determine whether there are any errors in the problem-solving process.
2. **Error-Step Identification (ES)**: Identify the first wrong step in the problem-solving process, which is the root cause of the error.
3. **Error-Type Identification (ET)**: Identify the error type in the first wrong step, such as a calculation error.
4. **Error Correction (EC)**: Correct the wrong step and obtain the final correct answer.

### Main findings:

1. **GPT-4 performs best**: Across all four tasks, GPT-4 outperforms the other models, followed by GLM-4. GPT-3.5, Gemini Pro, and LLaMA-2-7B each show strengths and weaknesses on different tasks.
2. **Calculation errors are the most difficult to identify and correct**: Although GPT-4 and GLM-4 perform well overall, they perform poorly in identifying and correcting calculation errors, indicating that the computational capabilities of LLMs need to be further enhanced.
3. **Challenges in error-type identification**: In the ET task, many error types are easily misidentified as calculation errors, and "missing step" is the most difficult error type to identify.
4. **Providing error-type information can significantly improve accuracy**: In the EC and ES tasks, providing error-type information increases the average accuracy by 47.9% and 45.9% respectively.
5. **Open-source models are sensitive to prompts**: The performance of open-source models depends heavily on the prompt, while closed-source models are more robust.

### Contributions:

1. **Defined four tasks**: The paper is the first to comprehensively evaluate the fine-grained capabilities of LLMs in error identification and correction.
2. **Defined nine common error types**: The paper provides a dataset based on these error types to evaluate in finer detail how LLMs handle different error scenarios.
3. **Comprehensively evaluated multiple LLMs**: An evaluation of four commercial models and seven open-source models yields useful insights for the subsequent development of LLMs.

Through these contributions, the paper fills a gap in existing research and suggests directions for the future development of LLMs.
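To make the four tasks concrete, the following is a minimal Python sketch of how per-task accuracy could be computed against annotated cases. The `Case` and `Prediction` schemas, the field names, and the exact-match scoring are assumptions made for illustration; they are not taken from the paper's dataset format or its evaluation code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    """One annotated solution. Hypothetical schema, not the dataset's real format."""
    has_error: bool            # EP gold label: does the solution contain an error?
    error_step: Optional[int]  # ES gold label: index of the first wrong step
    error_type: Optional[str]  # ET gold label, e.g. "calculation error"
    answer: str                # EC gold label: the final correct answer


@dataclass
class Prediction:
    """A model's output for one case, using the same hypothetical fields."""
    has_error: bool
    error_step: Optional[int]
    error_type: Optional[str]
    answer: str


def task_accuracy(cases, preds):
    """Return exact-match accuracy for EP, ES, ET, and EC as a dict."""
    if not cases or len(cases) != len(preds):
        raise ValueError("cases and preds must be non-empty and equally long")
    hits = {"EP": 0, "ES": 0, "ET": 0, "EC": 0}
    for gold, pred in zip(cases, preds):
        hits["EP"] += gold.has_error == pred.has_error
        hits["ES"] += gold.error_step == pred.error_step
        hits["ET"] += gold.error_type == pred.error_type
        hits["EC"] += gold.answer.strip() == pred.answer.strip()
    return {task: count / len(cases) for task, count in hits.items()}


# Toy data invented for this example only.
cases = [
    Case(True, 2, "calculation error", "12"),
    Case(True, 1, "missing step", "7"),
]
preds = [
    Prediction(True, 2, "calculation error", "12"),
    # Wrong step, and a missing step mislabeled as a calculation error.
    Prediction(True, 3, "calculation error", "7"),
]
scores = task_accuracy(cases, preds)
print(scores)
```

In this toy run the second prediction mislabels a missing step as a calculation error, the confusion pattern described in finding 3, so ES and ET each score 0.5 while EP and EC score 1.0.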

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Evaluating Mathematical Reasoning Beyond Accuracy

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Embedding Self-Correction as an Inherent Ability in Large Language Models for Enhanced Mathematical Reasoning

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Benchmarking Large Language Models for Math Reasoning Tasks

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From A Psychological Perspective

Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions

Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems

SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

Enhancing Mathematical Reasoning in LLMs by Stepwise Correction

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems

MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline

A Careful Examination of Large Language Model Performance on Grade School Arithmetic