Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, Fuli Feng
2024-06-02
Abstract: The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking the dual perspective of the examiner regarding error identification and correction. From the examiner perspective, we define four evaluation tasks for error identification and correction, along with a new dataset with annotated error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all models, while the open-source model LLaMA-2-7B demonstrates comparable abilities to the closed-source models GPT-3.5 and Gemini Pro. Notably, calculation error proves the most challenging error type. Moreover, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. These results reveal potential directions for developing the mathematical reasoning abilities of LLMs. Our code and dataset are available at https://github.com/LittleCirc1e/EIC.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper evaluates the mathematical reasoning of large language models (LLMs) from the examiner's perspective, focusing on error identification and correction. Existing evaluations mostly take the examinee's perspective: they assess problem-solving ability through the correctness of final answers and the consistency of intermediate reasoning steps, and rarely test whether a model can find and fix errors in a given solution. To address this, the paper defines four tasks for error identification and correction in mathematical reasoning and constructs a new dataset, annotated with error types and error steps, on which it evaluates 11 representative LLMs.

### Four evaluation tasks:

1. **Error-Presence Identification (EP)**: Determine whether the problem-solving process contains any error.
2. **Error-Step Identification (ES)**: Identify the first wrong step in the problem-solving process, which is the root cause of the error.
3. **Error-Type Identification (ET)**: Identify the type of error in the first wrong step, such as a calculation error.
4. **Error Correction (EC)**: Correct the wrong step and obtain the correct final answer.

### Main findings:

1. **GPT-4 performs best**: Across all four tasks, GPT-4 outperforms the other models, followed by GLM-4. GPT-3.5, Gemini Pro, and LLaMA-2-7B each have strengths and weaknesses on different tasks.
2. **Calculation errors are the most difficult to identify and correct**: Although GPT-4 and GLM-4 perform well overall, they perform poorly at identifying and correcting calculation errors, indicating that the computational capabilities of LLMs need further improvement.
3. **Challenges in error-type identification**: In the ET task, many error types are easily misidentified as calculation errors, and "missing step" is the most difficult error type to identify.
4. **Providing error-type information significantly improves accuracy**: In the EC and ES tasks, providing the error type increases average accuracy by 47.9% and 45.9%, respectively.
5. **Open-source models are sensitive to prompts**: The performance of open-source models depends heavily on the prompt, while closed-source models are more robust.

### Contributions:

1. **Defined four tasks**: The tasks enable, for the first time, a comprehensive evaluation of the fine-grained error identification and correction capabilities of LLMs.
2. **Defined nine common error types**: The paper provides a dataset built on these error types to evaluate more closely how LLMs handle different error scenarios.
3. **Comprehensively evaluated multiple LLMs**: An evaluation of four commercial models and seven open-source models yields insights for the subsequent development of LLMs.

Through these contributions, the paper fills a gap in existing research and suggests directions for the future development of LLMs.
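
To illustrate how the four tasks (EP, ES, ET, EC) could be scored against annotated solutions, here is a minimal Python sketch. The `Annotation` and `Prediction` schemas, their field names, and the exact-match scoring are assumptions made for illustration; they are not taken from the paper's released code or dataset format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

TASKS = ("EP", "ES", "ET", "EC")


@dataclass
class Annotation:
    """Gold labels for one solution (hypothetical schema)."""
    has_error: bool
    first_error_step: Optional[int]  # 1-based step index; None if no error
    error_type: Optional[str]        # e.g. "calculation error"
    correct_answer: str


@dataclass
class Prediction:
    """Model outputs for the four tasks on the same solution (hypothetical schema)."""
    has_error: bool
    first_error_step: Optional[int]
    error_type: Optional[str]
    corrected_answer: Optional[str]


def score(gold: Annotation, pred: Prediction) -> Dict[str, int]:
    """Return a 0/1 score per task, using exact match (an assumed metric)."""
    return {
        "EP": int(pred.has_error == gold.has_error),
        "ES": int(pred.first_error_step == gold.first_error_step),
        "ET": int(pred.error_type == gold.error_type),
        "EC": int(pred.corrected_answer == gold.correct_answer),
    }


def accuracy(pairs: List[Tuple[Annotation, Prediction]]) -> Dict[str, float]:
    """Average the per-task scores over a list of (gold, prediction) pairs."""
    if not pairs:
        raise ValueError("need at least one (gold, prediction) pair")
    totals = {task: 0 for task in TASKS}
    for gold, pred in pairs:
        for task, value in score(gold, pred).items():
            totals[task] += value
    return {task: totals[task] / len(pairs) for task in TASKS}


# Example: the model detects the error and fixes the answer,
# but points at step 3 instead of the annotated first wrong step 2.
gold = Annotation(True, 2, "calculation error", "12")
pred = Prediction(True, 3, "calculation error", "12")
print(score(gold, pred))  # {'EP': 1, 'ES': 0, 'ET': 1, 'EC': 1}
```

Scoring each task separately makes visible the case the summary highlights: a model can notice that a solution is wrong (EP) and still misplace the first wrong step (ES) or mislabel its type (ET).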