Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Joykirat Singh, Akshay Nambi, Vibhav Vineet
2024-06-16
Abstract: Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impacts, revolutionizing how these complex problems are approached and solved in various domains including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models, such as GPT-4o, GPT-4, GPT-3.5Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, impacting the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorous evaluation of reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses a gap in how large language models (LLMs) are evaluated on math word problems (MWPs): a model can reach high final-answer accuracy while its reasoning remains unreliable, particularly when it has to detect and correct mistakes in a chain of reasoning steps. Current evaluations mostly score the final answer and neglect the reasoning process. To close this gap, the paper introduces MWP-MISTAKE, a dataset that pairs MWPs with both correct and incorrect reasoning steps, and uses it to evaluate LLMs on mathematical reasoning, mistake detection, and mistake correction.

Specifically, the main objectives of the paper are:

1. **Comprehensively evaluating the mathematical reasoning abilities of LLMs**, with a particular focus on their ability to detect and correct errors in the reasoning process.
2. **Identifying the specific strengths and weaknesses of models in handling different types of mathematical challenges**.
3. **Proposing potential directions to enhance the generalization and robustness of LLMs in solving mathematical problems**.

To achieve these goals, the researchers built MWP-MISTAKE so that it contains correct solution steps alongside incorrect reasoning steps generated by carefully designed rules and by smaller language models. Benchmarking multiple state-of-the-art LLMs (such as GPT-4o, GPT-4, and GPT-3.5Turbo) showed that GPT-4o performs best at mistake detection and correction, while smaller models continue to struggle. The study also identifies data contamination and memorization as issues that affect the reliability of LLMs in real-world applications, which underscores the need for rigorous evaluation of reasoning processes rather than final answers alone.