Abstract: Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impacts, revolutionizing how these complex problems are approached and solved in various domains including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models, such as GPT-4o, GPT-4, GPT-3.5Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, impacting the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorous evaluation of reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.

What problem does this paper attempt to address?

This paper addresses a gap in how large language models (LLMs) are evaluated on math word problems (MWPs). Current evaluations usually score only the final answer, so high accuracy says little about whether a model reasons soundly, and in particular whether it can detect and correct errors in a reasoning process. To close this gap, the paper introduces a dataset called MWP-MISTAKE, which includes both correct and incorrect reasoning steps, and uses it to evaluate LLMs' capabilities in mathematical reasoning, error detection, and error correction.

The main objectives of the paper are:

1. **Comprehensively evaluating the mathematical reasoning abilities of LLMs**, with a particular focus on their ability to detect and correct errors in the reasoning process.
2. **Identifying the specific strengths and weaknesses of models in handling different types of mathematical challenges**.
3. **Proposing potential directions to enhance the generalization and robustness of LLMs in solving mathematical problems**.

To achieve these goals, the researchers built MWP-MISTAKE so that it contains not only correct problem-solving steps but also incorrect reasoning steps generated through carefully designed rules and smaller language models. Benchmarking multiple state-of-the-art LLMs (such as GPT-4o, GPT-4, and GPT-3.5Turbo) shows that GPT-4o performs best at error detection and correction, while smaller models continue to struggle. The study also identifies data contamination and memorization as issues that affect the reliability of LLMs in real-world applications, which underscores the need for rigorous evaluation of reasoning processes.
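To make the idea of rule-generated incorrect reasoning steps more concrete, here is a minimal Python sketch. It is not the paper's pipeline: the single rule shown (altering one number in one step), the function names `perturb_number`, `make_incorrect`, and `detection_f1`, and the choice of F1 as the detection score are all assumptions made for illustration, since the summary does not specify the actual rules or metrics.

```python
import random
import re

# Matches unsigned integers and decimals inside a reasoning step.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def perturb_number(step: str, rng: random.Random) -> str:
    """Hypothetical rule: change the last number in a step so it becomes wrong."""
    matches = list(NUMBER.finditer(step))
    if not matches:
        return step  # nothing to perturb
    m = matches[-1]
    wrong = round(float(m.group()) + rng.choice([-2, -1, 1, 2]), 6)
    # Keep integer formatting when the original number was an integer.
    text = str(int(wrong)) if "." not in m.group() else str(wrong)
    return step[:m.start()] + text + step[m.end():]


def make_incorrect(steps: list[str], rng: random.Random) -> tuple[list[str], int]:
    """Return a copy of the chain with exactly one step perturbed, plus its index."""
    candidates = [i for i, s in enumerate(steps) if NUMBER.search(s)]
    if not candidates:
        raise ValueError("no step contains a number to perturb")
    idx = rng.choice(candidates)
    bad = list(steps)
    bad[idx] = perturb_number(steps[idx], rng)
    return bad, idx


def detection_f1(labels: list[bool], predictions: list[bool]) -> float:
    """F1 for 'this chain contains a mistake' (True = mistake present)."""
    tp = sum(l and p for l, p in zip(labels, predictions))
    fp = sum((not l) and p for l, p in zip(labels, predictions))
    fn = sum(l and (not p) for l, p in zip(labels, predictions))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    chain = ["Tom has 3 apples and buys 4 more.", "3 + 4 = 7", "So Tom has 7 apples."]
    bad_chain, where = make_incorrect(chain, random.Random(0))
    print(where, bad_chain)
```

In a setup like this, a model under evaluation would be shown both the original and the perturbed chains, asked whether each contains a mistake, and scored with `detection_f1` against the known labels. A correction score would additionally compare the model's repaired final answer with the ground truth.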

Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Reasoning in Large Language Models Through Symbolic Math Word Problems

Investigating the Robustness of LLMs on Math Word Problems

Learning From Mistakes Makes LLM Better Reasoner

Evaluating Mathematical Reasoning Beyond Accuracy

From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems

Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions

Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange

Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems

Benchmarking Large Language Models for Math Reasoning Tasks

VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency

Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

Easy Problems That LLMs Get Wrong

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From A Psychological Perspective