Enhancing Mathematical Reasoning in LLMs by Stepwise Correction

Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, Meng Jiang
2024-10-17
Abstract: Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest-scored one as the final answer to mathematical reasoning problems. However, this repeated independent process often leads to the same mistakes, so the selected solution is still incorrect. We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths. It iterates verification and revision phases that employ a process-supervised verifier. The verify-then-revise process not only improves answer correctness but also reduces token consumption, since fewer paths need to be generated. With StepCo, a series of LLMs demonstrate exceptional performance. Notably, using GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1 across eight datasets, significantly outperforming the state-of-the-art Best-of-N method by +2.4, while reducing token consumption by 77.8%.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses repeated errors made by large language models (LLMs) on mathematical reasoning tasks. Existing Best-of-N decoding methods generate multiple solutions, score each one, and select the highest-scoring solution as the final answer. Because the solutions are sampled independently, they often repeat the same errors, so the selected solution can still be incorrect.

To solve this problem, the authors propose a new method called Stepwise Correction (StepCo). StepCo helps LLMs identify and correct erroneous steps in the generated reasoning paths through an iterative verification and revision process. This improves answer correctness and also reduces the number of tokens required, since fewer paths need to be generated.

### Main Contributions

1. **Stepwise Correction Framework**:
   - StepCo employs an iterative verification and revision process to identify and correct erroneous steps in the reasoning paths generated by LLMs, significantly improving accuracy on mathematical reasoning tasks.
2. **Automatic Process Annotation Method**:
   - An automatic process annotation method is proposed to construct a process supervision dataset, which is used to train a process-supervised verifier (PSV) that identifies erroneous steps in the reasoning paths generated by LLMs.
3. **Extensive Experimental Evaluation**:
   - StepCo is evaluated on eight mathematical reasoning benchmark datasets and extended to open-domain question answering and commonsense reasoning tasks. Experimental results show significant performance improvements for both black-box LLMs and open-source LLMs.

### Method Overview

1. **Initial Answer and Path Generation**:
   - An LLM is first prompted to generate an initial reasoning path consisting of multiple steps.
2. **Stepwise Correction Process**:
   - Iterative verification and revision process:
     - **Verification Phase**: Use the PSV to estimate the correctness probability of each step in the reasoning path, and flag the first step whose probability falls below a preset threshold as potentially erroneous.
     - **Revision Phase**: Retain the steps before the flagged step, provide the flagged step and its probability as feedback, and instruct the LLM to revise the path from that step onward to improve the correctness of the final answer.
3. **Process-Supervised Verifier**:
   - Construct a process supervision dataset using the automatic process annotation method and train the PSV to identify erroneous steps in the reasoning paths generated by LLMs.

### Experimental Results

- **Mathematical Reasoning Tasks**:
  - On eight mathematical reasoning datasets, StepCo significantly outperforms direct generation baselines, correction-based baselines, and sampling-and-selection baselines. With GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1%, which is 2.4 percentage points higher than the best sampling-and-selection method (Best-of-10), while reducing token consumption by 77.8%.
- **Non-Mathematical Reasoning Tasks**:
  - Although trained primarily on mathematical reasoning data, StepCo's PSV also performs well on non-mathematical reasoning tasks, outperforming other baseline methods on datasets such as HotpotQA and CSQA.
- **Different Difficulty Levels**:
  - As problem difficulty increases, the accuracy of all methods decreases, but StepCo maintains high accuracy even on high-difficulty problems, demonstrating its advantage on complex problems.

### Conclusion

StepCo significantly improves accuracy on mathematical reasoning tasks and reduces token consumption by iteratively identifying and correcting erroneous steps in the reasoning paths generated by LLMs. The method excels on mathematical reasoning tasks and also shows broad application potential on non-mathematical reasoning tasks.
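
To illustrate the verify-then-revise loop summarized in the Method Overview, here is a minimal Python sketch. It is not the authors' implementation: the function names, the callable interfaces (`generate`, `verify`, `revise`), the threshold of 0.5, and the iteration cap of 5 are all assumptions made for illustration, and the toy stubs stand in for a real LLM and a trained PSV.

```python
from typing import Callable, List, Optional

Path = List[str]  # a reasoning path is a list of step strings


def first_suspect_step(step_probs: List[float], threshold: float) -> Optional[int]:
    """Return the index of the first step whose probability is below threshold."""
    for i, p in enumerate(step_probs):
        if p < threshold:
            return i
    return None


def stepwise_correction(
    question: str,
    generate: Callable[[str], Path],                    # hypothetical LLM call
    verify: Callable[[str, Path], List[float]],         # hypothetical PSV call
    revise: Callable[[str, Path, str, float], Path],    # hypothetical LLM call
    threshold: float = 0.5,   # assumed value, not taken from the paper
    max_iters: int = 5,       # assumed value, not taken from the paper
) -> Path:
    path = generate(question)
    for _ in range(max_iters):
        probs = verify(question, path)          # verification phase
        idx = first_suspect_step(probs, threshold)
        if idx is None:                         # every step passes: stop early
            break
        # Revision phase: keep the steps before the flagged one and pass the
        # flagged step plus its probability to the reviser as feedback.
        path = revise(question, path[:idx], path[idx], probs[idx])
    return path


# Toy stand-ins so the sketch runs without an LLM or a trained verifier.
def toy_generate(question: str) -> Path:
    return ["3 * 4 = 12", "2 + 12 = 15", "answer: 15"]


def toy_verify(question: str, path: Path) -> List[float]:
    # Pretend the verifier distrusts any step that mentions 15.
    return [0.1 if "15" in step else 0.9 for step in path]


def toy_revise(question: str, kept: Path, bad_step: str, prob: float) -> Path:
    # Pretend the LLM rewrites everything from the flagged step onward.
    return kept + ["2 + 12 = 14", "answer: 14"]


final = stepwise_correction("What is 2 + 3 * 4?", toy_generate, toy_verify, toy_revise)
print(final)
```

In this sketch a single path is repaired in place rather than resampled from scratch, which is the property the summary credits for the lower token consumption compared with Best-of-N.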