Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, Zhifang Sui
2023-12-14
Abstract: In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the weak performance of large language models (LLMs) on complex multi-step mathematical reasoning. Existing LLMs often make an error partway through a multi-step solution, so relying on a single generated output is unreliable. To address this, the paper proposes Math-Shepherd, a process reward model that assigns a reward score to each step of a solution to a math problem.

### Main Contributions:

1. **Automatic Construction of Process Supervision Dataset**: A framework is proposed that constructs a process supervision dataset automatically, without manual annotation.
2. **Verification and Reinforcement Learning**: The effectiveness of Math-Shepherd is evaluated in both the verification and the reinforcement learning scenarios.
3. **Key Factor Analysis**: Experiments analyze the key factors in training a high-performance process reward model, providing directions for future improvements in reasoning capabilities.

### Solution:

- **Process Reward Model (PRM)**: Unlike an Outcome Reward Model (ORM), which scores only the final result, a PRM evaluates the reasoning path step by step and so provides more detailed feedback.
- **Automatic Process Annotation**: Borrowing an idea from Monte Carlo Tree Search (MCTS), the quality of each reasoning step is defined by its potential to lead to the correct answer. Starting from each step, multiple subsequent reasoning paths are completed and their final answers are checked for correctness; the results are used to construct the training dataset automatically.
- **Verification and Reinforcement Learning**: In the verification scenario, Math-Shepherd is used to rerank multiple outputs generated by LLMs; in the reinforcement learning scenario, Math-Shepherd supplies step-level rewards for step-by-step PPO, improving the reasoning accuracy of the LLM.
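
The automatic annotation idea can be illustrated with a minimal sketch. It assumes a hypothetical `complete` callable that stands in for an LLM sampling a continuation from a partial solution and returning its final answer; the function names, the toy problem, and the two label variants (`soft` as the fraction of rollouts that reach the gold answer, `hard` as whether any rollout does) are illustrative assumptions, not the authors' code.

```python
from typing import Callable, Dict, List


def annotate_steps(
    question: str,
    steps: List[str],
    gold_answer: str,
    complete: Callable[[str, List[str]], str],
    num_rollouts: int = 4,
) -> List[Dict[str, object]]:
    """Label each step by how often rollouts continued from it reach gold_answer."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]  # the solution up to and including step i
        # Count rollouts from this prefix whose final answer matches the gold answer.
        hits = sum(
            complete(question, prefix) == gold_answer for _ in range(num_rollouts)
        )
        labels.append(
            {
                "step": steps[i - 1],
                "soft": hits / num_rollouts,  # assumed: fraction of correct rollouts
                "hard": int(hits > 0),        # assumed: 1 if any rollout is correct
            }
        )
    return labels


# Hypothetical deterministic stand-in for an LLM completer: once the faulty
# step is part of the prefix, every continuation ends in the wrong answer.
FAULTY_STEP = "9 * 2 = 20"


def toy_complete(question: str, prefix: List[str]) -> str:
    return "20" if FAULTY_STEP in prefix else "18"


steps = ["16 - 3 - 4 = 9", "9 * 2 = 20", "The answer is 20"]
labels = annotate_steps("toy question", steps, "18", toy_complete)
for label in labels:
    print(label)
```

In this toy run the first step keeps a positive label and every step from the faulty one onward is labeled 0, which is the kind of step-level signal a PRM can be trained on. With a real sampler the rollouts are stochastic, so the number of rollouts trades annotation cost against label noise.
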
### Experimental Results:

- **Verification**: Used as a verifier, Math-Shepherd outperforms self-consistency and ORM on the GSM8K and MATH datasets.
- **Reinforcement Learning**: Step-by-step PPO with Math-Shepherd significantly improves accuracy on GSM8K and MATH; for Mistral-7B, accuracy rises from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH.

### Conclusion:

Math-Shepherd improves the performance of LLMs on complex multi-step mathematical reasoning by training a process reward model on an automatically constructed process supervision dataset. The method removes the reliance on manual annotation and points to automatic process supervision as a direction for further improving the reasoning capabilities of LLMs.
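
The verification (reranking) scenario can be sketched in a few lines. This is a minimal illustration under stated assumptions: `score_step` is a hypothetical callable standing in for a trained PRM, the toy scores are made up, and aggregating step scores with `min` is one plausible choice assumed here rather than taken from the summary above.

```python
from typing import Callable, List, Sequence


def solution_score(step_scores: Sequence[float]) -> float:
    # Assumed aggregation: a solution is only as good as its weakest step.
    # Expects at least one step; min() raises ValueError on an empty sequence.
    return min(step_scores)


def rerank(
    candidates: List[List[str]],
    score_step: Callable[[str], float],
) -> List[List[str]]:
    """Return the candidate solutions sorted from highest to lowest score."""
    return sorted(
        candidates,
        key=lambda steps: solution_score([score_step(s) for s in steps]),
        reverse=True,
    )


# Made-up step scores standing in for PRM outputs on a toy problem.
toy_scores = {
    "16 - 3 - 4 = 9": 0.9,
    "9 * 2 = 18": 0.8,
    "9 * 2 = 20": 0.1,
}

good = ["16 - 3 - 4 = 9", "9 * 2 = 18"]
bad = ["16 - 3 - 4 = 9", "9 * 2 = 20"]

best = rerank([bad, good], lambda s: toy_scores[s])[0]
print(best)
```

Because the score depends on every step, a candidate with one low-scoring step ranks below a candidate whose steps all score well, even though both share the same first step.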