Abstract: In this paper, we present an innovative process-oriented math process reward model called **Math-Shepherd**, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) *Verification*: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) *Reinforcement Learning*: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.

What problem does this paper attempt to address?

The paper addresses the weak performance of large language models (LLMs) on complex multi-step mathematical reasoning tasks. LLMs are prone to errors on problems that require several reasoning steps, and relying on a single generated output is often unreliable. To address this, the paper proposes a process-oriented math process reward model named Math-Shepherd, which assigns a reward score to each step of a solution to a math problem.

### Main Contributions:

1. **Automatic Construction of Process Supervision Dataset**: A framework is proposed that constructs a process supervision dataset automatically, without manual annotation.
2. **Verification and Reinforcement Learning**: The effectiveness of Math-Shepherd is evaluated in both the verification and the reinforcement learning scenarios.
3. **Key Factor Analysis**: Experiments analyze the key factors in training a high-performance process reward model, providing directions for future improvements in reasoning capabilities.

### Solution:

- **Process Reward Model (PRM)**: Unlike an Outcome Reward Model (ORM), which scores only the final result, a PRM evaluates the reasoning path step by step and therefore provides more detailed feedback.
- **Automatic Process Annotation**: Drawing on the idea of Monte Carlo Tree Search (MCTS), the quality of each reasoning step is defined by its potential to lead to the correct answer. From each step, multiple subsequent reasoning paths are completed and their final answers are checked for correctness; the resulting step labels form the automatically constructed training dataset.
- **Verification and Reinforcement Learning**: In the verification scenario, Math-Shepherd is used to rerank multiple outputs generated by LLMs; in the reinforcement learning scenario, Math-Shepherd is used to reinforce LLMs with step-by-step PPO, improving their reasoning accuracy.
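The automatic process annotation described above can be illustrated with a minimal sketch. It is not the authors' implementation: `annotate_steps`, `toy_completer`, and the parameter `n_completions` are hypothetical names introduced here, and the two labels shown (the fraction of completions that reach the correct answer, and a binary flag for whether any completion does) are one plausible way to turn completion outcomes into step labels. In practice the completer would be an LLM that samples a continuation from the given prefix; here it is a deterministic stand-in.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given a question and a prefix of solution steps,
# return the final answer of ONE sampled continuation of that prefix.
Completer = Callable[[str, List[str]], str]


def annotate_steps(
    question: str,
    steps: List[str],
    gold_answer: str,
    completer: Completer,
    n_completions: int = 8,
) -> List[Tuple[float, int]]:
    """Label every step by how often completions from it reach the gold answer.

    Returns one (soft_label, hard_label) pair per step:
      soft_label = fraction of completions whose final answer is correct
      hard_label = 1 if at least one completion is correct, else 0
    """
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]  # the solution up to and including step i
        answers = [completer(question, prefix) for _ in range(n_completions)]
        n_correct = sum(answer == gold_answer for answer in answers)
        soft_label = n_correct / n_completions
        hard_label = 1 if n_correct > 0 else 0
        labels.append((soft_label, hard_label))
    return labels


def toy_completer(question: str, prefix: List[str]) -> str:
    """Deterministic stand-in for an LLM (assumption for this demo only).

    Once the prefix contains the wrong intermediate value "15", every
    continuation ends with the wrong answer; otherwise it ends with "16".
    """
    return "15" if any("15" in step for step in prefix) else "16"


solution = ["3 + 5 = 8", "8 * 2 = 15", "The answer is 15"]
labels = annotate_steps("What is (3 + 5) * 2?", solution, "16", toy_completer, 4)
for step, (soft, hard) in zip(solution, labels):
    print(f"{step!r}: soft={soft:.2f}, hard={hard}")
```

In this toy run the first step keeps the correct answer reachable, while the second step introduces an error from which no completion recovers, so it and every later step receive a zero label. No human marks any step; only the final answers are checked.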
### Experimental Results:

- **Verification**: Math-Shepherd outperforms self-consistency and ORM on the GSM8K and MATH datasets.
- **Reinforcement Learning**: Step-by-step PPO with Math-Shepherd improves the accuracy of Mistral-7B from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH. Adding verification with Math-Shepherd raises accuracy further, to 89.1% on GSM8K and 43.5% on MATH.

### Conclusion:

By automatically constructing a process supervision dataset, Math-Shepherd improves the performance of LLMs on complex multi-step mathematical reasoning tasks. The method reduces the reliance on manual annotation and points to new directions for improving the reasoning capabilities of LLMs.
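The verification scenario, in which a process reward model reranks several candidate solutions, might be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: `StepScorer`, `solution_score`, `rerank`, and `toy_scorer` are hypothetical names, and aggregating step scores by taking their minimum is an assumption made here (a solution is treated as only as reliable as its weakest step); the summary above does not specify the aggregation rule.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given a question and the steps of one candidate
# solution, return one score in [0, 1] per step (higher means more reliable).
StepScorer = Callable[[str, Sequence[str]], List[float]]


def solution_score(step_scores: Sequence[float]) -> float:
    """Aggregate step scores into one solution score.

    Assumption: the minimum is used, so one weak step sinks the solution.
    """
    return min(step_scores)


def rerank(
    question: str,
    candidates: List[List[str]],
    scorer: StepScorer,
) -> List[List[str]]:
    """Return the candidates sorted from highest to lowest solution score."""
    return sorted(
        candidates,
        key=lambda steps: solution_score(scorer(question, steps)),
        reverse=True,
    )


def toy_scorer(question: str, steps: Sequence[str]) -> List[float]:
    """Stand-in for a trained reward model (assumption for this demo only).

    Gives a low score to any step that contains the wrong value "15".
    """
    return [0.1 if "15" in step else 0.9 for step in steps]


good = ["3 + 5 = 8", "8 * 2 = 16", "The answer is 16"]
bad = ["3 + 5 = 8", "8 * 2 = 15", "The answer is 15"]
best = rerank("What is (3 + 5) * 2?", [bad, good], toy_scorer)[0]
print(best[-1])
```

An outcome reward model would fit the same interface by returning a single score per solution; the step-level scores are what let the reranker penalize a solution for one faulty intermediate step.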

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
